Free/open source information retrieval libraries

What are they and why using one

Information retrieval libraries are software libraries that provides functionality for searching within databases and documents within them. In particular, this often refers to searching text document for combinations of words and/or phrases.

Database software such as MySQL are very good at storing and managing data but many of them are poor at searching text. One can bundle an information retrieval library with database software to add good searching capability to databases. Take websites for an example. The main database can handle insert/update/delete/view operations of articles and the search library can handle the in-site search engine.

The choices

MySQL

Although mainly a database server, MySQL actually has built-in full-text search functions.

How it runs

Full-text search is different from “LIKE” and “RLIKE” operations. For the former, MySQL breaks down the text into words, removes stopwords and operates at the word level. The columns can be full-text searched by MATCH() … AGAINST syntax. See MySQL’s doc for details.

Supported queries

Word and phrase search
Word prefix search (word*)
Simple boosting (marking a word/phrase as having an increased/decreased contribution)
Boolean operations (AND, OR, NOT)
Default or customizable stopword list
Sorting (scores are returned as floating-point numbers so the result can be sorted by that “field” using ORDER BY)

Data storage

For MyISAM tables, full-text indices can be created for CHAR, VARCHAR or TEXT columns. A column can be full-text indexed by creating an index of it with the type FULLTEXT. MySQL creates, updates and stores the full-text indices automatically in the .MYI file along with other indices.

Performance

Indexing is fast (~4 minutes for about 30000 documents). However, searching is slow, as each query takes seconds (from less than 1 second up to about 35 seconds in our test).

Resource usage

Full-text search is part of MySQL server so mearsuring its resource usage is hard. We haven’t noticed increased memory usage when full-text queries are used though.

Sphinx

Sphinx is relatively new and written by a single programmer (Andrew Aksyonoff). Features are added in an ad-hoc matter and are limited. But it’s very fast (probably faster than anything else out there) for both indexing and searching. For a majority of document-retrieval systems (e.g. websites, knowledge bases), Sphinx probably has everything they need.

How it runs

Sphinx is written in C++. Other programs can link against Sphinx and use its functions. Alternatively, Sphinx provides a search daemon (searchd). Programs can talk to it and query against its indices using socket communication.

There are APIs for PHP (standard), Ruby (see UltraSphinx) and Java (standard). The APIs provide functions in these languages that when called, serialize and send the queries to searchd (and then receives answers and unserialize the result).

There is a plug-in (storage engine to be exact) for MySQL 5 called SphinxSE. You can query against a Sphinx index just like querying a MySQL database.

There is no plug-in architecture within Sphinx yet. If you want to extending its functionality, you have to modify the source and re-compile.

Supported queries

Word and phrase search
Boolean operations (AND, OR, NOT), although nesting is not supported
Word-word proximity search (i.e. a number of words within a distance of each other)
Multi-field query
Grouping of records by date on fields that are defined as UNIX timestamps (like SQL’s GROUP BY)
Sorting by fields, ascending or descending, single or multi-field

Data storage

Sphinx stores integers (including UNIX timestamps, which Sphinx has special support for). Although Sphinx indexes strings, tokenized or not, it does not store them. Therefore, the documents have to be stored in MySQL. The work flow for searching is like this:

Translate a system query into a Sphinx query.
Give Sphinx the query and get back a list of document IDs.
Ask MySQL for the document records with the given document IDs.
Process the document records.

Sphinx supports updating the index, but according to the doc, it’s not faster than rebuilding the index. An alternative is to use a “main+delta” scheme, as described in the doc.

Performance

Both indexing and searching are very fast. In our case, it indexes 40000 documents (1 GB) in about 3 minutes. Queries are executed in the order of 10 ms each (although some are in the 500ms range). Queries seem to be cached and repeated queries are executed even faster.

Resource usage

Sphinx is easy on system resource. You can specify how much memory the indexer uses. The default (32 MB) satisfies our need. The search daemon uses about 10 MB of memory. CPU usage is also very low.

Lucene

Lucene is older and considered more mature. It is used in many big projects such as Wikipedia. It might take more time to fine-tune (partly because there are so many ways to do things). If you need the extra functionality, Lucene is worth trying.

Lucene In Action is a must-read if you want to use this library.

How it runs

Lucene is embedded inside the programs that use it (think Berkeley DB).

There are ports of Lucene to PHP, Ruby, C, C++, C#, Delphi and others. See Lucene’s website for details. Ports usually lag behind the main Lucene project. To use the main project with other programming languages, refer to PHP/Java Bridge (which has been renamed to VMBridge and is adding support for other languages).

For ready-to-use search-daemon-like service, refer to "Solr:http://lucene.apache.org/solr/ , which is “open source enterprise search server based on the Lucene search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface.”

Lucene does not do specific processing such as keyword-highlighting or web-crawling. Related projects (e.g. Solr, Nutch) fill in these gaps.

Supported queries

Lucene provides the “building-blocks” for a search system. In other words, you build the system yourself, but you can build it any way you want. For example, the several primitive types of queries are sufficient to build about any kind of query you want.

Word and phrase search
Boolean operations (AND, OR, NOT) with or without nesting
Span queries (a span is a [document, starting location, ending location] tuple) that can build proximity search and more
Multi-field query
Grouping is not build-in but you can build it yourself
Sorting by scores, by field values or using your own function

The biggest gain of building thing at this low level is flexibilty (of building whatever you like) and the additional information about the documents and the searh process that other “ready-to-run” libraries don’t give you. For example:

In addition to TermDocs ([term, document] tuples grouped by term), you have access to TermPositions ([term, document, position] tuples grouped by term) and term vectors (document, term, position] tuples grouped by document). Such information is very useful for doing statistics things fast (from word count to something more complex).
You can control how Lucene selects (using Query and Filter classes) and scores (using Scorer and Similarity classes) the documents.
You can trace how Lucene selects and scores the documents (using the Explanation class).

Data storage

Lucene supports storing and/or indexing and/or tokenizing text. Note that it treats everything as strings.

In theory, it can replace the database and still perform well on many kinds of applications. Not sure how it would perform though.

Performance

Indexing is moderately fast (about 8 minutes for the same 40000-document database) Query time varies greatly from 10ms to about 500ms. Repeated queries also take less time.

Resource usage

Lucene is more resource-intensive than Sphinx, especially on the memory side. You will have to measure the difference youself. Our Lucene service (Lucene loaded by PHP/Java bridge) process is about 100 MB all the time, though we haven’t tested how much of that memory is actually used. Java VMs usually have large startup-lags, so starting a “service” is much better than starting a new Java VM every time. However, watch for “memory leak” problem if you are going to do that. Prepared to spend some time tuning GC behavior too.

How do I update root certificates in Apache/PHP/cURL environment

What are the differences between addslashes(), mysql_escape_string() and mysql_real_escape_string()

PHP and ODBC

Speed of unpack() in PHP

Configuring PEAR on Windows

How do I use cURL in PHP on Windows?

Passing command-line arguments into PHP

Using SSL socket in PHP under Windows

PHP Resources

PHP error reporting

PHPXref vs PHPDocumentor

Create a PHP unit test case using SimpleTest

PHP Commenting Style

phpMyAdmin Security

PHP

How can I make phpMyAdmin avoid sending MySQL passwords in the clear?

PHP ODBC Setup Guide

Performance of array_shift and array_pop in PHP

Javascript

Numeric Validation JavaScript

The advantages of Javascript

jQuery Tutorial

Which Javascript framework should I use?

jQuery and JavaScript Coding: Examples and Best Practices

Java

XML and Java

Converting Java content into AJAX (Javascript and XML)

Create a Java class that is only comparable to itself

Removing old Java versions

Java Server Faces

MySQL Resources

PostgreSQL Resources

Why doesn't mysqlshow work for databases or tables with underscores in their names?

What is mysqlshow good for?

How can I search/replace strings in MySQL?

Microsoft Access, OpenOffice and MySQL

SQL joins

Get rid of default annoyances in MySQL Workbench

Who uses PostgreSQL at UCLA?

Why NoSQL Matters

Subversion

Revision Control

Revision Control Systems Compared

Installing Subversion on Windows

GIT info

What are some document management services/document version control applications out there?

svn: Working copy '<filename>' is missing or not locked

Learning about CSS

What sort of menus can I make with CSS?

Top Ten Web Design Mistakes of 2005

The importance of "!important" in CSS

CSS Design Concerns for IE6, IE7, and Firefox

Forcing a page break with CSS

What's a solid starting point (global reset) for a CSS file?

UX Team ( UCLA Library - Digital Initiatives & Information Technology )

Hi, are there any UCLA style resources or style guides for websites?

UX Resources

What to do when CSS stylesheets refuse to apply

Web Accessibility Resources

Sass versus LESS

Introduction to XML in Flash - Making Flash Dynamic

XML Resources

XML

Why is it important to use short names in Plone?

Plone CMS Resources

Plone 4 Tips and Tricks: Table of Contents

How do I identify the stylesheets in Plone?

How to get rid of icons in Plone

Importing and exporting a Plone site

Installing Plone v3.2 on Mac OS X 10.5

Remove highlighting of search terms in Plone

Is there a permission that allows a user edit content that s/he does not own in Plone?

Why can't I add a photo using AT Photo in Plone?

Shibboleth For Plone

How do I get started with designing new/existing layouts in Plone?

Backing up and packing Plone's database file (Data.fs)

Zope/Plone usage statistics

Should I use plonecustom.css when changing the layout for my Plone site

Changing number of displayed news/events in Plone portlets

Search across multiple Plone instances