
Free/open source information retrieval libraries

What are they and why use one

Information retrieval libraries are software libraries that provide functionality for searching within databases and the documents within them. In particular, this often means searching text documents for combinations of words and/or phrases.

Database software such as MySQL is very good at storing and managing data, but much of it is poor at searching text. One can bundle an information retrieval library with database software to add good search capability to a database. Take websites as an example: the main database handles insert/update/delete/view operations on articles, while the search library powers the in-site search engine.

The choices

Sphinx

Sphinx is relatively new and written by a single programmer (Andrew Aksyonoff) in Russia. Features are added in an ad-hoc manner and are limited, but it’s very fast (probably faster than anything else out there) for both indexing and searching. For the majority of document-retrieval systems (e.g. websites, knowledge bases), Sphinx probably has everything they need.

How it runs

Sphinx is written in C++. Other programs can link against Sphinx and use its functions. Alternatively, Sphinx provides a search daemon (searchd). Programs can talk to it and query against its indices using socket communication.

There are APIs for PHP (standard), Ruby (see UltraSphinx) and Java (standard). The APIs provide functions in these languages that, when called, serialize the queries and send them to searchd (then receive the answers and unserialize the results).

There is a plug-in (storage engine to be exact) for MySQL 5.5 called SphinxSE. You can query against a Sphinx index just like querying a MySQL database.

There is no plug-in architecture within Sphinx yet. If you want to extend its functionality, you have to modify the source and re-compile.

Supported queries

  1. Word and phrase search
  2. Boolean operations (AND, OR, NOT), although nesting is not supported
  3. Word-word proximity search (i.e. a number of words within a distance of each other)
  4. Multi-field query
  5. Grouping of records by date on fields that are defined as UNIX timestamps (like SQL’s GROUP BY)
  6. Sorting by fields, ascending or descending, single or multi-field
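The word, phrase, boolean and proximity queries above can all be expressed in Sphinx’s “extended” match-mode syntax. The helper functions below are our own illustration (the syntax strings follow the Sphinx documentation; the function names are invented for this sketch):

```python
# Tiny helpers that emit Sphinx extended match-mode query strings.
# The syntax ("...", "..."~N, |, -, @field) is Sphinx's; the helpers are ours.

def phrase(words):
    """Exact phrase search, e.g. "hello world"."""
    return '"%s"' % words

def proximity(words, distance):
    """Words within `distance` positions of each other, e.g. "hello world"~10."""
    return '"%s"~%d' % (words, distance)

def any_of(*terms):
    """Boolean OR (flat -- Sphinx does not support nesting), e.g. hello | world."""
    return " | ".join(terms)

def exclude(term, banned):
    """Boolean NOT, e.g. hello -world."""
    return "%s -%s" % (term, banned)

def in_field(field, words):
    """Restrict matching to one field, e.g. @title hello."""
    return "@%s %s" % (field, words)
```

The resulting strings would be passed to searchd through one of the language APIs.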

Data storage

Sphinx stores integers (including UNIX timestamps, which Sphinx has special support for). Although Sphinx indexes strings, tokenized or not, it does not store them, so the documents themselves have to be stored in MySQL. The workflow for searching looks like this:

  1. Translate a system query into a Sphinx query.
  2. Give Sphinx the query and get back a list of document IDs.
  3. Ask MySQL for the document records with the given document IDs.
  4. Process the document records.

Sphinx supports updating the index but, according to the documentation, doing so is no faster than rebuilding the index. An alternative is a “main+delta” scheme, as described in the documentation.
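One way to picture the “main+delta” idea (our own toy illustration, not Sphinx code): recent documents go into a small delta index that is cheap to rebuild often, the big main index is rebuilt rarely, and a search queries both and merges the IDs, preferring the fresher delta hits.

```python
# Toy sketch of main+delta searching: query both indexes, merge the IDs.

def search_main_plus_delta(query, search_main, search_delta):
    seen, merged = set(), []
    for doc_id in search_delta(query) + search_main(query):
        if doc_id not in seen:        # delta wins on duplicate IDs
            seen.add(doc_id)
            merged.append(doc_id)
    return merged

# stand-ins for the two index searches
main_hits = lambda q: [1, 2, 3]       # big index, rebuilt rarely
delta_hits = lambda q: [3, 4]         # small index, rebuilt frequently

results = search_main_plus_delta("anything", main_hits, delta_hits)
```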

Performance

Both indexing and searching are very fast. In our case, Sphinx indexes 40000 documents (1 GB) in about 3 minutes. Queries are executed on the order of 10 ms each (although some are in the 500 ms range). Queries seem to be cached, and repeated queries are executed even faster.

Resource usage

Sphinx is easy on system resources. You can specify how much memory the indexer uses; the default (32 MB) satisfies our needs. The search daemon uses about 10 MB of memory. CPU usage is also very low.

Lucene

Lucene is older and considered more mature. It is used in many big projects such as Wikipedia. It might take more time to fine-tune (partly because there are so many ways to do things). If you need the extra functionality, Lucene is worth trying.

Lucene In Action is a must-read if you want to use this library.

How it runs

Lucene is embedded inside the programs that use it (think Berkeley DB).

There are ports of Lucene to PHP, Ruby, C, C++, C#, Delphi and others. See Lucene’s website for details. Ports usually lag behind the main Lucene project. To use the main project from other programming languages, see PHP/Java Bridge (which has been renamed to VMBridge and is adding support for more languages).

For a ready-to-use search-daemon-like service, see Solr (http://lucene.apache.org/solr/), which is an “open source enterprise search server based on the Lucene search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface.”

Lucene does not do specific processing such as keyword-highlighting or web-crawling. Related projects (e.g. Solr, Nutch) fill in these gaps.

Supported queries

Lucene provides the “building blocks” for a search system. In other words, you build the system yourself, but you can build it any way you want. For example, the primitive query types are sufficient to build almost any kind of query you want.

  1. Word and phrase search
  2. Boolean operations (AND, OR, NOT) with or without nesting
  3. Span queries (a span is a [document, starting location, ending location] triple) that can build proximity search and more
  4. Multi-field query
  5. Grouping is not built-in, but you can build it yourself
  6. Sorting by fields or using your own function
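To make the “building blocks” idea concrete, here is a toy model of how primitive queries compose — our own illustration, not the Lucene API. Each primitive resolves to a set of matching document IDs, and boolean combinations nest arbitrarily, unlike in Sphinx:

```python
# Toy model of Lucene-style composable queries over an inverted index.

INDEX = {                       # inverted index: term -> matching doc IDs
    "fast": {1}, "search": {1, 3}, "mysql": {2}, "mature": {3},
}
ALL_DOCS = {1, 2, 3}

def term(word):                 # cf. a term query for a single word
    return INDEX.get(word, set())

def all_of(*clauses):           # boolean AND, takes any queries, nestable
    result = ALL_DOCS
    for clause in clauses:
        result = result & clause
    return result

def any_of(*clauses):           # boolean OR, nestable
    result = set()
    for clause in clauses:
        result = result | clause
    return result

def none_of(clause):            # boolean NOT
    return ALL_DOCS - clause

# nested query: (fast OR mysql) AND NOT mature
hits = all_of(any_of(term("fast"), term("mysql")), none_of(term("mature")))
```

Real Lucene queries also carry scoring information; this sketch keeps only the matching logic, but the composition pattern is the point.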

The biggest gain of building things at this low level is flexibility (you can build whatever you like) plus access to information about the documents and the search process that other “ready-to-run” libraries don’t give you. For example:

  1. In addition to TermDocs ([term, document] pairs grouped by term), you have access to TermPositions ([term, document, position] triples grouped by term) and term vectors ([document, term, position] triples grouped by document). Such information is very useful for doing statistical work fast (from word counts to something more complex).
  2. You can trace how Lucene selects and scores documents (using the Explanation class).
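The two per-term and per-document views of position data can be built from the same stream of (term, document, position) triples. A toy sketch (our own illustration of the data shapes, not Lucene code):

```python
# Build both views named above from a toy corpus:
#   term_positions: term -> doc -> [positions]   (grouped by term)
#   term_vectors:   doc -> term -> [positions]   (grouped by document)
from collections import defaultdict

docs = {1: "to be or not to be", 2: "be quick"}

term_positions = defaultdict(lambda: defaultdict(list))
term_vectors = defaultdict(lambda: defaultdict(list))
for doc_id, text in docs.items():
    for pos, word in enumerate(text.split()):
        term_positions[word][doc_id].append(pos)
        term_vectors[doc_id][word].append(pos)

# fast statistics fall straight out of the structures, e.g. a word count:
count_be_in_doc1 = len(term_positions["be"][1])   # occurrences of "be" in doc 1
```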

Data storage

Lucene supports storing and/or indexing and/or tokenizing text. Note that it treats everything as strings.

In theory, it could replace the database entirely for many kinds of applications, although we have not measured how well it would perform in that role.

Performance

Indexing is moderately fast (about 8 minutes for the same 40000-document database). Query time varies greatly, from 10 ms to about 500 ms. Repeated queries also take less time.

Resource usage

Lucene is much more resource-intensive than Sphinx, especially on the memory side. You will have to measure the difference yourself. Our Lucene service (Lucene loaded via the PHP/Java Bridge) occupies about 100 MB of memory at all times, though we haven’t tested how much of that memory is actually used. Java VMs usually have long startup lags, so running a long-lived “service” is much better than starting a new Java VM every time. However, watch out for memory leaks if you do that, and be prepared to spend some time tuning GC behavior too.