Why are Lucene's stored fields so slow to access?

Problem

I have a Lucene index with some large fields (about 50 KB each) and some small fields (about 50 bytes each). I need to iterate over one of the small fields for, say, 1/10 of the documents. This operation is very slow, unreasonably so for such a small field.

Cause

Lucene provides a number of “policies” for how the fields of a document are accessed. (See the interface org.apache.lucene.document.FieldSelector.) They specify when and how fields are loaded from the index. It turns out that the default is to load all fields of a document as soon as a Document is requested from, say, an IndexReader. (See class org.apache.lucene.index.FieldsReader, in particular how it implements the doc(n, FieldSelector) method.) Therefore, when you read a small field, the large fields are loaded as well, which hurts performance when the operation is repeated many times.

Solution

The class org.apache.lucene.document.FieldSelectorResult defines several “policies” that you can use. The most relevant one for this problem is FieldSelectorResult.LAZY_LOAD, which specifies that a field is loaded lazily (i.e., only when its value is first requested).

To use this policy, create a FieldSelector object.

FieldSelector lazyFieldSelector = new FieldSelector() {
    public FieldSelectorResult accept(String fieldName) {
        return FieldSelectorResult.LAZY_LOAD;
    }
};

When you request the document from an IndexReader, pass this object too.

IndexReader reader;
...
// Open the index reader
...
Document doc = reader.document(docId, lazyFieldSelector);

Note that to get the field, you should use the Document’s getFieldable(String) method instead of getField(String), as the API reference advises for lazily loaded fields.

// "doc" is the Document returned by reader.document(...) above
Fieldable fieldable = doc.getFieldable(fieldName);
String value = fieldable.stringValue();
// Use the field value

Solution

Within a document, stored fields are read sequentially. (See Index File Formats.) In theory, accessing the first fields should be faster than accessing the last ones.

Fields are ordered, and their order is stored implicitly in the .fnm file. The order in which the fields are read should be the same as the order in which you created them. To gain performance, create frequently used (and small) stored fields first.

Cause

For some reason, this is still slower than indexing the field and then iterating through all the terms in the field. I looked closer and found another bottleneck.

A Lucene index stores the length of each stored field as a character count, not a byte count, and a character can take more than one byte. As we have seen, Lucene stores and processes the fields sequentially. Even if it does not load a field, it must still scan the whole content of that field to find where the next field begins. If a large field that is not loaded comes before a small field that is loaded, the processing time depends on the lengths of both fields, not just the small one.
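The cost of skipping described above can be illustrated with a small, self-contained sketch. It is a hypothetical model, not Lucene's source code: it assumes a field value is stored as standard UTF-8 (counting code points as characters), and the class and method names are made up for illustration. Because only the character count is known, the reader has to look at the lead byte of every character to find out how wide it is.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical model, not Lucene source code: shows why a value whose length
// is recorded in characters cannot be skipped with a single seek.
class SkipCostSketch {

    // Walks past charCount UTF-8 encoded characters starting at pos and
    // returns how many bytes they occupy. The loop runs once per character.
    static int skipChars(byte[] data, int pos, int charCount) {
        int start = pos;
        for (int i = 0; i < charCount; i++) {
            int lead = data[pos] & 0xFF;   // lead byte determines the width
            if (lead < 0x80) {
                pos += 1;                  // ASCII
            } else if (lead < 0xE0) {
                pos += 2;                  // two-byte sequence
            } else if (lead < 0xF0) {
                pos += 3;                  // three-byte sequence
            } else {
                pos += 4;                  // four-byte sequence
            }
        }
        return pos - start;
    }

    public static void main(String[] args) {
        // 'a' is 1 byte, U+00E9 is 2 bytes, U+20AC is 3 bytes in UTF-8
        byte[] data = "a\u00e9\u20ac".getBytes(StandardCharsets.UTF_8);
        System.out.println(skipChars(data, 0, 3)); // prints 6
    }
}
```

In this model, skipping a field of about 50,000 characters takes about 50,000 loop iterations, whereas a recorded byte length would allow a single position update.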

Solution

The problem does not occur if the field you need to iterate over is placed before the large fields, and if you ask the FieldSelector to stop at the field you want.

Say you want to iterate over only the field “field1”. Create a FieldSelector that loads only field1 and stops at this field. When creating the index, remember to put the large fields after field1.

// Declared final so that the anonymous class below can refer to it
final String fieldToIterate = "field1";
...
FieldSelector lazyFieldSelector = new FieldSelector() {
    public FieldSelectorResult accept(String fieldName) {
        if (fieldName.equals(fieldToIterate))
            return FieldSelectorResult.LOAD_AND_BREAK;
        else
            return FieldSelectorResult.NO_LOAD;
    }
};

The rest of the code should be the same.
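To see why the field order matters together with LOAD_AND_BREAK, here is a rough back-of-the-envelope model in plain Java. It does not use Lucene at all. The field names and sizes (a 50-character "field1" and a 50,000-character "body") are assumptions taken from the sizes mentioned in the problem statement, and the class name is hypothetical. The model simply counts how many characters are walked, whether loaded or skipped, before the wanted field has been read.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model, not Lucene code: counts the characters that are walked
// (loaded or skipped) before the wanted field has been read.
class FieldOrderSketch {

    static long charsProcessed(Map<String, Integer> fieldsInStoredOrder, String wanted) {
        long total = 0;
        for (Map.Entry<String, Integer> field : fieldsInStoredOrder.entrySet()) {
            total += field.getValue();     // skipped fields are still walked
            if (field.getKey().equals(wanted)) {
                break;                     // models LOAD_AND_BREAK
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, standing in for stored order
        Map<String, Integer> smallFirst = new LinkedHashMap<>();
        smallFirst.put("field1", 50);
        smallFirst.put("body", 50_000);

        Map<String, Integer> largeFirst = new LinkedHashMap<>();
        largeFirst.put("body", 50_000);
        largeFirst.put("field1", 50);

        System.out.println(charsProcessed(smallFirst, "field1")); // prints 50
        System.out.println(charsProcessed(largeFirst, "field1")); // prints 50050
    }
}
```

Under these assumptions, placing the small field first reduces the work per document by roughly a factor of a thousand, which is the effect the selector above relies on.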
