Apache Lucene is the de facto standard open source library for search and search-related tasks such as indexing and querying.
1. How Lucene indexes documents algorithmically
Lucene uses the inverted index data structure to index documents. What sets Lucene's indexing strategy apart is that it creates many small index segments and merges them, rather than maintaining a single monolithic index. Typically, an index segment consists of a dictionary index, a term dictionary and a postings list.
Figure 1 illustrates an example of an index where the merge factor equals three. In the example, there are 14 documents in the collection. Lucene repeatedly creates a segment per document and periodically merges every group of 3 segments into a new one. Lucene keeps merging groups of 3 same-level segments in this way until there are no more segments to merge. After merging, all greyed segments are removed; in total, Lucene merges 5 times before the indexing finishes. This merge-and-delete approach is considerably useful when the document collection does not change frequently. More importantly, a segment, once written, is never modified. Instead, Lucene creates new segments when the document collection changes and later merges segments into new ones, deleting the old segments. This strategy ensures there is no conflict between reading and writing the index. Furthermore, it allows Lucene to avoid complex B-trees for storing segments; instead, all segments are stored in flat files.
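To make the merge pattern concrete, here is a minimal sketch (plain Java, not Lucene code; the class and method names are mine) that simulates the strategy described above: one segment per document, with every run of three equal-sized segments merged into one.

import java.util.ArrayList;
import java.util.List;

public class MergeSimulation {
    static final int MERGE_FACTOR = 3;

    public static void main(String[] args) {
        // one level-0 segment per document; each entry counts the docs it holds
        List<Integer> segments = new ArrayList<>();
        int merges = 0;
        for (int doc = 1; doc <= 14; doc++) {
            segments.add(1);
            // merge whenever MERGE_FACTOR equal-sized segments accumulate
            while (canMerge(segments)) {
                int n = segments.size();
                int merged = segments.remove(n - 1)
                        + segments.remove(n - 2) + segments.remove(n - 3);
                segments.add(merged);
                merges++;
            }
        }
        System.out.println("segments: " + segments + ", merges: " + merges);
    }

    static boolean canMerge(List<Integer> segs) {
        int n = segs.size();
        return n >= MERGE_FACTOR
                && segs.get(n - 1).equals(segs.get(n - 2))
                && segs.get(n - 2).equals(segs.get(n - 3));
    }
}

Running this prints segments: [9, 3, 1, 1], merges: 5, which matches the example: twelve documents end up in merged segments while the last two remain as single-document segments.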
2. Extracting text and indexing
Typically, indexing documents can be divided into two distinct procedures: extracting text and creating the index (Figure 2).
In the extraction procedure, it is common to use a versatile parser that can extract textual content from documents. One of the widely known tools we can use to parse documents is Apache Tika. When parsing completes, we have an input stream that needs to be indexed, which brings us to the second procedure of our indexing process: creating the index.
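For illustration, a minimal sketch of extracting text with Tika (the file path is hypothetical; this assumes Tika is on the classpath):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaExtractExample {
    public static void main(String[] args) throws Exception {
        // AutoDetectParser picks the right parser from the file's MIME type
        AutoDetectParser parser = new AutoDetectParser();
        // -1 removes the default limit on the amount of extracted content
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get("docs/sample.pdf"))) {
            parser.parse(stream, handler, metadata);
        }
        // the extracted plain text, ready to be analysed and indexed
        System.out.println(handler.toString());
    }
}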
More often than not, we do not index everything we retrieve from the input stream. Instead, we apply an analysis process in which we discard unimportant terms from the textual content (e.g., punctuation, stop words). In Lucene, the analysis process is handled by an Analyzer. The framework already provides handy and powerful analysers that we can use without customisation in a majority of problems.
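To see what an analyser actually does, here is a small sketch that runs the built-in StandardAnalyzer over a sentence and prints the surviving tokens (assuming Lucene 4.6, to match the code later in this post):

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerDemo {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
        TokenStream ts = analyzer.tokenStream("content",
                new StringReader("The Quick Brown Fox Jumped!"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            // prints: quick, brown, fox, jumped ("The" and "!" are discarded)
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
    }
}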
After analysing, the textual contents are ready to be indexed.
3. Building an index
In this section, we will look at how we can create a new index and add documents to it.
3.1 Constructing an index writer
In Lucene, the index is created and maintained by IndexWriter. The IndexWriter constructors come in various forms, but most often we simply need to supply:
- Directory d: where the index will be stored
- IndexWriterConfig conf: holds all the configuration that is used to construct an IndexWriter
In the code below, I define an instance of IndexWriterConfig with a specialised Analyzer. The purpose is to analyse documents in a particular way that suits my personal needs (e.g., I want to keep specific words that built-in analysers would certainly remove).
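For reference, a minimal sketch of what such a custom analyser can look like in Lucene 4.6: this one tokenises and lower-cases but deliberately skips stop-word removal, so words a built-in analyser would drop are kept (a real custom analyser may well do more).

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// A sketch of a custom analyser: StandardTokenizer plus lower-casing only,
// with no stop-word filter, so every word in the text is kept.
public class MyCustomAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_46, reader);
        return new TokenStreamComponents(source,
                new LowerCaseFilter(Version.LUCENE_46, source));
    }
}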
In the given code sample, we use FSDirectory. There are alternative implementations, such as SimpleFSDirectory and NIOFSDirectory.
import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {

    private final String indexDir;
    private final String dataDir;
    private Analyzer analyzer;
    private IndexWriter writer;

    /**
     * Creates an Indexer given the directory in which to store the index
     * files and the directory that needs to be indexed.
     */
    public Indexer(String anIndexDir, String aDataDir) {
        // set the place to store indexes
        this.indexDir = anIndexDir;
        // set the directory that has to be indexed
        this.dataDir = aDataDir;
        try {
            // create the Directory to store indexes; FSDirectory.open
            // automatically picks the most suitable implementation
            Directory dir = FSDirectory.open(new File(indexDir));
            // create an instance of my own custom analyser (see above)
            analyzer = new MyCustomAnalyzer();
            // create the configuration for the new IndexWriter
            IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_46,
                    analyzer);
            // create the IndexWriter, which can create a new index and
            // add, remove, or update documents in it; however, it cannot
            // read or search the index
            this.writer = new IndexWriter(dir, conf);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
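Constructing the indexer is then a one-liner (the paths here are hypothetical):

Indexer indexer = new Indexer("/tmp/my-index", "/data/docs-to-index");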
3.2 Modelling a document
In the framework, textual content is modelled using the class Document. This class models a document instance with numerous attributes, where every attribute is in turn defined using the class Field.
In the sample code below, I define a Document instance with a number of attributes:
- Document ID: because in practice we may have thousands of documents, I use base64 encoding of the file path to generate unique IDs for documents. Notice that we supply the StringField constructor with Field.Store.YES, which means the value of this attribute will be stored in the index files. Note also that a StringField is indexed as a single token; its value is not run through the analyser.
- File path: will be stored as well.
- Content field: as you may expect, we need to index the content field for full-text search. However, we may or may not store it in the index files; it is a matter of choice. In this particular case, I choose to store the field.
Document doc = new Document();
// document ID: generated by base64-encoding the file path
String id = BaseEncoding.base64().encode(
        f.getCanonicalPath().getBytes(Charsets.US_ASCII));
doc.add(new StringField(MyDocumentIndexedProperties.ID_FIELD, id,
        Field.Store.YES));
// file path: stored, and indexed as a single token
doc.add(new StringField(MyDocumentIndexedProperties.FILE_PATH_FIELD,
        f.getCanonicalPath(), Field.Store.YES));
// content field: analyse the extracted text (handler is the Tika content
// handler from the extraction step) and both index and store it;
// TYPE_STORED is a FieldType that is indexed, tokenized and stored
// (e.g. TextField.TYPE_STORED)
TokenStream ts = analyzer.tokenStream(
        MyDocumentIndexedProperties.CONTENT_FIELD,
        new StringReader(handler.toString()));
Field contentField = new Field(
        MyDocumentIndexedProperties.CONTENT_FIELD,
        AnalyzerUtils.tokenStreamToString(ts), TYPE_STORED);
ts.close();
doc.add(contentField);
3.3 Indexing a document
While Lucene consists of many components, it abstracts these complex structures behind extremely simple and intuitive methods; it is fascinating to see how simple performing the indexing itself is.
Document doc = new Document();
// model the document in the same manner as in section 3.2
// ...
// ...
// add the document to the index
writer.addDocument(doc);
// close the index when finished, which also commits
// the pending changes to the index
writer.close();
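Putting the pieces together, an index() method on the Indexer class from section 3.1 might look like the sketch below. It walks the data directory, extracts text with Tika, models each file as a Document as in section 3.2, and adds it to the writer (the Tika imports are as in the earlier snippet; error handling is kept minimal).

// A sketch of the overall indexing loop, as a method of the Indexer class.
// Assumes AutoDetectParser, BodyContentHandler, Metadata and
// java.io.FileInputStream are imported as in the earlier snippets.
public void index() throws Exception {
    File[] files = new File(dataDir).listFiles();
    if (files == null) {
        return;
    }
    for (File f : files) {
        if (!f.isFile()) {
            continue;
        }
        // extract the textual content (section 2)
        BodyContentHandler handler = new BodyContentHandler(-1);
        try (FileInputStream in = new FileInputStream(f)) {
            new AutoDetectParser().parse(in, handler, new Metadata());
        }
        // model the document (section 3.2) and add it to the index
        Document doc = new Document();
        // ... add the ID, file path and content fields exactly as above ...
        writer.addDocument(doc);
    }
    // commit and close (section 3.3)
    writer.close();
}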
4. Full-text search
By now we have an index that can be searched over. To initiate a search, we input a search query, which can be a keyword or even a wildcard. In Lucene, the search query is first handled by QueryParser, which parses the query into a meaningful structure called a Query. After that, the IndexSearcher consumes the Query instance in order to look up matches in the index.
// searcher
IndexSearcher searcher = new IndexSearcher(this.indexReader);
// query parser over the content field
QueryParser parser = new QueryParser(Version.LUCENE_46, "content",
        new SimpleAnalyzer(Version.LUCENE_46));
// parse the query
Query q = parser.parse("....");
// look up and return the top 10 relevant documents
TopDocs docs = searcher.search(q, 10);
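The snippet above assumes this.indexReader has already been opened. For completeness, a sketch of opening it over the index directory from section 3.1 and iterating over the hits (DirectoryReader and ScoreDoc imports assumed; the field name comes from section 3.2):

// open a reader over the same directory the index was written to
IndexReader indexReader = DirectoryReader.open(
        FSDirectory.open(new File(indexDir)));
// iterate over the returned hits and print the stored file paths
for (ScoreDoc sd : docs.scoreDocs) {
    Document d = searcher.doc(sd.doc);
    System.out.println(sd.score + "\t"
            + d.get(MyDocumentIndexedProperties.FILE_PATH_FIELD));
}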
5. Bottom line
So far we have seen a “hello world” example of using Lucene to perform full-text search. In a real project, we can expect a number of situations where we need to tweak Lucene so that it can serve our search purposes at a higher level. An example scenario is when we need to support a “more like this” search, where the system allows the user to search for textual contents that are similar to the keywords. Theoretically, we can measure this similarity using distance measures such as the Euclidean, cosine, or Manhattan distance. Another scenario is building a system that supports synonym search, where we retrieve not only documents that contain the search keywords but also documents that contain synonyms of those keywords.
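As a taste of what such a measure involves, a minimal sketch of cosine similarity over two term-frequency vectors (plain Java, not a Lucene API):

// Cosine similarity between two term-frequency vectors of equal length:
// 1.0 means the documents use terms in identical proportions,
// 0.0 means they share no terms at all.
static double cosineSimilarity(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return (normA == 0 || normB == 0)
            ? 0.0
            : dot / (Math.sqrt(normA) * Math.sqrt(normB));
}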
As we have seen, in spite of this simplicity, Lucene provides a tremendous amount of support for building a powerful search tool. In a future post, I may discuss how we can combine Apache Lucene and Apache Mahout (a machine learning framework) in order to index a corpus and use that index to perform clustering. Stay tuned!