java lucene apache-commons-digester

Commons Digester: How to build complex, XML-based queries with Apache Lucene?


I need to build an XML-based query with Apache Lucene and Commons Digester.

My docs have this format:

<doc>
<id>361492799</id>
<title>Dan1</title>
<description>We had another Flickr meetup in Rochester, the biggest that Ive been to. 12 people showed up.Da, he was to the right.</description>
<time>18934934</time>
<tags>flickrmeetup rochester dan totheright 200701</tags>
<geo><latitude>324234</latitude><longitude>28342349</longitude></geo>
<event>135961</event>
</doc>
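
For illustration, a document in this format could be read into a plain Java object roughly as follows. This is only a sketch: it uses the JDK's built-in DOM parser as a stand-in for Commons Digester, and the class name `PhotoDoc` and its field names are hypothetical, chosen to mirror the element names above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical holder for one <doc>; field names mirror the XML elements.
class PhotoDoc {
    String id;
    String title;
    String description;
    String tags;
    String event;
    long time;
    double latitude;
    double longitude;

    // Parses a single <doc>...</doc> string with the JDK DOM parser
    // (a stand-in for Commons Digester, not the Digester API itself).
    static PhotoDoc parse(String xml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element root = dom.getDocumentElement();
            PhotoDoc d = new PhotoDoc();
            d.id = text(root, "id");
            d.title = text(root, "title");
            d.description = text(root, "description");
            d.tags = text(root, "tags");
            d.event = text(root, "event");
            d.time = Long.parseLong(text(root, "time"));
            // getElementsByTagName searches all descendants, so the values
            // nested inside <geo> are found as well.
            d.latitude = Double.parseDouble(text(root, "latitude"));
            d.longitude = Double.parseDouble(text(root, "longitude"));
            return d;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document", e);
        }
    }

    // Returns the trimmed text of the first element with the given tag name.
    private static String text(Element root, String tag) {
        return root.getElementsByTagName(tag).item(0).getTextContent().trim();
    }
}
```

The same object can then represent both the stored documents and the query document.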

The query is itself a document that I need to compare against the entire collection. Each field has its own similarity metric: "description" uses tf-idf cosine similarity, "time" uses the plain difference between the two values, and "latitude" and "longitude" are compared together using the haversine distance.
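
To make the two non-textual metrics concrete, here is a minimal sketch in plain Java. The class and method names are hypothetical, the haversine function assumes coordinates in decimal degrees (the sample values above may use a different encoding and would need converting first), and the Earth radius of 6371 km is the usual mean-radius approximation.

```java
// Hypothetical helper for the per-field metrics described above.
class FieldSimilarity {
    // Mean Earth radius in kilometres (an approximation).
    static final double EARTH_RADIUS_KM = 6371.0;

    // "time": absolute difference between the two timestamps.
    static long timeDifference(long a, long b) {
        return Math.abs(a - b);
    }

    // "latitude" + "longitude": great-circle distance in kilometres,
    // assuming all four arguments are decimal degrees.
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double sinLat = Math.sin(dLat / 2);
        double sinLon = Math.sin(dLon / 2);
        double a = sinLat * sinLat
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * sinLon * sinLon;
        // Clamp to 1.0 to guard against rounding error before asin.
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(Math.min(1.0, a)));
    }
}
```

Both functions return a distance, so smaller values mean more similar documents, the opposite of a cosine similarity score.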

So far I have only run searches with simple textual queries such as "word1 word2". How can I build more complex queries?

Thanks


Solution

  • I need to build an XML-based query with Apache Lucene and Commons Digester.

    This article should help you get started

    For analysing content from XML, take a look at Apache Tika:

    Apache Tika - a content analysis toolkit

    Apache Tika™ is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.