lucene, full-text-indexing

Lucene: multiple documents for a single "resource"


My model here consists of online courses. Every course has an id number and a title, and can have a varying number of content files (large HTML files). I tried to represent them in Lucene using the following scheme (every line is a document):

  • course: "1", title: "Introduction to Java"
  • course: "1", content: "Chapter 1: basics..."
  • course: "1", content: "Chapter 2: collections..."
  • course: "2", title: "Java networking"
  • course: "2", content: "First part: sockets..."
  • course: "3", title: ...

But now, suppose I need to ask Lucene to give me all the courses (just the id) with "Java" in the title and "collections" in at least one of its content files. A query such as title:java AND content:collections won't work because the information is split across several documents, so no single document carries both fields.

Can somebody suggest an alternative representation or querying technique to address this problem? Note that I can't just join all the contents into a single file and index it in the same document along with the title, because some chapters are added after the course has been created.

Thanks in advance.


Solution

  • I've not tried it yet, but check out index-time or query-time joins: http://lucene.apache.org/core/4_0_0/join/org/apache/lucene/search/join/package-summary.html

    Here's a presentation on it: http://www.lucenerevolution.org/sites/default/files/grouping-and-joining_0.pdf.
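
    To make the query-time join idea concrete, here is a minimal sketch in plain Java. It does not use the Lucene API; it only models the two passes such a join performs over the documents from the question. The names CourseJoinSketch, Doc, sampleDocs, contains and coursesMatching are hypothetical, and the substring match is an assumption standing in for a real analyzed term query. In Lucene itself, the join package linked above exposes this kind of join through JoinUtil.createJoinQuery, with "course" as the join field on both sides.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java model of a query-time join; no Lucene classes are used here.
class CourseJoinSketch {

    // One "document" as in the question: a course id plus either a title or a content field.
    static final class Doc {
        final String course;
        final String title;   // null for content documents
        final String content; // null for title documents

        Doc(String course, String title, String content) {
            this.course = course;
            this.title = title;
            this.content = content;
        }
    }

    // The sample documents listed in the question (courses 1 and 2 only).
    static List<Doc> sampleDocs() {
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("1", "Introduction to Java", null));
        docs.add(new Doc("1", null, "Chapter 1: basics..."));
        docs.add(new Doc("1", null, "Chapter 2: collections..."));
        docs.add(new Doc("2", "Java networking", null));
        docs.add(new Doc("2", null, "First part: sockets..."));
        return docs;
    }

    // Naive case-insensitive substring match; an assumption standing in for a term query.
    static boolean contains(String field, String term) {
        return field != null
                && field.toLowerCase(Locale.ROOT).contains(term.toLowerCase(Locale.ROOT));
    }

    static List<String> coursesMatching(List<Doc> docs, String titleTerm, String contentTerm) {
        // Pass 1 ("from" side): collect the course ids of content documents that match.
        Set<String> joinKeys = new LinkedHashSet<>();
        for (Doc d : docs) {
            if (contains(d.content, contentTerm)) {
                joinKeys.add(d.course);
            }
        }
        // Pass 2 ("to" side): keep title documents that match and whose course id was collected.
        List<String> result = new ArrayList<>();
        for (Doc d : docs) {
            if (contains(d.title, titleTerm) && joinKeys.contains(d.course)) {
                result.add(d.course);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(coursesMatching(sampleDocs(), "java", "collections")); // prints [1]
    }
}
```

    Because each chapter stays a separate document, a chapter added after the course was created is just one more content document with the same course id, so nothing already indexed has to be rewritten.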