Lucene calculate term vectors for existing index


With Lucene.NET, I would like to get the term vectors as described in this Stack Overflow question.

The problem is that the index has already been generated with the field indexed and stored, but without term vectors.

// Lucene.NET 4.8 exposes these settings as properties rather than Java-style setters.
FieldType type = new FieldType();
type.IsIndexed = true;
type.IsStored = true;
type.StoreTermVectors = false;

Theoretically, it should be possible to re-calculate the term vectors for each document and then store them in the index.

Do you know how this could be done without deleting the complete Lucene index?


Solution

  • As mentioned in my comments in the question, you can generate term vector data on-the-fly, which may help you to avoid a complete rebuild of your indexed data.

    In my scenario, I want to find the offset positions of my search term in the matched document.

    I don't want to oversell this approach - it's absolutely not a substitute for re-indexing - but if your queries are basic, it may help.


    Step 1: Perform whatever query you are currently performing.

    For each document in the list of hits, you will then need to re-process the relevant field from that document - so, either you already have the field data stored in your existing index, or you will need to retrieve it from its original source.
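
    A minimal sketch of step 1, assuming the field was stored in the existing index (as it was in the question). The class name StoredFieldReader, the method name GetStoredFieldValues and its parameters are hypothetical names chosen for illustration; only the Lucene.NET calls (IndexSearcher.Search, IndexSearcher.Doc, Document.Get) are library API.

    ```csharp
    using System.Collections.Generic;
    using Lucene.Net.Search;

    public static class StoredFieldReader
    {
        // Runs the query and returns the stored text of one field for each hit.
        // Hits whose field was not stored are skipped.
        public static List<string> GetStoredFieldValues(
            IndexSearcher searcher, Query query, string fieldName, int maxHits)
        {
            var values = new List<string>();
            TopDocs hits = searcher.Search(query, maxHits);

            foreach (ScoreDoc hit in hits.ScoreDocs)
            {
                // Document.Get returns null when the field has no stored value.
                string? value = searcher.Doc(hit.Doc).Get(fieldName);
                if (value != null)
                {
                    values.Add(value);
                }
            }

            return values;
        }
    }
    ```

    Each returned string can then be fed to the analyzer in step 2. If the field was not stored, the text would have to come from its original source instead.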


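    Step 2: For each such field, you can re-use the same analyzer that built the index to create a token stream on-the-fly. You can add different attributes to the token stream, such as:

    • term attributes (ICharTermAttribute), which expose the text of each token
    • offset attributes (IOffsetAttribute), which expose the start and end character offsets of each token
    • and others (see here)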

    Example:

    using System;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Analysis.TokenAttributes;
    using Lucene.Net.Util;
    
    const LuceneVersion AppLuceneVersion = LuceneVersion.LUCENE_48;
    
    String? fieldName = null;
    String fieldContent = "Foo Bar Baz Bar Bat";
    String searchTerm = "bar";
    
    var analyzer = new StandardAnalyzer(AppLuceneVersion);
    
    // Disposing the token stream releases its resources once we are done with it.
    using (var ts = analyzer.GetTokenStream(fieldName, fieldContent))
    {
        var charTermAttr = ts.AddAttribute<ICharTermAttribute>();
        var offsetAttr = ts.AddAttribute<IOffsetAttribute>();
    
        ts.Reset(); // must be called before the first IncrementToken()
        Console.WriteLine("Token: " + searchTerm);
        while (ts.IncrementToken())
        {
            // StandardAnalyzer lowercases tokens, so searchTerm must be lowercase too.
            if (searchTerm.Equals(charTermAttr.ToString()))
            {
                Console.WriteLine($"  > offset: {offsetAttr.StartOffset}-{offsetAttr.EndOffset}");
            }
        }
        ts.End(); // signals that the end of the stream was reached
    }
    

    The above example assumes one of the hits from step 1 was a field containing "Foo Bar Baz Bar Bat" - with a search term of bar.

    The output generated is:

    Token: bar
      > offset: 4-7
      > offset: 12-15
    

    So, as you can see, you are not re-executing a query - you are just re-processing a token stream. The more complex the original search term is, the harder it will be to make this approach work the way you probably need it to.
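
    The example above compares tokens against a single, already-lowercased term. One possible way to go a little further, sketched here under the assumption that the search text is just a few plain words, is to run the search text through the same analyzer and collect the offsets for each resulting term. The names TermOffsetFinder, Analyze and FindOffsets are hypothetical. This sketch does not handle phrases, wildcards or any other query syntax.

    ```csharp
    using System.Collections.Generic;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.TokenAttributes;

    public static class TermOffsetFinder
    {
        // Analyzes the text and returns every token with its character offsets.
        private static List<(string Term, int Start, int End)> Analyze(
            Analyzer analyzer, string fieldName, string text)
        {
            var tokens = new List<(string Term, int Start, int End)>();
            using (TokenStream ts = analyzer.GetTokenStream(fieldName, text))
            {
                var termAttr = ts.AddAttribute<ICharTermAttribute>();
                var offsetAttr = ts.AddAttribute<IOffsetAttribute>();
                ts.Reset();
                while (ts.IncrementToken())
                {
                    tokens.Add((termAttr.ToString(), offsetAttr.StartOffset, offsetAttr.EndOffset));
                }
                ts.End();
            }
            return tokens;
        }

        // Maps each analyzed search term to the offsets where it occurs in fieldContent.
        public static Dictionary<string, List<(int Start, int End)>> FindOffsets(
            Analyzer analyzer, string fieldName, string fieldContent, string searchText)
        {
            var result = new Dictionary<string, List<(int Start, int End)>>();

            // Normalize the search text with the same analyzer (lowercasing, etc.).
            foreach (var queryToken in Analyze(analyzer, fieldName, searchText))
            {
                result[queryToken.Term] = new List<(int Start, int End)>();
            }

            // Record the offsets of every field token that matches a search term.
            foreach (var token in Analyze(analyzer, fieldName, fieldContent))
            {
                if (result.TryGetValue(token.Term, out var offsets))
                {
                    offsets.Add((token.Start, token.End));
                }
            }

            return result;
        }
    }
    ```

    With a StandardAnalyzer, calling FindOffsets(analyzer, "body", "Foo Bar Baz Bar Bat", "Bar BAT") should yield offsets 4-7 and 12-15 for bar, and 16-19 for bat.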