I would like to use Lucene.NET to store and query term vectors. However, I do not want the term vectors to be created from documents. Instead, I want to be able to write and update the term vectors directly, without positions or offsets of the term/token.
The workaround would be to generate text from a term vector, i.e. from the term vector
foo: 3; bar: 1
generate the text
foo, foo, foo, bar
and let Lucene index that text. If I want to update the term frequency of bar to 2, I could get the stored text (or generate it from the old term vector, if I don't store it), change it to
foo, foo, foo, bar, bar
and update the corresponding document in the index.
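To make the workaround concrete, here is a minimal sketch of the text generation step in plain Java (it should port almost line by line to C# for Lucene.NET). The class `TermVectorText` and its methods `expand` and `collapse` are hypothetical helpers of my own, not part of Lucene; the sketch assumes that terms contain no commas or whitespace.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene or Lucene.NET.
final class TermVectorText {

    // Expands a term vector such as {foo=3, bar=1} into "foo, foo, foo, bar".
    static String expand(Map<String, Integer> termVector) {
        List<String> tokens = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : termVector.entrySet()) {
            for (int i = 0; i < entry.getValue(); i++) {
                tokens.add(entry.getKey());
            }
        }
        return String.join(", ", tokens);
    }

    // Rebuilds the term vector from generated text by counting tokens.
    // Assumption: terms themselves contain no commas or whitespace.
    static Map<String, Integer> collapse(String text) {
        Map<String, Integer> termVector = new LinkedHashMap<>();
        for (String token : text.split("[,\\s]+")) {
            if (!token.isEmpty()) {
                termVector.merge(token, 1, Integer::sum);
            }
        }
        return termVector;
    }
}
```

Updating the frequency of bar then means changing the map entry, calling `expand` again, and reindexing the document with the new text.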
This is quite expensive for such a simple task. Obviously, this is not the use case Lucene was built for. Still, I would like to be able to use the power of Lucene for querying, etc.
Is there a way to write term vectors for a document directly or do you have any other good ideas?
As I said in my question, Lucene is not intended for storing and manipulating term vectors directly. The initial approach is more or less the way to go, at least with regard to the process of updating the term vector (in Lucene, Delete followed by Add equals Update). I haven't found a way to update a single term frequency in the vector without reindexing the entire document.
One improvement on the method described in the question is to encode the term vector as term-frequency pairs:
Instead of
foo foo foo bar
the field content can be written as
foo:3; bar:1;
You can then write a custom TokenFilter which reads these tokens one by one and then returns each term n times, where n is the frequency given in the pair. This will not improve performance, but it simplifies handling of the term vectors. If you're not familiar with custom token filters and analyzers, it is probably not worth using this approach, and I would stick with the naive version I already suggested in the question.
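The core logic of such a filter can be sketched without any Lucene dependency. The following is plain Java (it should translate directly to C#); the class `TermFrequencyPairs` and its method `expand` are hypothetical names of my own, and the sketch assumes the pairs are separated by `;` and that the frequency follows the last `:` of each pair. In an actual TokenFilter the same expansion would happen token by token inside `incrementToken()` (`IncrementToken()` in Lucene.NET), and it assumes the tokenizer delivers each pair as a single token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the expansion step, not a Lucene class.
final class TermFrequencyPairs {

    // Parses field content such as "foo:3; bar:1;" and returns every term
    // repeated as many times as its frequency states.
    static List<String> expand(String fieldContent) {
        List<String> tokens = new ArrayList<>();
        for (String rawPair : fieldContent.split(";")) {
            String pair = rawPair.trim();
            if (pair.isEmpty()) {
                continue; // skip the empty piece after a trailing ';'
            }
            int colon = pair.lastIndexOf(':');
            if (colon <= 0) {
                throw new IllegalArgumentException("Malformed pair: " + pair);
            }
            String term = pair.substring(0, colon).trim();
            int frequency = Integer.parseInt(pair.substring(colon + 1).trim());
            for (int i = 0; i < frequency; i++) {
                tokens.add(term);
            }
        }
        return tokens;
    }
}
```

With this encoding, updating the frequency of bar only requires rewriting `bar:1` to `bar:2` in the field content before the document is reindexed.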