I am new in Solr and I have to do a filter to lemmatize text to index documents and also to lemmatize querys.
I created a custom Tokenizer Factory for lemmatized text before passing it to the Standard Tokenizer.
Making tests in Solr analysis section works fairly good (on index ok, but on query sometimes analyzes text two times), but when indexing documents it analyzes only the first documment and on querys it analyses randomly (It only analyzes first, and to analyze another you have to wait a bit time). It's not performance problem because I tried modifyng text instead of lemmatizing.
Here is the code:
package test.solr.analysis;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeSource.AttributeFactory;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
//import test.solr.analysis.TestLemmatizer;
public class TestLemmatizerTokenizerFactory extends TokenizerFactory {
//private TestLemmatizer lemmatizer = new TestLemmatizer();
private final int maxTokenLength;
public TestLemmatizerTokenizerFactory(Map<String,String> args) {
super(args);
assureMatchVersion();
maxTokenLength = getInt(args, "maxTokenLength", StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
if (!args.isEmpty()) {
throw new IllegalArgumentException("Unknown parameters: " + args);
}
}
public String readFully(Reader reader){
char[] arr = new char[8 * 1024]; // 8K at a time
StringBuffer buf = new StringBuffer();
int numChars;
try {
while ((numChars = reader.read(arr, 0, arr.length)) > 0) {
buf.append(arr, 0, numChars);
}
} catch (IOException e) {
e.printStackTrace();
}
System.out.println("### READFULLY ### => " + buf.toString());
/*
The original return with lemmatized text would be this:
return lemmatizer.getLemma(buf.toString());
To test it I only change the text adding "lemmatized" word
*/
return buf.toString() + " lemmatized";
}
@Override
public StandardTokenizer create(AttributeFactory factory, Reader input) {
// I print this to see when enters to the tokenizer
System.out.println("### Standar tokenizer ###");
StandardTokenizer tokenizer = new StandardTokenizer(luceneMatchVersion, factory, new StringReader(readFully(input)));
tokenizer.setMaxTokenLength(maxTokenLength);
return tokenizer;
}
}
With this, it only indexes the first text adding the word "lemmatized" to the text. Then on first query if I search "example" word it looks for "example" and "lemmatized" so it returns me the first document. On next searches it doesn't modify the query. To make a new query adding "lemmatized" word to the query, I have to wait some minutes.
What happens?
Thank you all.
I highly doubt that the create
method is invoked on each query (for starters performance issues come to mind). I would take the safe route and create a Tokenizer
that wraps a StandardTokenizer
, then just override the setReader
method and do my work there