Tags: java, solr, lucene, lemmatization

Problems adding a lemmatizer to a Solr Tokenizer


I'm adding a text lemmatizer to Solr. I have to process the entire text because context is important in lemmatization.

I got this code from the internet and modified it a bit:

http://grokbase.com/t/lucene/solr-user/138d0qn4v0/issue-with-custom-tokenizer

I added our lemmatizer and changed this line

endOffset = word.length();

to this:

endOffset = startOffset + word.length();
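With the original line, offsets restart at every word: for the text "korrika egiten", the first token gets offsets (0, 7), but the second would get endOffset 6, which is before its startOffset of 8. With the corrected line the second token gets (8, 14), as expected.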

Now if I use the Solr Admin Analysis screen, I have no problems with the Index or Query values. I write the phrase, and when I analyse the values, the result is the text correctly lemmatized.

The problems appear when I make queries in the Query section and when I index documents. Checking debugQuery I can see the following. If I query for the text "korrikan" (meaning "running") in "naiz_body", the text is correctly lemmatized:

<str name="rawquerystring">naiz_body:"korrikan"</str>
<str name="querystring">naiz_body:"korrikan"</str>
<str name="parsedquery">naiz_body:korrika</str>
<str name="parsedquery_toString">naiz_body:korrika</str>

Now if I immediately query for the text "jolasten" (meaning "playing"), the text is not lemmatized, and parsedquery and parsedquery_toString are unchanged (they still show the previous query's lemma):

<str name="rawquerystring">naiz_body:"jolasten"</str>
<str name="querystring">naiz_body:"jolasten"</str>
<str name="parsedquery">naiz_body:korrika</str>
<str name="parsedquery_toString">naiz_body:korrika</str>

If I wait a bit (or if I stop Solr and start it again) and then query for the text "jolasten", I get the text correctly lemmatized:

<str name="rawquerystring">naiz_body:"jolasten"</str>
<str name="querystring">naiz_body:"jolasten"</str>
<str name="parsedquery">naiz_body:jolastu</str>
<str name="parsedquery_toString">naiz_body:jolastu</str>

Why?

Here is the code:

package eu.solr.analysis;

import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
// in Lucene 4.x, AttributeFactory is nested in AttributeSource
import org.apache.lucene.util.AttributeSource.AttributeFactory;

public class LemmatizerTokenizer extends Tokenizer {
    private Lemmatizer lemmatizer = new Lemmatizer();
    private List<Token> tokenList = new ArrayList<Token>();
    private int tokenCounter = -1;

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAttribute = addAttribute(OffsetAttribute.class);
    private final PositionIncrementAttribute position = addAttribute(PositionIncrementAttribute.class);

    public LemmatizerTokenizer(AttributeFactory factory, Reader reader) {
        super(factory, reader);
        System.out.println("### Lemmatizer Tokenizer ###");
        // the whole input is read and lemmatized once, here in the constructor
        String textToProcess = null;
        try {
            textToProcess = readFully(reader);
            processText(textToProcess);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public String readFully(Reader reader) throws IOException {
        char[] arr = new char[8 * 1024]; // read 8K at a time
        StringBuilder buf = new StringBuilder();
        int numChars;
        while ((numChars = reader.read(arr, 0, arr.length)) > 0) {
            buf.append(arr, 0, numChars);
        }
        System.out.println("### Read Fully ### => " + buf.toString());
        // lemmatize the whole text at once, so the lemmatizer sees the full context
        return lemmatizer.getLemma(buf.toString());
    }

    public void processText(String textToProcess) {
        System.out.println("### Process Text ### => " + textToProcess);
        String[] wordsList = textToProcess.split(" ");
        int startOffset = 0, endOffset = 0;
        for (String word : wordsList) {
            endOffset = startOffset + word.length();
            Token aToken = new Token(word, startOffset, endOffset);
            aToken.setPositionIncrement(1);
            tokenList.add(aToken);
            startOffset = endOffset + 1; // +1 skips the space separator
        }
    }

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        tokenCounter++;
        System.out.println("### Increment Token ###");
        System.out.println("Token Counter => " + tokenCounter);
        System.out.println("TokenList size => " + tokenList.size());
        if (tokenCounter < tokenList.size()) {
            Token aToken = tokenList.get(tokenCounter);
            System.out.println("Increment Token => " + aToken.toString());
            termAtt.append(aToken);
            termAtt.setLength(aToken.length());
            offsetAttribute.setOffset(correctOffset(aToken.startOffset()),
                    correctOffset(aToken.endOffset()));
            position.setPositionIncrement(aToken.getPositionIncrement());
            return true;
        }
        return false;
    }

    @Override
    public void close() throws IOException {
        System.out.println("### Close ###");
        super.close();
    }

    @Override
    public void end() throws IOException {
        // setting final offset
        System.out.println("### End ###");
        super.end();
    }

    @Override
    public void reset() throws IOException {
        System.out.println("### Reset ###");
        tokenCounter = -1;
        super.reset();
    }
}
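For reference, a custom tokenizer like this is plugged into Solr through a TokenizerFactory that the field type in schema.xml points at. The question doesn't show the factory, so this is only a minimal sketch of what it could look like on Lucene/Solr 4.x (the class name LemmatizerTokenizerFactory is my assumption):

package eu.solr.analysis;

import java.io.Reader;
import java.util.Map;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeSource.AttributeFactory;

public class LemmatizerTokenizerFactory extends TokenizerFactory {

    public LemmatizerTokenizerFactory(Map<String, String> args) {
        super(args);
        if (!args.isEmpty()) {
            // Lucene factories reject parameters they don't understand
            throw new IllegalArgumentException("Unknown parameters: " + args);
        }
    }

    @Override
    public Tokenizer create(AttributeFactory factory, Reader input) {
        return new LemmatizerTokenizer(factory, input);
    }
}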

Thank you all!

edit:

Answer to @alexandre-rafalovitch: the Analysis screen in the Admin UI works well. If I analyse a query or index text there, the text is correctly lemmatized. The problem is in the Query UI. The first query calls the lemmatizer, but the second one seems to use the first lemmatized text from the buffer and calls incrementToken directly. See the code output when I make these queries. In the Analysis UI, if I query for "Korrikan" and then for "Jolasten", it outputs this:

## BasqueLemmatizerTokenizer create
### BasqueLemmatizer Tokenizer ###
### Read Fully ### => korrikan
### Eustagger OUT ### => korrika  
### Process Text ### => korrika  
### Reset ###
### Increment Token ###
Token Counter => 0
TokenList size => 1
Increment Token => korrika
### Increment Token ###
Token Counter => 1
TokenList size => 1

## BasqueLemmatizerTokenizer create
### BasqueLemmatizer Tokenizer ###
### Read Fully ### => Jolasten
### Eustagger OUT ### => jolastu  
### Process Text ### => jolastu  
### Reset ###
### Increment Token ###
Token Counter => 0
TokenList size => 1
Increment Token => jolastu
### Increment Token ###
Token Counter => 1
TokenList size => 1

If I make these queries in the Query UI, it outputs this:

## BasqueLemmatizerTokenizer create
### BasqueLemmatizer Tokenizer ###
### Read Fully ### => korrikan
### Eustagger OUT ### => korrika  
### Process Text ### => korrika  
### Reset ###
### Increment Token ###
Token Counter => 0
TokenList size => 1
Increment Token => korrika
### Increment Token ###
Token Counter => 1
TokenList size => 1
### End ###
### Close ###

### Reset ###
### Increment Token ###
Token Counter => 0
TokenList size => 1
Increment Token => korrika
### Increment Token ###
Token Counter => 1
TokenList size => 1
### End ###
### Close ###

For the second query it doesn't create a new tokenizer; it looks like Solr resets the existing one, but it reads the old text.
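This matches how Lucene reuses a Tokenizer: the instance is created once, and later requests only go through reset() and incrementToken(), so work done only in the constructor sees just the first text. As a minimal sketch (my assumption, not the final solution; it relies on Lucene 4.x, where the protected input field holds the fresh reader after super.reset()), the read-and-lemmatize step could be moved from the constructor into reset():

@Override
public void reset() throws IOException {
    System.out.println("### Reset ###");
    super.reset();
    // rebuild the token list from the *current* input on every reset,
    // not just once in the constructor
    tokenList.clear();
    tokenCounter = -1;
    processText(readFully(input));
}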

I wrote to the code owner and he suggested looking at TrieTokenizer.


Solution

  • Finally I did it!

    I modified the PatternTokenizer first and then used the StandardTokenizer to apply the lemmatizer. In brief, I lemmatize the string from the input and then create a StringReader with the lemmatized text.

    Here is the code; I hope it can be useful for somebody (it modifies the StandardTokenizer source):

    ...

    public String processReader(Reader reader) throws IOException {
        char[] arr = new char[8 * 1024]; // read 8K at a time
        StringBuilder buf = new StringBuilder();
        int numChars;
        while ((numChars = reader.read(arr, 0, arr.length)) > 0) {
            buf.append(arr, 0, numChars);
        }
        // lemmatize the whole input in one pass
        return lemmatizer.getLemma(buf.toString());
    }
    

    ...

    @Override
    public void reset() throws IOException {
        super.reset();
        // reset the scanner with a reader over the lemmatized text instead of the raw input
        scanner.yyreset(new StringReader(processReader(input)));
    }