java lucene collective-intelligence

Correct way to write a Tokenizer in Lucene


I'm trying to analyze the content of a Drupal database for collective intelligence purposes.

So far I've been able to work out a simple example that tokenizes the various contents (mainly forum posts) and counts the tokens after removing stop words.

The StandardTokenizer supplied with Lucene should be able to tokenize hostnames and emails, but the content can also contain embedded HTML, e.g.:

Pubblichiamo la presentazione di IBM riguardante DB2 per i vari sistemi operativi
Linux, UNIX e Windows.\r\n\r\nQuesto documento sta sulla piattaforma KM e lo potete
scaricare a questo <a href=\'https://sfkm.griffon.local/sites/BSF%20KM/BSF/CC%20T/Specifiche/Eventi2008/IBM%20DB2%20for%20Linux,%20UNIX%20e%20Windows.pdf\' target=blank>link</a>.

This gets tokenized badly, like this:

pubblichiamo -> 1
presentazione -> 1
ibm -> 1
riguardante -> 1
db2 -> 1
vari -> 1
sistemi -> 1
operativi -> 1
linux -> 1
unix -> 1
windows -> 1
documento -> 1
piattaforma -> 1
km -> 1
potete -> 1
scaricare -> 1
href -> 1
https -> 1
sfkm.griffon.local -> 1
sites -> 1
bsf -> 1
20km/bsf -> 1
cc -> 1
20t/specifiche/eventi2008/ibm -> 1
20db2 -> 1
20for -> 1
20linux -> 1
20unix -> 1
20e -> 1
20windows.pdf -> 1
target -> 1
blank -> 1
link -> 1

What I would like is to keep links together and strip HTML tags (like <pre> or <strong>) that are useless.

Should I write a Filter or a different Tokenizer? Should the Tokenizer replace the standard one, or can I mix them together? The hardest way would be to copy StandardTokenizerImpl into a new file and add custom behaviour, but I'd rather not go too deep into the Lucene implementation for now (I'm learning gradually).

Maybe there is already something similar implemented but I've been unable to find it.

EDIT: Looking at StandardTokenizerImpl makes me think that extending it by modifying the actual implementation would be less convenient than using lex or flex and doing it myself.


Solution

  • This is most easily achieved by pre-processing the text before giving it to Lucene to tokenize. Use an HTML parser such as Jericho to convert your content into plain text, stripping out the tags you don't care about and extracting the text from those you do. Jericho's TextExtractor is perfect for this and easy to use.

    // Package names as in Jericho 3.x
    import net.htmlparser.jericho.HTMLElementName;
    import net.htmlparser.jericho.Source;
    import net.htmlparser.jericho.StartTag;
    import net.htmlparser.jericho.TextExtractor;

    // Note the trailing spaces, so words are not glued together across the concatenation
    String text = "Pubblichiamo la presentazione di IBM riguardante DB2 per i vari sistemi operativi "
        + "Linux, UNIX e Windows.\r\n\r\nQuesto documento sta sulla piattaforma KM e lo potete "
        + "scaricare a questo <a href=\'https://sfkm.griffon.local/sites/BSF%20KM/BSF/CC%20T/Specifiche/Eventi2008/IBM%20DB2%20for%20Linux,%20UNIX%20e%20Windows.pdf\' target=blank>link</a>.";

    TextExtractor te = new TextExtractor(new Source(text)) {
        @Override
        public boolean excludeElement(StartTag startTag) {
            // Drop the content of every element except <a>
            return !HTMLElementName.A.equals(startTag.getName());
        }
    };
    System.out.println(te.toString());


    This outputs:

    Pubblichiamo la presentazione di IBM riguardante DB2 per i vari sistemi operativi Linux, UNIX e Windows. Questo documento sta sulla piattaforma KM e lo potete scaricare a questo link.

    You could use a custom Lucene Tokenizer with an HTML filter, but it's not the easiest solution; using Jericho will definitely save you development time for this task. The existing HTML analysers for Lucene probably don't do exactly what you want, as they keep all the text on the page. The only caveat is that you end up processing the text twice rather than as a single stream, but unless you are handling terabytes of data you won't care about that cost, and performance is best left until you have your app fleshed out and have identified it as an actual issue anyway.
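If you also want the link target itself to survive pre-processing as one piece (the question asks to "keep links together"), here is a rough, stdlib-only sketch of the same pre-processing idea. It swaps Jericho for plain regular expressions, which is not a robust way to parse HTML in general, so treat it as an illustration only. The class name `HtmlPreprocessor` and the method `stripKeepingLinks` are made up for this example. The URL is deliberately left percent-encoded so that it contains no spaces and stays a single whitespace-delimited token.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or Jericho.
public class HtmlPreprocessor {

    // Opening <a ...> tag; captures the href value whether it is
    // double-quoted (group 1), single-quoted (group 2) or unquoted (group 3).
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Any other tag, opening or closing.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Replaces each opening <a> tag with its URL, then drops all remaining tags.
    public static String stripKeepingLinks(String html) {
        Matcher m = ANCHOR.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String url = m.group(1) != null ? m.group(1)
                       : m.group(2) != null ? m.group(2)
                       : m.group(3);
            // quoteReplacement guards against '$' or '\' inside the URL
            m.appendReplacement(sb, Matcher.quoteReplacement(" " + url + " "));
        }
        m.appendTail(sb);
        // Tags become spaces so that adjacent words are not glued together
        String noTags = TAG.matcher(sb).replaceAll(" ");
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "lo potete scaricare a questo <a href='https://sfkm.griffon.local/sites/BSF%20KM/BSF/CC%20T/Specifiche/Eventi2008/IBM%20DB2%20for%20Linux,%20UNIX%20e%20Windows.pdf' target=blank>link</a>.";
        System.out.println(stripKeepingLinks(html));
    }
}
```

The cleaned string can then be handed to whatever analyzer you already use. Whether the URL stays in one piece after that depends on the tokenizer: a whitespace-based one will keep it whole, while the standard one may still split it, as the token list in the question shows.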