java, nlp, stemming, snowball

Italian stemming library in Java


I'm looking for a Java library (or some other approach) for stemming Italian words.

The goal is to compare Italian words. At the moment, words like "attacco", "attacchi", "attaccare", etc. are considered different; I want them to compare as equal.

I have found Lucene, snowball.tartarus.org, etc. Is there anything else useful, and how can I use them from Java?

Thanks for any answers.


Solution

  • Download Snowball for Java here.

    It includes a class named org.tartarus.snowball.ext.italianStemmer which extends SnowballStemmer.

    To use a SnowballStemmer, take a look at the following test code, which stems the present-tense forms of the verb attaccare:

    import org.junit.Assert;
    import org.junit.Test;
    import org.tartarus.snowball.SnowballStemmer;
    import org.tartarus.snowball.ext.italianStemmer;
    
    public class SnowballItalianStemmerTest {
    
        @Test
        public void testSnowballItalianStemmerAttaccare() {
    
            // italianStemmer extends SnowballStemmer, so no cast is needed
            SnowballStemmer stemmer = new italianStemmer();
    
            // Present-tense forms of "attaccare"
            String[] tokens = "attacco attacchi attacca attacchiamo attaccate attaccano".split(" ");
            for (String token : tokens) {
                // Load the word, run the stemmer, then read back the stem
                stemmer.setCurrent(token);
                stemmer.stem();
                String stemmed = stemmer.getCurrent();
                Assert.assertEquals("attacc", stemmed);
                System.out.println(stemmed);
            }
    
        }
    
    }
    

    Output:

    attacc
    attacc
    attacc
    attacc
    attacc
    attacc
    

    For another usage example, see TestApp.java, included in the same tgz file.

    Lucene, which is written in Java, also uses Snowball for stemming, for example through its SnowballFilter token filter.
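
Since the goal in the question is to make words such as "attacco" and "attacchi" compare as equal, one way to get there is to compare their stems rather than the words themselves. The following is only a minimal sketch: the class name StemCompare, the method sameStem and the PLACEHOLDER stemmer are hypothetical names introduced for illustration. The stemming step is passed in as a function so that the snippet compiles without Snowball on the classpath, and PLACEHOLDER is a crude prefix truncation, not a real Italian stemmer.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

class StemCompare {

    // Stand-in for a real stemmer (an assumption for this sketch only):
    // it just keeps the first six characters of the word.
    static final UnaryOperator<String> PLACEHOLDER =
            w -> w.substring(0, Math.min(6, w.length()));

    // Returns true when both words reduce to the same stem under the given stemmer.
    static boolean sameStem(String a, String b, UnaryOperator<String> stemmer) {
        String stemA = stemmer.apply(a.toLowerCase(Locale.ITALIAN));
        String stemB = stemmer.apply(b.toLowerCase(Locale.ITALIAN));
        return stemA.equals(stemB);
    }

    public static void main(String[] args) {
        // With Snowball on the classpath, the stemmer argument could instead
        // wrap the calls used in the test above, for example:
        //   w -> { SnowballStemmer s = new italianStemmer();
        //          s.setCurrent(w); s.stem(); return s.getCurrent(); }
        System.out.println(sameStem("attacco", "attacchi", PLACEHOLDER));
        System.out.println(sameStem("attacco", "casa", PLACEHOLDER));
    }
}
```

Passing the stemmer in as a function keeps the comparison logic independent of the stemming library, so the placeholder can be swapped for the Snowball-based lambda shown in the comment without touching sameStem.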