
How to use OpenNLP with Java?


I want to POS-tag an English sentence and then do some processing on the result. I would like to use OpenNLP, and I have it installed.

When I execute the command

    I:\Workshop\Programming\nlp\opennlp-tools-1.5.0-bin\opennlp-tools-1.5.0>java -jar opennlp-tools-1.5.0.jar POSTagger models\en-pos-maxent.bin < Text.txt

it produces output that POS-tags the input in Text.txt:

    Loading POS Tagger model ... done (4.009s)
    My_PRP$ name_NN is_VBZ Shabab_NNP i_FW am_VBP 22_CD years_NNS old._.

    Average: 66.7 sent/s
    Total: 1 sent
    Runtime: 0.015s

I take it that means it is installed properly?

Now, how do I do this POS tagging from inside a Java application? I have added the OpenNLP tools, JWNL, and maxent JARs to the project, but how do I invoke the POS tagger?


Solution

  • Here's some (old) sample code I threw together, with modernized code to follow:

    package opennlp;
    
    import opennlp.tools.cmdline.PerformanceMonitor;
    import opennlp.tools.cmdline.postag.POSModelLoader;
    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSSample;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.tokenize.WhitespaceTokenizer;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    
    import java.io.File;
    import java.io.IOException;
    import java.io.StringReader;
    
    public class OpenNlpTest {
        public static void main(String[] args) throws IOException {
            // POSModelLoader is the command-line loader; it prints the "Loading ..." line.
            POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
            PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
            POSTaggerME tagger = new POSTaggerME(model);

            String input = "Can anyone help me dig through OpenNLP's horrible documentation?";
            ObjectStream<String> lineStream =
                    new PlainTextByLineStream(new StringReader(input));

            perfMon.start();
            String line;
            while ((line = lineStream.read()) != null) {
                // Splits on whitespace only, so punctuation stays attached to its word.
                String[] whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE.tokenize(line);
                String[] tags = tagger.tag(whitespaceTokenizerLine);

                // POSSample pairs each token with its tag; toString() renders token_TAG.
                POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
                System.out.println(sample.toString());

                perfMon.incrementCounter();
            }
            perfMon.stopAndPrintFinalResult();
        }
    }
    

    The output is:

    Loading POS Tagger model ... done (2.045s)
    Can_MD anyone_NN help_VB me_PRP dig_VB through_IN OpenNLP's_NNP horrible_JJ documentation?_NN
    
    Average: 76.9 sent/s 
    Total: 1 sent
    Runtime: 0.013s
    

    This basically works from the POSTaggerTool class included as part of OpenNLP. sample.getTags() returns a String array that holds the tags themselves.
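
If all you have is the printed token_TAG line (say, captured from the command-line tool), the pairs can be recovered with plain string handling. The following is only a sketch under that assumption: TaggedOutputParser and TaggedToken are hypothetical names made up for illustration and are not part of OpenNLP. It splits each pair at the last underscore, so tags such as PRP$ and tokens such as "old." come through intact.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not an OpenNLP class): parses "token_TAG token_TAG ..." lines.
final class TaggedOutputParser {

    /** One token together with its part-of-speech tag. */
    static final class TaggedToken {
        public final String token;
        public final String tag;

        TaggedToken(String token, String tag) {
            this.token = token;
            this.tag = tag;
        }

        @Override
        public String toString() {
            return token + " -> " + tag;
        }
    }

    /** Splits a tagged line on whitespace, then each pair at its LAST underscore. */
    static List<TaggedToken> parse(String taggedLine) {
        List<TaggedToken> result = new ArrayList<>();
        String trimmed = taggedLine.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        for (String pair : trimmed.split("\\s+")) {
            int sep = pair.lastIndexOf('_');
            // Reject pairs with no underscore, an empty token, or an empty tag.
            if (sep <= 0 || sep == pair.length() - 1) {
                throw new IllegalArgumentException("Not a token_TAG pair: " + pair);
            }
            result.add(new TaggedToken(pair.substring(0, sep), pair.substring(sep + 1)));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "Can_MD anyone_NN help_VB me_PRP dig_VB through_IN "
                + "OpenNLP's_NNP horrible_JJ documentation?_NN";
        for (TaggedToken t : parse(line)) {
            System.out.println(t);
        }
    }
}
```

Inside a Java application you would normally skip this step and read the tokens and tags straight from the arrays you already have; the parser is only useful when the text output is your starting point.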

    POSModelLoader requires direct file access to the model file, which is really, really lame.

    The updated code for this is a little different (and probably more useful).

    First, a Maven POM:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
    
        <groupId>org.javachannel</groupId>
        <artifactId>opennlp-example</artifactId>
        <version>1.0-SNAPSHOT</version>
        <dependencies>
            <dependency>
                <groupId>org.apache.opennlp</groupId>
                <artifactId>opennlp-tools</artifactId>
                <version>1.6.0</version>
            </dependency>
            <dependency>
                <groupId>org.testng</groupId>
                <artifactId>testng</artifactId>
                <version>[6.8.21,)</version>
                <scope>test</scope>
            </dependency>
        </dependencies>
        <build>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.1</version>
                    <configuration>
                        <source>1.8</source>
                        <target>1.8</target>
                    </configuration>
                </plugin>
            </plugins>
        </build>
    </project>
    

    And here's the code, written as a test, therefore located in ./src/test/java/org/javachannel/opennlp/example:

    package org.javachannel.opennlp.example;
    
    import opennlp.tools.cmdline.PerformanceMonitor;
    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSSample;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.tokenize.WhitespaceTokenizer;
    import org.testng.annotations.DataProvider;
    import org.testng.annotations.Test;
    
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.net.URL;
    import java.nio.channels.Channels;
    import java.nio.channels.ReadableByteChannel;
    import java.util.stream.Stream;
    
    public class POSTest {
        private void download(String url, File destination) throws IOException {
            URL website = new URL(url);
            // try-with-resources closes the channel and the file even if the transfer fails
            try (ReadableByteChannel rbc = Channels.newChannel(website.openStream());
                 FileOutputStream fos = new FileOutputStream(destination)) {
                fos.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
            }
        }
    
        @DataProvider
        Object[][] getCorpusData() {
            return new Object[][][]{{{
                    "Can anyone help me dig through OpenNLP's horrible documentation?"
            }}};
        }
    
        @Test(dataProvider = "getCorpusData")
        public void showPOS(Object[] input) throws IOException {
            File modelFile = new File("en-pos-maxent.bin");
            if (!modelFile.exists()) {
                System.out.println("Downloading model.");
                download("http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin", modelFile);
            }
            POSModel model = new POSModel(modelFile);
            PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
            POSTaggerME tagger = new POSTaggerME(model);
    
            perfMon.start();
            Stream.of(input).map(line -> {
                String[] whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE.tokenize(line.toString());
                String[] tags = tagger.tag(whitespaceTokenizerLine);
    
                POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
    
                perfMon.incrementCounter();
                return sample.toString();
            }).forEach(System.out::println);
            perfMon.stopAndPrintFinalResult();
        }
    }
    

    This code doesn't actually test anything - it's a smoke test, if anything - but it should serve as a starting point. Another (potentially) nice thing is that it downloads a model for you if you don't have it downloaded already.
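
One caveat with a "download if the file is missing" check is that an interrupted download can leave a partial file behind, which the next run would then treat as a valid model. A minimal sketch of one way around that, using only the JDK: write to a temporary file first and move it into place once the copy has finished. ModelFetcher and fetchIfMissing are hypothetical names used for illustration, and the demo in main uses a local file: URL as a stand-in for the real model URL so it runs without network access.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper (not part of OpenNLP): fetches a file only when it is absent.
final class ModelFetcher {

    /**
     * Copies the resource at source to destination unless destination already exists.
     * Returns true if a copy happened, false if the existing file was kept.
     */
    static boolean fetchIfMissing(URL source, Path destination) throws IOException {
        if (Files.exists(destination)) {
            return false;
        }
        Path parent = destination.toAbsolutePath().getParent();
        Files.createDirectories(parent);
        // Write next to the destination so the final move stays on one file system.
        Path partial = Files.createTempFile(parent, "model-", ".part");
        try (InputStream in = source.openStream()) {
            Files.copy(in, partial, StandardCopyOption.REPLACE_EXISTING);
            // Only a completed copy is ever visible under the destination name.
            Files.move(partial, destination, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // No-op after a successful move; cleans up if the copy failed.
            Files.deleteIfExists(partial);
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in "model": a small local file addressed through a file: URL.
        Path source = Files.createTempFile("fake-model", ".bin");
        Files.write(source, new byte[] {1, 2, 3});
        Path target = source.resolveSibling(source.getFileName() + ".copy");

        System.out.println(fetchIfMissing(source.toUri().toURL(), target)); // first call copies
        System.out.println(fetchIfMissing(source.toUri().toURL(), target)); // second call is a no-op

        Files.deleteIfExists(target);
        Files.deleteIfExists(source);
    }
}
```

In the test above, this would take the place of the exists() check plus download() call, with the model URL and a path to en-pos-maxent.bin passed in.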