Extract a linguistic structure from a POS-tagged sentence using Stanford NLP in Java


I am new to natural language processing (NLP). I want to run part-of-speech (POS) tagging and then find a specific structure within a text. I managed the POS tagging with Stanford NLP, but I do not know how to extract this structure:

NN/NNS + IN + DT + NN/NNS/NNP/NNPS

public static void main(String args[]) throws Exception{
    //input File
    String contentFilePath = "";
    //outputFile
    String triplesFilePath = contentFilePath.substring(0, contentFilePath.length()-4)+"_postagg.txt";

    //document to POS tagging
    String content = getFileContent(contentFilePath);

    Properties props = new Properties();

    props.setProperty("annotators","tokenize, ssplit, pos");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    // Annotate the document.
    Annotation doc = new Annotation(content);
    pipeline.annotate(doc);


    // Print every token together with its POS tag.
    List<CoreMap> sentences = doc.get(CoreAnnotations.SentencesAnnotation.class);
    for (CoreMap sentence : sentences) {
        for (CoreLabel token: sentence.get(CoreAnnotations.TokensAnnotation.class)) {
            String word = token.get(CoreAnnotations.TextAnnotation.class);
            // this is the POS tag of the token
            String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
            System.out.println(word + "/" + pos);
        }
    }
}

Solution

  • You can simply iterate over the tokens of each sentence and check their POS tags. Wherever four consecutive tags match your pattern, extract the corresponding words. The code for that could look like this:

    for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
        List<CoreLabel> tokens = sentence.get(CoreAnnotations.TokensAnnotation.class);
        // Stop three tokens before the end so that i + 3 is always a valid index.
        for (int i = 0; i < tokens.size() - 3; i++) {
            String pos = tokens.get(i).getString(CoreAnnotations.PartOfSpeechAnnotation.class);
            if (pos.equals("NN") || pos.equals("NNS")) {
                pos = tokens.get(i + 1).getString(CoreAnnotations.PartOfSpeechAnnotation.class);
                if (pos.equals("IN")) {
                    pos = tokens.get(i + 2).getString(CoreAnnotations.PartOfSpeechAnnotation.class);
                    if (pos.equals("DT")) {
                        pos = tokens.get(i + 3).getString(CoreAnnotations.PartOfSpeechAnnotation.class);
                        // contains("NN") accepts NN, NNS, NNP and NNPS
                        if (pos.contains("NN")) {
                            // We have a match starting at index i and ending at index i + 3
                            String word1 = tokens.get(i).getString(CoreAnnotations.TextAnnotation.class);
                            String word2 = tokens.get(i + 1).getString(CoreAnnotations.TextAnnotation.class);
                            String word3 = tokens.get(i + 2).getString(CoreAnnotations.TextAnnotation.class);
                            String word4 = tokens.get(i + 3).getString(CoreAnnotations.TextAnnotation.class);
                            System.out.println(word1 + " " + word2 + " " + word3 + " " + word4);
                        }
                    }
                }
            }
        }
    }
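
If the nesting gets hard to read, or you expect to change the pattern later, the same sliding-window check can be written against a small table of allowed tags. The following is only a sketch under a few assumptions: the class name PosPatternMatcher and the method findMatches are made up for illustration and are not part of Stanford CoreNLP, and the words and tags are assumed to have already been copied out of the CoreLabel tokens into two parallel lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of CoreNLP: matches a fixed POS pattern
// against parallel lists of words and tags.
class PosPatternMatcher {

    // Allowed tags per position: NN/NNS + IN + DT + NN/NNS/NNP/NNPS.
    private static final String[][] PATTERN = {
        {"NN", "NNS"},
        {"IN"},
        {"DT"},
        {"NN", "NNS", "NNP", "NNPS"}
    };

    // Returns the words of every window whose tags match PATTERN position by position.
    static List<String> findMatches(List<String> words, List<String> tags) {
        List<String> matches = new ArrayList<>();
        for (int start = 0; start + PATTERN.length <= tags.size(); start++) {
            boolean ok = true;
            for (int k = 0; k < PATTERN.length && ok; k++) {
                ok = Arrays.asList(PATTERN[k]).contains(tags.get(start + k));
            }
            if (ok) {
                matches.add(String.join(" ", words.subList(start, start + PATTERN.length)));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hand-tagged example sentence; real tags would come from the pipeline.
        List<String> words = Arrays.asList("The", "members", "of", "the", "committee", "left");
        List<String> tags = Arrays.asList("DT", "NNS", "IN", "DT", "NN", "VBD");
        System.out.println(findMatches(words, tags)); // prints [members of the committee]
    }
}
```

To use it with the pipeline above, you would fill the two lists per sentence from each token's text and POS tag and then call findMatches once per sentence. Changing the structure you are looking for then only means editing the PATTERN table.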