java regex nlp opennlp

Extract the nouns and the original sentence from POS-tagged text


I want to extract the nouns from a POS-tagged sentence and also recover the original sentence from the tagged text.

 // Extract the words before _NNP and _NN from the tagged text below, and recover the original sentence from it.
 Original sentence: Hi. How are you? This is Mike.
 POS-tagged: Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN

I tried something like this:

    String txt = "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN";

    String re1 = "((?:[a-z][a-z0-9_]*))";   // Word: a letter followed by letters, digits, or underscores
    String re2 = ".*?"; // Non-greedy match on filler
    String re3 = "(_)"; // Literal underscore between word and tag
    String re4 = "(NNP)";   // The NNP tag

    Pattern p = Pattern.compile(re1 + re2 + re3 + re4, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    Matcher m = p.matcher(txt);
    if (m.find()) {
        String var1 = m.group(1);
        System.out.print(var1);
    }

This prints only Hi, but I need a list of all the nouns in the sentence.


Solution

  • To extract the nouns, you can do this:

    public static String[] extractNouns(String sentenceWithTags) {
        // Split the input at every noun tag: "_NN" followed by up to two more
        // word characters (covers "_NN", "_NNS", "_NNP", and "_NNPS")
        String[] nouns = sentenceWithTags.split("_NN\\w?\\w?\\b");
        // Keep only the last word (which is the noun) of every String in the array
        for(int index = 0; index < nouns.length; index++) {
            nouns[index] = nouns[index].substring(nouns[index].lastIndexOf(" ") + 1)
            // Remove all non-word characters from extracted Nouns
            .replaceAll("[^\\p{L}\\p{Nd}]", "");
        }
        return nouns;
    }
    

    To extract the original sentence, you can do this:

    public static String extractOriginal(String sentenceWithTags) {
        return sentenceWithTags.replaceAll("_([A-Z]*)\\b", "");
    }
    

    Proof that it works:

    public static void main(String[] args) {
        String sentence = "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN";
        System.out.println(java.util.Arrays.toString(extractNouns(sentence)));
        System.out.println(extractOriginal(sentence));
    }
    

    Output:

    [Hi, Mike]
    Hi. How are you? This is Mike.
    

    Note: the regex that removes all non-letter, non-digit characters (such as punctuation) from the extracted nouns comes from this Stack Overflow question/answer.
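
One caveat with the split-based extractNouns above: it relies on the input ending with a noun tag. For an input such as "Mike_NNP runs_VBZ", the text after the last noun tag is left over as an extra array element, so the result also contains "runsVBZ". Likewise, the `_([A-Z]*)\\b` pattern in extractOriginal only strips uppercase letters, so a tag like `PRP$` would leave a stray `$` behind. A token-based variant may be more robust. The following is only a sketch, assuming tokens are separated by whitespace and the tag follows the last underscore of each token; the class name `TaggedTokens` and the method names `nouns` and `original` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class TaggedTokens {

    // Collects the words whose tag starts with "NN" (NN, NNS, NNP, NNPS).
    static List<String> nouns(String tagged) {
        List<String> result = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            int sep = token.lastIndexOf('_');
            if (sep < 0) {
                continue; // token carries no tag, so it cannot be classified
            }
            String word = token.substring(0, sep);
            String tag = token.substring(sep + 1);
            if (tag.startsWith("NN")) {
                // Drop everything that is not a letter or a decimal digit
                String cleaned = word.replaceAll("[^\\p{L}\\p{Nd}]", "");
                if (!cleaned.isEmpty()) {
                    result.add(cleaned);
                }
            }
        }
        return result;
    }

    // Rebuilds the sentence by cutting each token at its last underscore.
    static String original(String tagged) {
        StringBuilder sb = new StringBuilder();
        for (String token : tagged.trim().split("\\s+")) {
            int sep = token.lastIndexOf('_');
            String word = sep < 0 ? token : token.substring(0, sep);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(word);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN";
        System.out.println(nouns(s));                    // [Hi, Mike]
        System.out.println(original(s));                 // Hi. How are you? This is Mike.
        System.out.println(nouns("Mike_NNP runs_VBZ"));  // [Mike]
    }
}
```

Because each token is inspected on its own, the position of the nouns in the sentence no longer matters, and the tag can contain any characters. Words that themselves contain an underscore still work, since only the last underscore is treated as the separator.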