I am new to Natural Language Processing (NLP). I want to run part-of-speech (POS) tagging and then find a specific structure within a text. I managed the POS tagging with Stanford CoreNLP, but I do not know how to extract this structure:
NN/NNS + IN + DT + NN/NNS/NNP/NNPS
public static void main(String[] args) throws Exception {
    // Input file
    String contentFilePath = "";
    // Output file: drops a 4-character extension such as ".txt" from the input path
    String triplesFilePath = contentFilePath.substring(0, contentFilePath.length() - 4) + "_postagg.txt";
    // Document to POS-tag
    String content = getFileContent(contentFilePath);
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    // Annotate the document.
    Annotation doc = new Annotation(content);
    pipeline.annotate(doc);
    // Print every token of every sentence together with its POS tag.
    List<CoreMap> sentences = doc.get(CoreAnnotations.SentencesAnnotation.class);
    for (CoreMap sentence : sentences) {
        for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
            String word = token.get(CoreAnnotations.TextAnnotation.class);
            // this is the POS tag of the token
            String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
            System.out.println(word + "/" + pos);
        }
    }
}
You can iterate over the tokens of each sentence and check their POS tags. Whenever four consecutive tags match your pattern, you can extract the corresponding words. The code for that could look like this:
for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
    List<CoreLabel> tokens = sentence.get(CoreAnnotations.TokensAnnotation.class);
    // Stop three tokens before the end so that i + 3 is always a valid index.
    for (int i = 0; i < tokens.size() - 3; i++) {
        String pos = tokens.get(i).get(CoreAnnotations.PartOfSpeechAnnotation.class);
        if (pos.equals("NN") || pos.equals("NNS")) {
            pos = tokens.get(i + 1).get(CoreAnnotations.PartOfSpeechAnnotation.class);
            if (pos.equals("IN")) {
                pos = tokens.get(i + 2).get(CoreAnnotations.PartOfSpeechAnnotation.class);
                if (pos.equals("DT")) {
                    pos = tokens.get(i + 3).get(CoreAnnotations.PartOfSpeechAnnotation.class);
                    // Matches NN, NNS, NNP and NNPS.
                    if (pos.startsWith("NN")) {
                        // We have a match starting at index i and ending at index i + 3
                        String word1 = tokens.get(i).get(CoreAnnotations.TextAnnotation.class);
                        String word2 = tokens.get(i + 1).get(CoreAnnotations.TextAnnotation.class);
                        String word3 = tokens.get(i + 2).get(CoreAnnotations.TextAnnotation.class);
                        String word4 = tokens.get(i + 3).get(CoreAnnotations.TextAnnotation.class);
                        System.out.println(word1 + " " + word2 + " " + word3 + " " + word4);
                    }
                }
            }
        }
    }
}
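If you expect to try several patterns, the nested ifs get unwieldy. Here is a minimal sketch of the same sliding-window check driven by a list of allowed tag sets, one set per position. It assumes you have already copied the words and tags of a sentence into two parallel lists of strings; the class name PosPatternMatcher and the method findMatches are made up for this example and are not part of Stanford CoreNLP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Stanford CoreNLP.
class PosPatternMatcher {

    // One set of allowed tags per position: NN/NNS + IN + DT + NN/NNS/NNP/NNPS.
    static final List<Set<String>> PATTERN = List.of(
            Set.of("NN", "NNS"),
            Set.of("IN"),
            Set.of("DT"),
            Set.of("NN", "NNS", "NNP", "NNPS"));

    // Returns every run of words whose tags match PATTERN position by position.
    // words and tags must be parallel lists of equal length.
    static List<String> findMatches(List<String> words, List<String> tags) {
        List<String> matches = new ArrayList<>();
        for (int i = 0; i + PATTERN.size() <= tags.size(); i++) {
            boolean ok = true;
            for (int j = 0; j < PATTERN.size() && ok; j++) {
                ok = PATTERN.get(j).contains(tags.get(i + j));
            }
            if (ok) {
                matches.add(String.join(" ", words.subList(i, i + PATTERN.size())));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hand-written tags for illustration; in practice they come from the tagger.
        List<String> words = List.of("The", "owner", "of", "the", "house", "left");
        List<String> tags = List.of("DT", "NN", "IN", "DT", "NN", "VBD");
        System.out.println(findMatches(words, tags));
    }
}
```

To use it with the pipeline above, fill the two lists from each sentence's tokens (the TextAnnotation and PartOfSpeechAnnotation values) and call findMatches once per sentence. Changing the pattern then only means editing PATTERN.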