java, stanford-nlp, text-extraction

How to get NN and NNS from a text?


I want to extract the words tagged NN or NNS from the sample text in the script below. When I run that code, the output is:

types
synchronization
phase
synchronization
-RSB-
synchronization
-LSB-
-RSB-
projection
synchronization

Why am I getting [-RSB-] and [-LSB-] here? Should I use a different pattern to match NN and NNS at the same time?

    // `tagger`, `parser`, `parserr`, `rs`, `docs_terms` and `docs_terms_unq` are defined elsewhere.
    String atic = "So far, many different types of synchronization have been investigated, such as complete synchronization [8], generalized synchronization [9], phase synchronization [10], lag synchronization [11], projection synchronization [12, 13], and so forth.";

    Reader reader = new StringReader(atic);
    DocumentPreprocessor dp = new DocumentPreprocessor(reader);
    docs_terms_unq.put(rs.getString("u"), new ArrayList<String>());
    docs_terms.put(rs.getString("u"), new ArrayList<String>());

    // Matches any node whose basic category is NN or NNS; compiled once, outside the loop.
    TregexPattern NPpattern = TregexPattern.compile("@NN|NNS");

    for (List<HasWord> sentence : dp) {
        List<TaggedWord> tagged = tagger.tagSentence(sentence);
        GrammaticalStructure gs = parser.predict(tagged);

        // The untagged sentence is parsed here, so the tree's POS tags
        // are assigned by the parser itself, not by `tagger`.
        Tree x = parserr.parse(sentence);
        System.out.println(x);
        TregexMatcher matcher = NPpattern.matcher(x);

        while (matcher.findNextMatchingNode()) {
            Tree match = matcher.getMatch();
            ArrayList<Label> hh = match.yield();
            System.out.println(hh.toString());
        }
    }

Solution

  • I do not know why those tokens are coming up, but you will get more accurate POS tags from the part-of-speech tagger than from the parser. I would suggest reading the tags directly from the Annotation. Here is some sample code:

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;
    
    import java.util.Properties;
    
    public class NNExample {
    
        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize,ssplit,pos");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
            String text = "So far, many different types of synchronization have been investigated, such as complete " +
                    "synchronization [8], generalized synchronization [9], phase synchronization [10], " +
                    "lag synchronization [11], projection synchronization [12, 13], and so forth.";
            Annotation annotation = new Annotation(text);
            pipeline.annotate(annotation);
            for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
                for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                    String partOfSpeechTag = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                    if (partOfSpeechTag.equals("NN") || partOfSpeechTag.equals("NNS")) {
                        System.out.println(token.word());
                    }
                }
            }
        }
    }
    

    And this is the output I get:

    types
    synchronization
    synchronization
    synchronization
    phase
    synchronization
    lag
    synchronization
    projection
    synchronization
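
On the -LSB- and -RSB- entries in the question's output: these are the Penn Treebank escape forms that the tokenizer writes for `[` and `]`, so they appear whenever such a token ends up with a noun tag. If they still slip through in some setup, one option is to filter them out after tagging. The following is only a minimal sketch of that idea in plain Java: `NounFilterSketch`, `nouns` and `BRACKET_ESCAPES` are hypothetical names, not part of Stanford CoreNLP, and the tags in `main` are made up to imitate the mis-tagged brackets from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Stanford CoreNLP.
class NounFilterSketch {

    // Penn Treebank escape forms for (, ), [, ], { and }.
    private static final Set<String> BRACKET_ESCAPES = new HashSet<>(
            Arrays.asList("-LRB-", "-RRB-", "-LSB-", "-RSB-", "-LCB-", "-RCB-"));

    // Takes parallel lists of words and POS tags and returns the words tagged
    // NN or NNS, skipping bracket escapes even if they carry a noun tag.
    static List<String> nouns(List<String> words, List<String> tags) {
        if (words.size() != tags.size()) {
            throw new IllegalArgumentException("words and tags must have the same length");
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i);
            String tag = tags.get(i);
            boolean isNoun = "NN".equals(tag) || "NNS".equals(tag);
            if (isNoun && !BRACKET_ESCAPES.contains(word)) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up tags: the brackets are deliberately mis-tagged as nouns.
        List<String> words = Arrays.asList("phase", "synchronization", "-LSB-", "10", "-RSB-");
        List<String> tags = Arrays.asList("NN", "NN", "NN", "CD", "NNS");
        System.out.println(nouns(words, tags)); // prints [phase, synchronization]
    }
}
```

In the sample code above, the same check could sit next to the `partOfSpeechTag.equals(...)` test, using `token.word()` as the word to compare against the escape set.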