java-7 stanford-nlp lemmatization

Manual tagging of words using Stanford CoreNLP


I have a resource in which I know the exact type (part of speech) of each word. I have to lemmatize the words, but to get correct results I have to tag them manually. I could not find any code for manually tagging words. I am using the following code, but it returns the wrong result: "painting" for "painting", where I expect "paint".

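// Lemmatization with automatic POS tagging (this is the version that returns "painting")
import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
String text = "painting";
Annotation document = pipeline.process(text);

List<CoreMap> sentences = document.get(SentencesAnnotation.class);

for (CoreMap sentence : sentences) {
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        String word = token.get(TextAnnotation.class);
        // The lemma here depends on the tag chosen by the automatic POS tagger
        String lemma = token.get(LemmaAnnotation.class);
        System.out.println("lemmatized version :" + lemma);
    }
}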

I have to run the lemmatizer on individual words, not on sentences where POS tagging is done automatically. So I would first tag the words manually and then find their lemmas. Some sample code or a pointer to a relevant site would be a great help.


Solution

  • If you know the POS tags in advance, you can get the lemmata the following way:

    // Needs java.util.List, java.util.Properties, edu.stanford.nlp.process.Morphology,
    // plus the CoreNLP pipeline, CoreLabel, CoreMap and CoreAnnotations classes.
    Properties props = new Properties();
    // Only tokenize and split sentences; no automatic "pos" or "lemma" annotators
    props.setProperty("annotators", "tokenize, ssplit");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
    String text = "painting";

    Morphology morphology = new Morphology();

    Annotation document = pipeline.process(text);

    List<CoreMap> sentences = document.get(SentencesAnnotation.class);

    for (CoreMap sentence : sentences) {
      for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        String word = token.get(TextAnnotation.class);
        String tag = ... //get the tag for the current word from somewhere, e.g. an array
        // Lemmatize with the manually supplied tag instead of a predicted one
        String lemma = morphology.lemma(word, tag);
        System.out.println("lemmatized version :" + lemma);
      }
    }
    

    If you only need the lemma of a single word, you don't even have to run CoreNLP for tokenization and sentence splitting; you can call the lemma function directly, as follows:

    String tag = "VBG";      
    String word = "painting";
    Morphology morphology = new Morphology();
    String lemma = morphology.lemma(word, tag);
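
For the "get the tag from somewhere, e.g. an array" step, one possible arrangement is to keep the words and their manually assigned tags in two parallel arrays and walk them together. The following is only a sketch under assumptions: the class `ManualTagLemmatizer`, the `Lemmatizer` interface and the `lemmatizeAll` method are hypothetical names made up for illustration and are not part of CoreNLP. The stand-in lemmatizer in `main` just echoes its input so that the example compiles and runs without the CoreNLP jars.

```java
import java.util.ArrayList;
import java.util.List;

public class ManualTagLemmatizer {

    /** Hypothetical hook (not a CoreNLP type): maps a (word, tag) pair to a lemma. */
    public interface Lemmatizer {
        String lemma(String word, String tag);
    }

    /**
     * Lemmatizes words[i] using the manually assigned tag tags[i].
     * Parallel arrays allow the same word to carry different tags
     * at different positions.
     */
    public static List<String> lemmatizeAll(String[] words, String[] tags,
                                            Lemmatizer lemmatizer) {
        if (words.length != tags.length) {
            throw new IllegalArgumentException("words and tags must have the same length");
        }
        List<String> lemmas = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            lemmas.add(lemmatizer.lemma(words[i], tags[i]));
        }
        return lemmas;
    }

    public static void main(String[] args) {
        // Manually tagged input: the same surface form with two different tags
        String[] words = {"painting", "painting"};
        String[] tags = {"VBG", "NN"};

        // Stand-in only: echoes "word/tag" instead of computing a real lemma.
        // Replace it with a real lemmatizer when CoreNLP is on the classpath.
        Lemmatizer standIn = new Lemmatizer() {
            @Override
            public String lemma(String word, String tag) {
                return word + "/" + tag;
            }
        };

        System.out.println(lemmatizeAll(words, tags, standIn));
    }
}
```

With CoreNLP available, the stand-in would presumably be replaced by a `Lemmatizer` whose `lemma` method simply delegates to `morphology.lemma(word, tag)` from the snippets above; the rest of the loop stays the same.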