nlp, stanford-nlp

Function vs Content Words


How do I distinguish between function/structure words and content/lexical words?

I am already using StanfordCoreNLP, so I would like to leverage it, if possible.

More specifically, which annotator should I use and how would it mark content/lexical words?

I tried pos but it does not distinguish between function and content words.

PS. I use the lemma annotator to get the words which I want to ignore.

PPS. I use pycorenlp.


Solution

  • Function words (stop words) are often manually curated because they vary by domain. You can find a general-purpose list in NLTK; CoreNLP also ships with its own list.

    from nltk.corpus import stopwords  # requires a one-time nltk.download('stopwords')
    stops = stopwords.words('english')
    

    However, you should still look at the list to see if it makes sense for your use case. I have recently been working with technical language, so I removed 'it' from my list because 'IT' is an acronym in this domain and thus a content word.
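Domain tweaks like the one above are just set edits. A minimal sketch, using a small hand-picked excerpt of a stop list rather than the full NLTK one:

```python
# General-purpose function words (a short illustrative excerpt, not the full list)
stops = {'the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'it', 'is'}

# In a technical domain where 'IT' means information technology,
# 'it' becomes a content word, so drop it from the function-word set.
stops.discard('it')
```

The same pattern works in reverse: `stops.add(...)` for domain-specific filler words you want to treat as function words.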

    For your annotator, you could go with the general-purpose TokenizerAnnotator, which will split your text into "words". You can then check each word against your stopword list. If you are working with raw strings, just split them on whitespace and remove or mark stopwords as a quick gut check.
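That whitespace-split gut check can be sketched in a few lines (the function name and labels here are my own; a real tokenizer would also handle punctuation):

```python
def mark_words(text, stops):
    """Split on whitespace and label each token as 'function' or 'content'.

    A rough check, not a real tokenizer: punctuation stays attached to tokens.
    """
    return [(tok, 'function' if tok.lower() in stops else 'content')
            for tok in text.split()]

# Small illustrative stop list
stops = {'the', 'a', 'of', 'is', 'in'}
print(mark_words('the cat sat in the sun', stops))
# → [('the', 'function'), ('cat', 'content'), ('sat', 'content'),
#    ('in', 'function'), ('the', 'function'), ('sun', 'content')]
```

Once this looks right, swap the `text.split()` for the tokens produced by TokenizerAnnotator and keep the same lookup.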