How do I distinguish between function/structure words and content/lexical words?
I am already using StanfordCoreNLP, so I would like to leverage it, if possible.
More specifically, which annotator should I use and how would it mark content/lexical words?
I tried the pos annotator, but it does not distinguish between function and content words.
PS. I use the lemma annotator to get the words that I want to ignore.
PPS. I use pyconlp.
Function words (stop words) are often manually curated because they vary by domain. You can find a general-purpose list in NLTK; CoreNLP also ships with its own list:
from nltk.corpus import stopwords
# requires a one-time download: nltk.download('stopwords')
stops = stopwords.words('english')
However, you should still look through the list to see if it makes sense for your use case. I have recently been working with technical language, so I removed 'it' from my list because 'IT' is an acronym in that domain and thus a content word.
For your annotator, you could go with the general-purpose TokenizerAnnotator, which will split your text into "words". You can then check each word against your stopword list. If you are working with plain strings, splitting on whitespace and removing or marking stopwords is a reasonable gut check.