In our software we have to analyze a plain text file. First we should break the text into paragraphs, then into sentences, then into tokens. The final steps (as far as I understand) are stemming and lemmatization.
If we have a text like this: We are singing great songs about heroes
I would love to see the tokens as [we, be, sing, great, song, about, hero]. To achieve that, as I understand it, we need some method to find the tokens in the original text, but the hard part is stemming/lemmatizing them.
I know there are Python projects such as NLTK and spaCy that are good at these things, but we need to use C# for this project. I searched for hours but could not find any available packages for this. I can hardly believe that, so I must ask: are there any C# libraries for this, or must we somehow call the Python libraries from C#?
Trying to do NLP outside of Python is a huge pain in my experience, but there are some libraries for it, e.g. https://github.com/curiosity-ai/catalyst, which seems to support lemmatization.
Since stemming is usually just an implementation of some basic rule-based algorithm, you can also either port the code from another programming language or use an existing implementation directly, like this one: https://github.com/nemec/porter2-stemmer
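To give a feel for how little code the rule-based part needs, here is a minimal sketch of the whole pipeline (paragraphs, sentences, tokens, normalization) using only the .NET base class library. It is not Porter2 and it does not use Catalyst: the suffix rules and the `irregular` lookup table are toy assumptions I made up for illustration, and all function names are hypothetical. Note that mapping "are" to "be" is lemmatization and needs a dictionary, which no suffix-stripping stemmer will give you.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical lookup for irregular forms; a real lemmatizer needs a full dictionary.
var irregular = new Dictionary<string, string>
{
    ["am"] = "be", ["is"] = "be", ["are"] = "be", ["was"] = "be", ["were"] = "be",
};

var text = "We are singing great songs about heroes.";
var tokens = SplitParagraphs(text)
    .SelectMany(p => SplitSentences(p))
    .SelectMany(s => Tokenize(s))
    .Select(t => Normalize(t))
    .ToList();
Console.WriteLine(string.Join(",", tokens));
// prints: we,be,sing,great,song,about,hero

// Paragraphs: blocks of text separated by at least one blank line.
IEnumerable<string> SplitParagraphs(string input) =>
    Regex.Split(input.Trim(), @"\r?\n\s*\r?\n").Where(p => p.Length > 0);

// Sentences: naive split on whitespace that follows '.', '!' or '?'.
// This will mis-split abbreviations such as "e.g. this".
IEnumerable<string> SplitSentences(string paragraph) =>
    Regex.Split(paragraph, @"(?<=[.!?])\s+").Where(s => s.Length > 0);

// Tokens: lower-cased runs of ASCII letters, with an optional apostrophe part.
IEnumerable<string> Tokenize(string sentence) =>
    Regex.Matches(sentence.ToLowerInvariant(), @"[a-z]+(?:'[a-z]+)?")
         .Cast<Match>()
         .Select(m => m.Value);

// Normalization: dictionary lookup first, then a few crude suffix rules.
string Normalize(string token)
{
    if (irregular.TryGetValue(token, out var lemma)) return lemma;
    if (token.Length > 5 && token.EndsWith("ing", StringComparison.Ordinal))
        return token.Substring(0, token.Length - 3);   // singing -> sing
    if (token.Length > 5 && token.EndsWith("oes", StringComparison.Ordinal))
        return token.Substring(0, token.Length - 2);   // heroes -> hero
    if (token.Length > 3 && token.EndsWith("s", StringComparison.Ordinal)
                         && !token.EndsWith("ss", StringComparison.Ordinal))
        return token.Substring(0, token.Length - 1);   // songs -> song
    return token;
}
```

The rules break quickly on real text (for example "running" becomes "runn" and "goes" becomes "goe"), which is why swapping `Normalize` for a proper Porter2 implementation or a library lemmatizer is the better route once the splitting part works.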