c# nlp nltk stemming lemmatization

Is English text tokenization possible in C# rather than Python?


In our software we have to analyze a plain text file. First we need to break the text into paragraphs, then into sentences, then into tokens. The final steps (as far as I understand) are stemming and lemmatization.

Given a text like "We are singing great songs about heroes", I would like to get the tokens [we, be, sing, great, song, about, hero]. As I understand it, we first need some way to find the tokens in the original text, but the hard part is stemming/lemmatizing them.

I know there are Python projects such as NLTK and spaCy that are good at these things, but we need to use C# for this project. I searched for hours but could not find any suitable packages, which I find hard to believe. So I have to ask: are there any C# libraries for this, or do we have to call the Python libraries from C# somehow?


Solution

  • In my experience, doing NLP outside of Python is a huge pain, but there are some libraries for it, e.g. https://github.com/curiosity-ai/catalyst, which seems to support lemmatization.

    Since stemming is usually just an implementation of a basic rule-based algorithm, you can also either port the code from another programming language or use an existing implementation like this one: https://github.com/nemec/porter2-stemmer
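    To give a rough idea of how little code the rule-based part needs, here is a minimal sketch using only the .NET standard library. It is not the Porter2 algorithm and not what Catalyst does: the `TextPipeline` class, its method names, the three suffix rules, and the tiny `Irregular` dictionary are all made up for illustration. The sentence splitter is naive too (it will break on abbreviations such as "e.g.").

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TextPipeline
{
    // Hypothetical mini-dictionary for irregular forms (assumption for this
    // sketch); a real lemmatizer needs a full lexicon.
    private static readonly Dictionary<string, string> Irregular =
        new Dictionary<string, string> { ["am"] = "be", ["is"] = "be", ["are"] = "be" };

    // Toy suffix rules, tried in order. NOT the Porter2 rule set.
    private static readonly string[] Suffixes = { "ing", "es", "s" };

    // Paragraphs are assumed to be separated by one or more blank lines.
    public static IEnumerable<string> SplitParagraphs(string text) =>
        Regex.Split(text.Trim(), @"\r?\n\s*\r?\n").Where(p => p.Trim().Length > 0);

    // Naive sentence split: break on whitespace that follows '.', '!' or '?'.
    public static IEnumerable<string> SplitSentences(string paragraph) =>
        Regex.Split(paragraph.Trim(), @"(?<=[.!?])\s+").Where(s => s.Length > 0);

    // Lower-case the sentence and keep runs of ASCII letters
    // (with an optional apostrophe part, e.g. "don't").
    public static IEnumerable<string> Tokenize(string sentence) =>
        Regex.Matches(sentence.ToLowerInvariant(), @"[a-z]+(?:'[a-z]+)?")
             .Cast<Match>()
             .Select(m => m.Value);

    // Dictionary lookup first, then strip the first matching suffix,
    // but only if at least three characters remain.
    public static string Normalize(string token)
    {
        if (Irregular.TryGetValue(token, out var lemma)) return lemma;
        foreach (var suffix in Suffixes)
        {
            if (token.EndsWith(suffix, StringComparison.Ordinal)
                && token.Length - suffix.Length >= 3)
            {
                return token.Substring(0, token.Length - suffix.Length);
            }
        }
        return token;
    }
}
```

    With this sketch, `string.Join(",", TextPipeline.Tokenize("We are singing great songs about heroes").Select(TextPipeline.Normalize))` gives `we,be,sing,great,song,about,hero`. The toy rules fall apart quickly on other input ("horses" becomes "hors", "string" becomes "str"), which is exactly why a proper stemmer or a dictionary-backed lemmatizer is worth using for real text.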