python, scikit-learn, text-mining, feature-extraction, text-classification

Sklearn - feature extraction from text - normalize text features by merging plural and singular forms


I am doing some text classification right now using sklearn.

As a first step I need a vectorizer, either CountVectorizer or TfidfVectorizer. The issue I want to tackle is that my documents often contain both the singular and the plural form of the same word. When vectorizing, I want to 'merge' the singular and plural forms and treat them as a single text feature.

I could manually pre-process the texts and replace plural forms with singular forms for the words I know have this issue. But is there a more automated way to do it, so that words which are extremely similar to each other are merged into the same feature?

UPDATE.

Based on the answer below, what I needed was stemming. Here is sample code that stems all words in the 'review' column of a dataframe df; I then use the stemmed text for vectorization and classification. Just in case anyone finds it useful.

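from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# df is an existing pandas DataFrame with a text column 'review'.
# Split on spaces and drop empty strings; a list is used because in Python 3
# filter() returns a one-shot iterator rather than a list.
df['review_token'] = df['review'].apply(lambda x: [t for t in x.split(" ") if t])

# Stem every token.
df['review_stemmed'] = df['review_token'].apply(lambda x: [stemmer.stem(y) for y in x])

# Join the stems back into one string, ready for the vectorizer.
df['review_stemmed_sentence'] = df['review_stemmed'].apply(lambda x: " ".join(x))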

Solution

  • I think what you need is stemming: stripping the endings of words so that forms sharing a common root reduce to the same token. It is one of the basic operations in preprocessing text data.

    Stemming and lemmatization are explained here: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
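    As an alternative to stemming the text in a separate pass, the stemmer can be plugged into the vectorizer itself. The following is only a minimal sketch, assuming NLTK's SnowballStemmer and scikit-learn's CountVectorizer; the helper name stemmed_analyzer and the two sample sentences are made up for illustration. It wraps the vectorizer's default analyzer so that every token is stemmed before it is counted.

```python
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = SnowballStemmer("english")

# Reuse CountVectorizer's default preprocessing and tokenization
# (lowercasing, word-level tokens of two or more characters).
base_analyzer = CountVectorizer().build_analyzer()


def stemmed_analyzer(doc):
    """Hypothetical helper: tokenize with the default analyzer, then stem each token."""
    return [stemmer.stem(token) for token in base_analyzer(doc)]


# Made-up sample documents containing singular and plural forms.
docs = ["The cat sat on the mat", "Two cats sat on two mats"]

# Passing a callable as `analyzer` replaces the built-in tokenization step.
vectorizer = CountVectorizer(analyzer=stemmed_analyzer)
X = vectorizer.fit_transform(docs)

# Singular and plural forms now share one column in the vocabulary.
vocab = vectorizer.vocabulary_
```

    The same callable can be passed to TfidfVectorizer. Note that a stemmer only applies suffix rules, so irregular plurals (for example "mice") are generally not merged with their singular; a lemmatizer may be a better fit if that matters.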