python pandas text-mining stemming porter-stemmer

How to perform stemming and drop columns in a pandas DataFrame in Python?


Below is a subset of my dataset. I am trying to clean it using the Porter stemmer from the nltk package. I would like to drop columns whose names share the same stem; for example, 'abandon', 'abandoned' and 'abandoning' should be represented by a single column (e.g. just 'abandoned') in my dataset. With the code below I can see the words/columns being stemmed, but I am not sure how to drop the redundant columns. I have already tokenized the corpus and removed punctuation.

Note: I am new to Python and text mining.

Dataset Subset

{
   'aaaahhhs':{
      0:0,
      1:0,
      2:0,
      3:0,
      4:0,
      5:0
   },
   'aahs':{
      0:0,
      1:0,
      2:0,
      3:0,
      4:0,
      5:0
   },
   'aamir':{
      0:0,
      1:0,
      2:0,
      3:0,
      4:0,
      5:0
   },
   'aardman':{
      0:0,
      1:0,
      2:0,
      3:0,
      4:0,
      5:0
   },
   'aaron':{
      0:0,
      1:0,
      2:0,
      3:0,
      4:0,
      5:0
   },
   'abandon':{
      0:0,
      1:0,
      2:0,
      3:0,
      4:0,
      5:0
   },
   'abandoned':{
      0:0,
      1:0,
      2:0,
      3:0,
      4:0,
      5:0
   },
   'abandoning':{
      0:0,
      1:0,
      2:0,
      3:0,
      4:0,
      5:0
   },
   'abandonment':{
      0:0,
      1:0,
      2:0,
      3:0,
      4:0,
      5:0
   },
   'abandons':{
      0:0,
      1:0,
      2:0,
      3:0,
      4:0,
      5:0
   }
}

Code so far:

from nltk.stem import PorterStemmer

ps = PorterStemmer()
# Print the Porter stem of every column name (each column is one word)
for w in clean_df.columns:
    print(ps.stem(w))

Solution

  • I think something like this does what you want:

    import collections

    # `ps` and `clean_df` are the stemmer and the DataFrame from the question.
    # Build the associations between stems and column names:
    stems = collections.defaultdict(list)
    for column_name in clean_df.columns:
        stems[ps.stem(column_name)].append(column_name)

    # For each stem, keep the first column name in lexicographical order:
    new_columns = [min(columns) for columns in stems.values()]

    # Create the new `DataFrame`, which contains only the selected columns:
    new_df = clean_df[new_columns]
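
For reference, here is a minimal self-contained sketch of the same idea that can be run on its own. The sample `clean_df` and its non-zero counts are made up for illustration (the subset in the question contains only zeros). `merged_df` is not part of the answer above; it is an assumed alternative for the case where the counts of the dropped columns should be added together instead of discarded.

```python
import collections

import pandas as pd
from nltk.stem import PorterStemmer

# Hypothetical sample data: a small term-count matrix shaped like the
# subset in the question, with made-up non-zero counts.
clean_df = pd.DataFrame({
    "aaron":      [1, 0, 0],
    "abandon":    [0, 1, 0],
    "abandoned":  [2, 0, 0],
    "abandoning": [0, 0, 1],
    "abandons":   [0, 1, 1],
})

ps = PorterStemmer()

# Group the column names by their Porter stem.
stems = collections.defaultdict(list)
for column_name in clean_df.columns:
    stems[ps.stem(column_name)].append(column_name)

# Option 1 (as in the answer): keep one representative column per stem,
# here the lexicographically smallest name.
new_columns = [min(columns) for columns in stems.values()]
new_df = clean_df[new_columns]

# Option 2 (an assumption about what may be wanted): sum the counts of all
# columns that share a stem, so that no occurrences are lost.
merged_df = pd.DataFrame(
    {stem: clean_df[columns].sum(axis=1) for stem, columns in stems.items()}
)

print(new_df.columns.tolist())
print(merged_df)
```

Note that option 1 keeps only the counts of the chosen column, so occurrences of 'abandoned' or 'abandons' disappear from `new_df`. If the columns are word counts, summing them as in option 2 may be closer to what a stemmed vocabulary is meant to represent.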