Search code examples
pythonpandasnltk

Counting the Frequency of words in a pandas data frame


I have a table like below:

      URN                   Firm_Name
0  104472               R.X. Yah & Co
1  104873        Big Building Society
2  109986          St James's Society
3  114058  The Kensington Society Ltd
4  113438      MMV Oil Associates Ltd

And I want to count the frequency of all the words within the Firm_Name column, to get an output like below:

enter image description here

I have tried the following code:

import pandas as pd
import nltk
data = pd.read_csv("X:\Firm_Data.csv")
top_N = 20
word_dist = nltk.FreqDist(data['Firm_Name'])
print('All frequencies')
print('='*60)
rslt=pd.DataFrame(word_dist.most_common(top_N),columns=['Word','Frequency'])

print(rslt)
print ('='*60)

However the following code does not produce a unique word count.


Solution

  • IIUIC, use value_counts()

    In [3361]: df.Firm_Name.str.split(expand=True).stack().value_counts()
    Out[3361]:
    Society       3
    Ltd           2
    James's       1
    R.X.          1
    Yah           1
    Associates    1
    St            1
    Kensington    1
    MMV           1
    Big           1
    &             1
    The           1
    Co            1
    Oil           1
    Building      1
    dtype: int64
    

    Or,

    pd.Series(np.concatenate([x.split() for x in df.Firm_Name])).value_counts()
    

    Or,

    pd.Series(' '.join(df.Firm_Name).split()).value_counts()
    

    For top N, for example 3

    In [3379]: pd.Series(' '.join(df.Firm_Name).split()).value_counts()[:3]
    Out[3379]:
    Society    3
    Ltd        2
    James's    1
    dtype: int64
    

    Details

    In [3380]: df
    Out[3380]:
          URN                   Firm_Name
    0  104472               R.X. Yah & Co
    1  104873        Big Building Society
    2  109986          St James's Society
    3  114058  The Kensington Society Ltd
    4  113438      MMV Oil Associates Ltd