Count the number of occurrences of each word in a file and load into pandas

How do I count the number of occurrences of each word in a .txt file and load the result into a pandas DataFrame with the columns name and count, sorted by the count column?

Solution

Use nltk to tokenize the text and count the words, then pass the counts to pandas:

# pip install nltk
from nltk.tokenize import RegexpTokenizer
from nltk import FreqDist
import pandas as pd

text = """How do I count the number of occurrences of each word in a .txt file and also load it into the pandas dataframe with columns name and count, also sort the dataframe on column count?"""

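# Split on runs of word characters, which drops punctuation such as "." and ","
tokenizer = RegexpTokenizer(r'\w+')
words = tokenizer.tokenize(text)

# FreqDist is a Counter subclass, so pandas can build a Series from it directly
sr = pd.Series(FreqDist(words))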

Output:

>>> sr
How            1
do             1
I              1
count          3
the            3
number         1
of             2
occurrences    1
each           1
word           1
in             1
a              1
txt            1
file           1
and            2
also           2
load           1
it             1
into           1
pandas         1
dataframe      2
with           1
columns        1
name           1
sort           1
on             1
column         1
dtype: int64
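
The Series above holds the counts, but the question also asks for a DataFrame with name and count columns, sorted by count, built from a file. A minimal sketch of those remaining steps follows. It swaps nltk for the standard library (re.findall with the same \w+ pattern, and collections.Counter), which should give the same tokens for plain text like this. The function names word_counts and word_counts_from_file, and the UTF-8 encoding, are assumptions made for illustration.

```python
import re
from collections import Counter

import pandas as pd


def word_counts(text: str) -> pd.DataFrame:
    """Return a DataFrame with columns name and count, highest count first."""
    # Same pattern as the RegexpTokenizer above: runs of word characters
    words = re.findall(r"\w+", text)
    counts = Counter(words)
    df = pd.DataFrame(list(counts.items()), columns=["name", "count"])
    # Sort by count, descending, and renumber the rows
    return df.sort_values("count", ascending=False).reset_index(drop=True)


def word_counts_from_file(path: str, encoding: str = "utf-8") -> pd.DataFrame:
    """Read a text file (encoding is an assumption) and count its words."""
    with open(path, encoding=encoding) as f:
        return word_counts(f.read())


if __name__ == "__main__":
    print(word_counts("count the words, then count the counts"))
```

Counting is case-sensitive here, as in the nltk version, so "How" and "how" are separate rows; call text.lower() before tokenizing if they should be merged. To start from the existing Series instead, sr.rename_axis("name").reset_index(name="count") followed by sort_values("count", ascending=False) should produce the same layout.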