I have a csv data file containing column 'notes' with satisfaction answers in Hebrew.
I want to find the most popular words and popular '2 words combination', the number of times they show up and plotting them in a bar chart.
My code so far:
# PYTHONIOENCODING="UTF-8" is set in the shell environment
import pandas as pd

df = pd.read_csv('keep.csv', encoding='utf-8', usecols=['notes'])
words = df.notes.str.split(expand=True).stack().value_counts()
This produces a list of the words with counts, but it includes all the Hebrew stopwords and doesn't produce two-word-combination frequencies. I also tried this code, but it's not what I'm looking for:
import nltk

top_N = 30

txt = df.notes.str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(txt)
word_dist = nltk.FreqDist(words)

rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
print('=' * 60)
How can I use nltk to do that?
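On the stopword part of the question: NLTK's stopwords corpus ships no Hebrew list, so you have to filter with your own set before counting. A minimal standard-library sketch, with a hypothetical three-word stopword set and made-up sample notes:

```python
from collections import Counter

# Hypothetical stopword set -- NLTK's stopwords corpus has no Hebrew list,
# so supply your own (these are common Hebrew function words)
hebrew_stopwords = {'של', 'את', 'זה'}

notes = ['השירות של החברה היה מצוין', 'השירות היה איטי מאוד']

# Drop stopwords while tokenizing, then count what remains
tokens = [w for note in notes for w in note.split()
          if w not in hebrew_stopwords]
word_counts = Counter(tokens)
print(word_counts.most_common(3))
```

The same `if w not in hebrew_stopwords` filter can be applied to the token list in any of the solutions below before building the frequency distribution.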
Use nltk.util.bigrams:
Solution for counting bigrams across all values:
df = pd.DataFrame({'notes':['aa bb cc','cc cc aa aa']})
top_N = 3
txt = df.notes.str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(txt)

bigrm = list(nltk.bigrams(words))
print(bigrm)
[('aa', 'bb'), ('bb', 'cc'), ('cc', 'cc'), ('cc', 'cc'), ('cc', 'aa'), ('aa', 'aa')]
word_dist = nltk.FreqDist([' '.join(x) for x in bigrm])
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
Word Frequency
0 cc cc 2
1 aa bb 1
2 bb cc 1
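For comparison, the same bigram counts can be reproduced with only the standard library by pairing each token with its successor, a sketch using the same sample values as above:

```python
from collections import Counter

# All tokens from both rows joined, as in the solution above
words = 'aa bb cc cc cc aa aa'.split()

# zip the token list against itself shifted by one to get bigrams
bigrams = zip(words, words[1:])
bigram_counts = Counter(' '.join(pair) for pair in bigrams)
print(bigram_counts.most_common(3))
# [('cc cc', 2), ('aa bb', 1), ('bb cc', 1)]
```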
Solution for bigrams per each split value of the column:
df = pd.DataFrame({'notes':['aa bb cc','cc cc aa aa']})
top_N = 3
f = lambda x: list(nltk.bigrams(nltk.tokenize.word_tokenize(x)))
b = df.notes.str.lower().str.replace(r'\|', ' ', regex=True).apply(f)
print(b)
word_dist = nltk.FreqDist([' '.join(y) for x in b for y in x])
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
Word Frequency
0 aa bb 1
1 bb cc 1
2 cc cc 1
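Note why the counts differ between the two solutions: the first joins all rows into one string, so it also counts the ('cc', 'cc') pair that spans the boundary between the two rows; the per-row version never pairs tokens across rows. A standard-library sketch of the per-row behaviour:

```python
from collections import Counter

rows = ['aa bb cc', 'cc cc aa aa']

# Per-row bigrams: no pair spans a row boundary
per_row = Counter(
    ' '.join(pair)
    for row in rows
    for pair in zip(row.split(), row.split()[1:])
)
print(per_row['cc cc'])  # 1 -- only the pair inside the second row
```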
If you need to count unigrams (single words) together with bigrams:
top_N = 3
f = lambda x: list(nltk.everygrams(nltk.tokenize.word_tokenize(x), 1, 2))
b = df.notes.str.lower().str.replace(r'\|', ' ', regex=True).apply(f)
print(b)
word_dist = nltk.FreqDist([' '.join(y) for x in b for y in x])
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
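What `nltk.everygrams(words, 1, 2)` produces can be sketched with the standard library alone, sliding windows of length 1 and 2 over the token list:

```python
words = 'aa bb cc'.split()

# All 1-grams and 2-grams, like nltk.everygrams(words, 1, 2)
grams = [' '.join(words[i:i + n])
         for n in (1, 2)
         for i in range(len(words) - n + 1)]
print(grams)  # ['aa', 'bb', 'cc', 'aa bb', 'bb cc']
```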
And last, plot with DataFrame.plot.bar:
rslt.plot.bar(x='Word', y='Frequency')