
Counting the most popular Hebrew words and two-word combinations in a pandas DataFrame with nltk


I have a CSV file with a column 'notes' containing satisfaction-survey answers in Hebrew.

I want to find the most popular words and the most popular two-word combinations, count how many times they appear, and plot them in a bar chart.

My code so far:

PYTHONIOENCODING = "UTF-8"  # note: as a Python assignment this has no effect; set it as an environment variable instead
import pandas as pd

df = pd.read_csv('keep.csv', encoding='utf-8', usecols=['notes'])
words = df.notes.str.split(expand=True).stack().value_counts()

This produces a list of words with counts, but it includes all the Hebrew stopwords and doesn't produce two-word-combination frequencies. I also tried the following code, and it's not what I'm looking for:

import nltk

top_N = 30
txt = df.notes.str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(txt)
word_dist = nltk.FreqDist(words)
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
print('=' * 60)

How can I use nltk to do that?


Solution

  • Use nltk.bigrams:

    Counting bigrams over all values of the column joined together:

    import nltk
    import pandas as pd

    df = pd.DataFrame({'notes':['aa bb cc','cc cc aa aa']})
    
    top_N = 3
    txt = df.notes.str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')
    words = nltk.tokenize.word_tokenize(txt)
    
    bigrm = list(nltk.bigrams(words))
    print(bigrm)
    [('aa', 'bb'), ('bb', 'cc'), ('cc', 'cc'), ('cc', 'cc'), ('cc', 'aa'), ('aa', 'aa')]
    
    word_dist = nltk.FreqDist([' '.join(x) for x in bigrm])
    rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
    print(rslt)
        Word  Frequency
    0  cc cc          2
    1  aa bb          1
    2  bb cc          1
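
The question also asks to exclude Hebrew stopwords before counting. NLTK ships no Hebrew stopword list, so here is a minimal sketch assuming you supply your own set (the placeholder tokens below stand in for real Hebrew stopwords, and a plain whitespace split stands in for word_tokenize):

```python
import nltk
import pandas as pd

# Placeholder stopword set -- NLTK has no built-in Hebrew stopword list,
# so in practice this would be your own set of Hebrew stopwords.
stopwords = {'bb'}

df = pd.DataFrame({'notes': ['aa bb cc', 'cc cc aa aa']})
txt = df.notes.str.lower().str.cat(sep=' ')

# Drop stopwords before building the bigrams.
words = [w for w in txt.split() if w not in stopwords]

word_dist = nltk.FreqDist(' '.join(b) for b in nltk.bigrams(words))
rslt = pd.DataFrame(word_dist.most_common(3), columns=['Word', 'Frequency'])
print(rslt)
```

Note that filtering before building the bigrams means two words separated only by a stopword are counted as a pair; filter the bigrams afterwards instead if you want to avoid that.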
    

    Counting bigrams separately for each value of the column:

    df = pd.DataFrame({'notes':['aa bb cc','cc cc aa aa']})
    
    top_N = 3
    f = lambda x: list(nltk.bigrams(nltk.tokenize.word_tokenize(x)))
    b = df.notes.str.lower().str.replace(r'\|', ' ', regex=True).apply(f)
    print(b)
    
    word_dist = nltk.FreqDist([' '.join(y) for x in b for y in x])
    rslt = pd.DataFrame(word_dist.most_common(top_N),
                        columns=['Word', 'Frequency'])
    print(rslt)
        Word  Frequency
    0  aa bb          1
    1  bb cc          1
    2  cc cc          1
    

    If you need to count the separate words together with the bigrams, use nltk.everygrams (note that the min/max n-gram lengths are arguments of everygrams, not of word_tokenize):

    top_N = 3
    f = lambda x: list(nltk.everygrams(nltk.tokenize.word_tokenize(x), 1, 2))
    b = df.notes.str.lower().str.replace(r'\|', ' ', regex=True).apply(f)
    print(b)
    
    word_dist = nltk.FreqDist([' '.join(y) for x in b for y in x])
    rslt = pd.DataFrame(word_dist.most_common(top_N),
                        columns=['Word', 'Frequency'])
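
As a quick sanity check of everygrams with min_len=1 and max_len=2, a sketch on the same toy DataFrame (a plain whitespace split stands in for word_tokenize so the example needs no tokenizer data):

```python
import nltk
import pandas as pd

df = pd.DataFrame({'notes': ['aa bb cc', 'cc cc aa aa']})

# everygrams(tokens, 1, 2) yields every unigram and bigram of the token list.
f = lambda tokens: list(nltk.everygrams(tokens, 1, 2))
grams = df.notes.str.lower().str.split().apply(f)

# Flatten and count unigrams and bigrams together.
word_dist = nltk.FreqDist(' '.join(g) for row in grams for g in row)
print(word_dist.most_common(3))
```

Single words and pairs now compete in the same ranking, so frequent unigrams ('aa' and 'cc' appear 3 times each here) tend to dominate the top of the list.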
    

    Finally, plot with DataFrame.plot.bar:

    rslt.plot.bar(x='Word', y='Frequency')
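
Putting the pieces together, an end-to-end sketch from raw text to bar chart (the toy DataFrame stands in for pd.read_csv('keep.csv', ...) from the question, and the empty stopword set is a placeholder for a real Hebrew list):

```python
import nltk
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt

# Stands in for: df = pd.read_csv('keep.csv', encoding='utf-8', usecols=['notes'])
df = pd.DataFrame({'notes': ['aa bb cc', 'cc cc aa aa']})

stopwords = set()  # placeholder -- supply your own Hebrew stopword list

txt = df.notes.str.lower().str.cat(sep=' ')
words = [w for w in txt.split() if w not in stopwords]

word_dist = nltk.FreqDist(' '.join(b) for b in nltk.bigrams(words))
rslt = pd.DataFrame(word_dist.most_common(30), columns=['Word', 'Frequency'])

ax = rslt.plot.bar(x='Word', y='Frequency', legend=False)
ax.set_ylabel('Count')
plt.tight_layout()
plt.savefig('bigram_counts.png')
```

This writes the chart to bigram_counts.png; in a notebook or interactive session, use plt.show() instead of savefig.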