
Group nltk.FreqDist output by first word (python)


I'm an amateur with basic coding skills in Python. I'm working on a data frame that has a column like the one below. The intent is to group the output of nltk.FreqDist by the first word of each entry.

Column in Dataframe

What I have so far

t_words = df_tech['message']
data_analysis = nltk.FreqDist(t_words)

# Keep only the entries whose text is longer than 3 characters.
# Note: len(m) tests the length of the key, not its frequency; use `n > 3` to filter on frequency.
filter_words = {m: n for m, n in data_analysis.items() if len(m) > 3}

for key in sorted(filter_words):
    print("%s: %s" % (key, filter_words[key]))

Sample current output:
click full refund showing currently viewed rr number: 1
click go: 1
click post refund: 1
click refresh like  replace tokens sending: 1
click refund: 1
click refund order: 1
click resend email confirmation: 1
click responsible party: 1
click send right: 1
click tick mark right: 1

I have 10000+ rows in my output.
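
The condition `len(m) > 3` in the snippet above tests the length of each key rather than its count. Below is a minimal sketch of the difference. It uses `collections.Counter` as a stand-in for `nltk.FreqDist` (which subclasses `Counter`), and the messages are made up for illustration.

```python
from collections import Counter

# Stand-in for nltk.FreqDist; these messages are hypothetical sample data.
messages = ["click go", "click go", "click go", "click go", "ok", "click refund"]
counts = Counter(messages)

# Filter on the length of the key text (what `len(m) > 3` does).
by_length = {m: n for m, n in counts.items() if len(m) > 3}

# Filter on the frequency itself (entries seen more than 3 times).
by_frequency = {m: n for m, n in counts.items() if n > 3}

print(by_length)
print(by_frequency)
```

With this sample data, the length filter drops only the short entry "ok", while the frequency filter keeps only the entry that occurs more than three times.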

My Expected Output

I would like to group the output by the first word and extract it as a dataframe.

First Word as Header

What I have tried among other solutions

I have tried adapting the solutions given here and here, but without satisfactory results.

Any help/guidance appreciated.


Solution

  • I managed to do it as shown below. There may be a simpler implementation, but for now this gives me what I expected.

    import pandas as pd

    temp = pd.DataFrame(sorted(data_analysis.items()), columns=['word', 'frequency'])
    temp['word'] = temp['word'].str.strip()
    
    # Remove empty rows; .copy() avoids SettingWithCopyWarning on the assignments below
    dfNew = temp[temp['word'] != ''].copy()
    
    # First word of each phrase
    dfNew['first_word'] = dfNew['word'].str.split().str.get(0)
    # Rest of the phrase without the first word (NaN for one-word phrases)
    dfNew['rest_words'] = dfNew['word'].str.split(n=1).str[1]
    # Keep only the required columns
    dfNew = dfNew[['first_word', 'rest_words']]
    # Group by first word, collecting the remaining words into a list
    dfNew = dfNew.groupby('first_word').agg(lambda x: x.tolist()).reset_index()
    # Transpose (displayed in a notebook; assign the result to keep it)
    dfNew.T
    

    Sample Output

    One Column
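
For comparison, the same grouping step can be sketched with the standard library alone. This is not the method above, only a minimal illustration; the function name `group_by_first_word` and the sample phrases are hypothetical.

```python
from collections import defaultdict

def group_by_first_word(phrases):
    """Map each first word to the list of remainders of the phrases starting with it."""
    grouped = defaultdict(list)
    for phrase in phrases:
        # Split once on whitespace: ["first", "rest of the phrase"]
        parts = phrase.split(None, 1)
        if not parts:
            continue  # skip empty or whitespace-only entries
        first = parts[0]
        rest = parts[1].strip() if len(parts) > 1 else ""
        grouped[first].append(rest)
    return dict(grouped)

# Hypothetical sample phrases, similar in shape to the FreqDist keys.
sample = ["click go", "click refund order", "  ", "open ticket", "click"]
grouped = group_by_first_word(sample)
print(grouped)
```

Assuming pandas is available, passing the result to `pd.DataFrame.from_dict(grouped, orient='index').T` should give one column per first word, with shorter columns padded by missing values.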