Search code examples
pandasgroup-bycount

Counting Distinct words AND average time in Pandas


I'm working on analysing some text from a Twitter API using pandas. This will eventually be visualized.

For reference df.head() of my dataset is:

Count   User    Time    Tweet
0   0   x   2022    ✔️Nécessité de maintien d’une filière 🇪🇺 dynam...
1   1   x   2022    Échanges approfondis à #Dakar avec le Premier ...
2   2   x   2022    ✔️Approvisionnement en #céréales & #engrai...
3   3   x   2022    Aujourd’hui à Tambacounda, à l’Est du Sénégal,...
4   4   x   2022    Working hard since 2019 to reinforce EU #auton...

I'm looking to return the distinct word count with the average time of the tweet where the word was used in.

Right now, I've been getting the distinct word count of my dataset using df.Tweet.str.split(expand=True).stack().value_counts().

This is useful, returning:

the                        1505
de                         1500
to                         1168
RT                          931
of                          906
                           ... 
africain,                     1
langue                        1
Félicitations!                1

Length: 18071, dtype: int64

However, I want to also analyse text usage over time.

I'm not super experienced so I'm wondering if there is a way to use a function such as df.groupby() to sort this result by time? Or, is there a way to modify my original function to add a column to my results that includes average time?


Solution

  • I would use str.extractall to get the words, join the Time, then perform a groupby.value_counts to get the count per Year:

    out = (df['Tweet']
     .str.extractall('(\S+)')
     .droplevel('match')
     .join(df['Time'])
     .groupby('Time')[0].value_counts()
    )
    

    NB. if you want to exclude non-letters/digits from the words, use (\w+) in place of (\S+).

    Output:

    Time  0                  
    2022  à                      3
          #Dakar                 1
          #auton...              1
          #céréales              1
          #engrai...             1
          &                  1
          ...                    1
    ...