I'm working on analysing some text from a Twitter API using pandas. This will eventually be visualized.
For reference
df.head()
of my dataset
is:
Count User Time Tweet
0 0 x 2022 ✔️Nécessité de maintien d’une filière 🇪🇺 dynam...
1 1 x 2022 Échanges approfondis à #Dakar avec le Premier ...
2 2 x 2022 ✔️Approvisionnement en #céréales & #engrai...
3 3 x 2022 Aujourd’hui à Tambacounda, à l’Est du Sénégal,...
4 4 x 2022 Working hard since 2019 to reinforce EU #auton...
I'm looking to return the distinct word count with the average time of the tweet where the word was used in.
Right now, I've been getting the distinct word count of my dataset using df.Tweet.str.split(expand=True).stack().value_counts()
.
This is useful, returning:
the 1505
de 1500
to 1168
RT 931
of 906
...
africain, 1
langue 1
Félicitations! 1
Length: 18071, dtype: int64
However, I want to also analyse text usage over time.
I'm not super experienced so I'm wondering if there is a way to use a function such as df.groupby() to sort this result by time? Or, is there a way to modify my original function to add a column to my results that includes average time?
I would use str.extractall
to get the words, join
the Time, then perform a groupby.value_counts
to get the count per Year:
out = (df['Tweet']
.str.extractall('(\S+)')
.droplevel('match')
.join(df['Time'])
.groupby('Time')[0].value_counts()
)
NB. if you want to exclude non-letters/digits from the words, use (\w+)
in place of (\S+)
.
Output:
Time 0
2022 à 3
#Dakar 1
#auton... 1
#céréales 1
#engrai... 1
& 1
... 1
...