i have a Dataframe that has date time as index and tweets in a different column as well as other stats like number of likes. I would like to resample the df with an hourly interval so that i would get all tweets and the sum of all stats per hour, which i have done with the following code:
df.resample('60min').sum()
The problem is that my tweets column disappears.. And i need it for a sentiment analysis. I'm new to programming so thanks in advance for reading this!
IIUC you will groupby
and use agg
import numpy as np
import pandas as pd
# sample data
np.random.seed(1)
df = pd.DataFrame(np.transpose([np.random.randint(1,10, 1489), ['abc']*1489]),
index=pd.date_range('2020-01-01', '2020-02-01', freq='30T'),
columns=['num', 'tweet'])
# groupby the index floored to hour, sum the num col
# and join the tweets with a semi-colon or what ever you want
df.groupby(df.index.floor('H')).agg({'num': sum, 'tweet': '; '.join})
num tweet
2020-01-01 00:00:00 69 abc; abc
2020-01-01 01:00:00 61 abc; abc
2020-01-01 02:00:00 12 abc; abc
2020-01-01 03:00:00 87 abc; abc
2020-01-01 04:00:00 35 abc; abc
Or if you just want to join the strings as is then sum everything:
df.groupby(df.index.floor('H')).agg(sum)
num tweet
2020-01-01 00:00:00 69 abcabc
2020-01-01 01:00:00 61 abcabc
2020-01-01 02:00:00 12 abcabc
2020-01-01 03:00:00 87 abcabc
2020-01-01 04:00:00 35 abcabc