Search code examples
pythonpandastweepy

How can i hourly resample a dataframe that has a column of tweets in it? (I would like to concatenate all tweets per hour)


i have a Dataframe that has date time as index and tweets in a different column as well as other stats like number of likes. I would like to resample the df with an hourly interval so that i would get all tweets and the sum of all stats per hour, which i have done with the following code:

df.resample('60min').sum()

The problem is that my tweets column disappears.. And i need it for a sentiment analysis. I'm new to programming so thanks in advance for reading this!


Solution

  • IIUC you will groupby and use agg

    import numpy as np
    import pandas as pd
    # sample data
    np.random.seed(1)
    df = pd.DataFrame(np.transpose([np.random.randint(1,10, 1489), ['abc']*1489]),
                      index=pd.date_range('2020-01-01', '2020-02-01', freq='30T'),
                      columns=['num', 'tweet'])
    
    # groupby the index floored to hour, sum the num col 
    # and join the tweets with a semi-colon or what ever you want
    df.groupby(df.index.floor('H')).agg({'num': sum, 'tweet': '; '.join})
    
                        num     tweet
    2020-01-01 00:00:00  69  abc; abc
    2020-01-01 01:00:00  61  abc; abc
    2020-01-01 02:00:00  12  abc; abc
    2020-01-01 03:00:00  87  abc; abc
    2020-01-01 04:00:00  35  abc; abc
    

    Or if you just want to join the strings as is then sum everything:

    df.groupby(df.index.floor('H')).agg(sum)
    
                        num   tweet
    2020-01-01 00:00:00  69  abcabc
    2020-01-01 01:00:00  61  abcabc
    2020-01-01 02:00:00  12  abcabc
    2020-01-01 03:00:00  87  abcabc
    2020-01-01 04:00:00  35  abcabc