
Pandas trying to make values within a column into new columns after groupby on column


My original dataframe looked like:

    timestamp                     variables     value

1   2017-05-26 19:46:41.289       inf           0.000000
2   2017-05-26 20:40:41.243       tubavg        225.489639
... ... ... ...
899541  2017-05-02 20:54:41.574   caspre        684.486450
899542  2017-04-29 11:17:25.126   tvol          50.895000

Now I want to bucket this dataset by time, which can be done with the code:

    df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
    df = df.groupby(pd.Grouper(key='timestamp', freq='5min'))

But I also want all the different metrics to become columns in the new dataframe. For example the first two rows from the original dataframe would look like:

       timestamp                     inf         tubavg         caspre         tvol      ...

1      2017-05-26 19:46:41.289       0.000000    225.489639     xxxxxxx        xxxxx
... ... ... ...
xxxxx  2017-05-02 20:54:41.574       xxxxxx      xxxxxx         684.486450     50.895000

As can be seen, the time has been bucketed into 5-minute intervals, and each distinct value of `variables` becomes its own column, filled in for every bucket. Each bucket is labelled with the first timestamp that falls into it.

To solve this, I have tried a couple of different approaches, but I keep running into errors.


Solution

    1. Unstack the `variables` level from the rows to the columns with `.unstack(1)`. The argument is 1 because we want the second index level (0 would be the first).
    2. Then, drop the extra level of the column MultiIndex you just created with `.droplevel()` to make it a little cleaner.
    3. Finally, group with `pd.Grouper`. Since the date/time is now on the index, you don't need to specify a `key`.

    df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
    df = df.set_index(['timestamp','variables']).unstack(1)
    df.columns = df.columns.droplevel()
    df = df.groupby(pd.Grouper(freq='5min')).mean().reset_index()
    df
    Out[1]: 
    variables           timestamp  caspre  inf      tubavg    tvol
    0         2017-04-29 11:15:00     NaN  NaN         NaN  50.895
    1         2017-04-29 11:20:00     NaN  NaN         NaN     NaN
    2         2017-04-29 11:25:00     NaN  NaN         NaN     NaN
    3         2017-04-29 11:30:00     NaN  NaN         NaN     NaN
    4         2017-04-29 11:35:00     NaN  NaN         NaN     NaN
                          ...     ...  ...         ...     ...
    7885      2017-05-26 20:20:00     NaN  NaN         NaN     NaN
    7886      2017-05-26 20:25:00     NaN  NaN         NaN     NaN
    7887      2017-05-26 20:30:00     NaN  NaN         NaN     NaN
    7888      2017-05-26 20:35:00     NaN  NaN         NaN     NaN
    7889      2017-05-26 20:40:00     NaN  NaN  225.489639     NaN
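    The same reshape can also be expressed with `pivot_table`, which averages duplicate timestamp/variable pairs by default. A minimal sketch on a toy frame (the column names match the example above; the data itself is made up for illustration):

    ```python
    import pandas as pd

    # Toy long-format frame mimicking the structure in the question
    df = pd.DataFrame({
        "timestamp": pd.to_datetime([
            "2017-05-26 19:46:41.289",
            "2017-05-26 20:40:41.243",
            "2017-05-02 20:54:41.574",
            "2017-04-29 11:17:25.126",
        ]),
        "variables": ["inf", "tubavg", "caspre", "tvol"],
        "value": [0.0, 225.489639, 684.486450, 50.895],
    })

    # pivot_table moves each variable into its own column (mean on duplicates),
    # then Grouper buckets the DatetimeIndex into 5-minute bins
    wide = df.pivot_table(index="timestamp", columns="variables", values="value")
    wide = wide.groupby(pd.Grouper(freq="5min")).mean().reset_index()
    ```

    This produces the same dense grid of 5-minute intervals as the `.unstack` version.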
    

    Another way is to `.groupby` the `variables` column as well, and then `.unstack(1)` the result:

    df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
    df = df.groupby([pd.Grouper(freq='5min', key='timestamp'), 'variables']).mean().unstack(1)
    df.columns = df.columns.droplevel()
    df = df.reset_index()
    df
    Out[1]: 
    variables           timestamp     caspre  inf      tubavg    tvol
    0         2017-04-29 11:15:00        NaN  NaN         NaN  50.895
    1         2017-05-02 20:50:00  684.48645  NaN         NaN     NaN
    2         2017-05-26 19:45:00        NaN  0.0         NaN     NaN
    3         2017-05-26 20:40:00        NaN  NaN  225.489639     NaN
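    Note the difference between the two outputs: grouping on the index emits every 5-minute interval (mostly NaN), while grouping on two keys only emits the intervals that actually contain data. If you need the dense grid after the second approach, resampling on the timestamp should restore it. A sketch on a made-up compact result:

    ```python
    import pandas as pd

    # Compact result like the second approach produces: only non-empty buckets
    wide = pd.DataFrame({
        "timestamp": pd.to_datetime(["2017-04-29 11:15:00", "2017-04-29 11:30:00"]),
        "tvol": [50.895, 42.0],
    })

    # Re-expand to one row per 5-minute interval; empty buckets become NaN
    dense = wide.set_index("timestamp").resample("5min").mean().reset_index()
    # dense now has rows for 11:15, 11:20, 11:25 and 11:30
    ```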