python, pandas, group-by

Combining rows in Pandas to one long row


I have the following sample DF

import pandas as pd
import numpy as np

# Create a range of hourly timestamps for 100 consecutive hours, starting from the current hour
timestamps = pd.date_range(start=pd.Timestamp.now().floor('H'), periods=100, freq='H')

# Create a DataFrame with 100 rows and 3 columns
df = pd.DataFrame({'timestamp': timestamps,
                   # 'value1': np.random.randn(100),
                   # 'value2': np.random.randint(0, 10, 100)})
                   'value1': 'abc', 'value2': 'def'})

# Group the rows by day and apply a custom function that concatenates values
df = df.groupby(df['timestamp'].dt.date).agg({'timestamp': 'first',
                                               'value1': lambda x: ', '.join(x),
                                               'value2': lambda x: ', '.join(x),
                                              }).reset_index(drop=True)
print(df.head())

I've commented out the random value1 and value2 when creating the DataFrame so I can try this with strings first; eventually I need it to work with floats. Right now it combines each day's strings into one long comma-separated string per column, which I guess is one tiny step closer to what I want.

What I'm trying to achieve is a DataFrame that combines all rows of one day into one massive row, with renamed columns, so the columns should look something like this: timestamp, value1_00, value2_00, value1_01, value2_01, ... , value2_23

I've tried a little, but because I'm quite new to pandas I'm not entirely sure where to begin. What would be a good way to tackle this problem? One thing I'm considering is a little for loop to create the new columns, since that can be as simple as:

for i in range(0,24):
    if i < 10:
        new_column1 = 'value1_0' + str(i)
        new_column2 = 'value2_0' + str(i)
        new_columns = [new_column1, new_column2]
        df[new_columns] = np.nan
    else:
        new_column1 = 'value1_' + str(i)
        new_column2 = 'value2_' + str(i)
        new_columns = [new_column1, new_column2]
        df[new_columns] = np.nan

But how would I add the correct value to the correct new column? I'm sure it can be done by creating a couple of functions, but surely there's a more convenient way!


Solution

  • IIUC, you can use pivot:

    df = pd.DataFrame({'timestamp': timestamps, 'value1': 'abc', 'value2': 'def'})

    # Add helper columns for the date and hour, then pivot so each hour
    # becomes its own column for both value1 and value2
    out = (df.assign(date=df['timestamp'].dt.date, hour=df['timestamp'].dt.hour)
             .pivot(index='date', columns='hour', values=['value1', 'value2']))

    # Flatten the (value, hour) MultiIndex columns into names
    # like value1_00, value1_01, ..., value2_23
    out.columns = [f'{col[0]}_{col[1]:02d}' for col in out.columns]

    out = out.reset_index()


    Output:

    print(out)
    
             date value1_00 value1_01  ... value2_21 value2_22 value2_23
    0  2023-04-14       NaN       NaN  ...       NaN       NaN       def
    1  2023-04-15       abc       abc  ...       def       def       def
    2  2023-04-16       abc       abc  ...       def       def       def
    3  2023-04-17       abc       abc  ...       def       def       def
    4  2023-04-18       abc       abc  ...       def       def       def
    5  2023-04-19       abc       abc  ...       NaN       NaN       NaN
    
    [6 rows x 49 columns]
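
  • A note on the eventual float case: the same pivot works unchanged with numeric
    columns, since no string joining is involved. A minimal sketch, assuming the
    random value1/value2 columns from the commented-out lines in the question:

    import pandas as pd
    import numpy as np

    timestamps = pd.date_range(start=pd.Timestamp.now().floor('H'), periods=100, freq='H')

    # Numeric columns instead of the 'abc'/'def' strings
    df = pd.DataFrame({'timestamp': timestamps,
                       'value1': np.random.randn(100),
                       'value2': np.random.randint(0, 10, 100)})

    # Same pivot as above: one row per date, one column per (value, hour) pair
    out = (df.assign(date=df['timestamp'].dt.date, hour=df['timestamp'].dt.hour)
             .pivot(index='date', columns='hour', values=['value1', 'value2']))
    out.columns = [f'{col[0]}_{col[1]:02d}' for col in out.columns]
    out = out.reset_index()

    print(out.shape)  # (number of dates, 49): date + 24 hours x 2 value columns

    If a given date/hour combination could ever occur more than once, pivot raises an
    error; pivot_table with an aggregation function (e.g. 'mean') would be the usual
    workaround in that case.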