Search code examples
pythonpandaspivotreshapegroup-summaries

Reshaping data with dates as column values


I am trying to reshape data using pandas and have been having a hard time getting it into the right format. Roughly, the data look like this*:

df = pd.DataFrame({'PRODUCT':['1','2'],
          'DESIGN_START':[pd.Timestamp('2020-01-05'),pd.Timestamp('2020-01-17')],
          'DESIGN_COMPLETE':[pd.Timestamp('2020-01-22'),pd.Timestamp('2020-03-04')],
          'PRODUCTION_START':[pd.Timestamp('2020-02-07'),pd.Timestamp('2020-03-15')],
          'PRODUCTION_COMPLETE':[np.nan,pd.Timestamp('2020-04-28')]})
print(df)

  PRODUCT DESIGN_START DESIGN_COMPLETE PRODUCTION_START PRODUCTION_COMPLETE
0       1   2020-01-05      2020-01-22       2020-02-07                 NaT
1       2   2020-01-17      2020-03-04       2020-03-15          2020-04-28

I would like to reshape the data so that it looks like this:

reshaped_df = pd.DataFrame({'DATE':[pd.Timestamp('2020-01-05'),pd.Timestamp('2020-01-17'),
                          pd.Timestamp('2020-01-22'),pd.Timestamp('2020-03-04'),
                          pd.Timestamp('2020-02-07'),pd.Timestamp('2020-03-15'),
                          np.nan,pd.Timestamp('2020-04-28')],
                  'STAGE':['design','design','design','design','production','production','production','production'],
                  'STATUS':['started','started','completed','completed','started','started','completed','completed']})

print(reshaped_df)

        DATE       STAGE     STATUS
0 2020-01-05      design    started
1 2020-01-17      design    started
2 2020-01-22      design  completed
3 2020-03-04      design  completed
4 2020-02-07  production    started
5 2020-03-15  production    started
6        NaT  production  completed
7 2020-04-28  production  completed

How can I go about doing this? Is there a better format to reshape it to?

Ultimately I'd like to do some group summaries on the data, such as the number of times each step occurred, e.g.

reshaped_df.groupby(['STAGE','STATUS'])['DATE'].count()

STAGE       STATUS   
design      completed    2
            started      2
production  completed    1
            started      2
Name: DATE, dtype: int64

Thank you

  • The data actually contain many date start/stop columns for different stages of the manufacturing pipeline

Solution

  • MELT IT!!!

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame({
        'PRODUCT':['1','2'],
        'DESIGN_START':[pd.Timestamp('2020-01-05'),pd.Timestamp('2020-01-17')],
        'DESIGN_COMPLETE':[pd.Timestamp('2020-01-22'),pd.Timestamp('2020-03-04')],
        'PRODUCTION_START':[pd.Timestamp('2020-02-07'),pd.Timestamp('2020-03-15')],
        'PRODUCTION_COMPLETE':[np.nan,pd.Timestamp('2020-04-28')]
    })
    
    df = df.melt(id_vars=['PRODUCT'])
    df_split = df['variable'].str.split('_', n=1, expand=True)
    df['STAGE'] = df_split[0]
    df['STATUS'] = df_split[1]
    df.drop(columns=['variable'], inplace=True)
    df = df.rename(columns={'value': 'DATE'})
    
    print(df)
    

    Output:

      PRODUCT       DATE       STAGE    STATUS
    0       1 2020-01-05      DESIGN     START
    1       2 2020-01-17      DESIGN     START
    2       1 2020-01-22      DESIGN  COMPLETE
    3       2 2020-03-04      DESIGN  COMPLETE
    4       1 2020-02-07  PRODUCTION     START
    5       2 2020-03-15  PRODUCTION     START
    6       1        NaT  PRODUCTION  COMPLETE
    7       2 2020-04-28  PRODUCTION  COMPLETE
    

    MWAHAHAHAHAHAHA!!! FEEL THE POWER OF THE MELT!!!

    Melt is basically unpivot