Search code examples
pythonpandasdataframe

chang pandas dataframe from long to wide


Have a dataframe in following format

data = {'regions':["USA", "USA", "USA", "FRANCE", "FRANCE","FRANCE"],
        'dates':['2024-08-03', '2024-08-10', '2024-08-17','2024-08-03', '2024-08-10', '2024-08-17'], 
        'values': [3, 4, 5, 7, 8,0],
        }
df = pd.DataFrame(data)
    regions     dates   values
0   USA     2024-08-03  3
1   USA     2024-08-10  4
2   USA     2024-08-17  5
3   FRANCE  2024-08-03  7
4   FRANCE  2024-08-10  8
5   FRANCE  2024-08-17  0

Need to change this df from long to wide format. Use the most recent dates as current date, and the other two dates will be lagged dates. Expected output is like

regions       dates       values_lag2     values_lag1  values
USA           2024-08-17  3               4             5
FRANCE        2024-08-17  7               8             0

Currently I used a for loop to manually to change the format. Just wondering if there is a more elegant way to realize it. Thanks


Solution

  • If same dates per each regions is possible convert column to datetimes, pivoting, change columns names and add column with maximal dates:

    df['dates'] = pd.to_datetime(df['dates'])
    
    out = df.pivot(index='regions', columns='dates', values='values')
    out.columns = [f'values_lag{i-1}' if i!=1 else 'values' 
                   for i in range(len(out.columns), 0, -1)]
    
    out = df.groupby('regions')['dates'].max().to_frame().join(out).reset_index()
    print (out)
      regions       dates  values_lag2  values_lag1  values
    0  FRANCE  2024-08-17            7            8       0
    1     USA  2024-08-17            3            4       5
    

    Another idea if possible different datetimes and need only ordering add sorting, counter by groupby.cumcount and pivoting with helper column g:

    df['dates'] = pd.to_datetime(df['dates'])
    df = df.sort_values(['regions', 'dates'])
    
    df['g'] = df.groupby('regions').cumcount(ascending=False)
    
    out = (df.pivot(index='regions', columns='g', values='values')
             .sort_index(ascending=False, axis=1))
    out.columns=[f'values_lag{i}' if i!=0 else 'values' for i in out.columns]
    
    out = df.groupby('regions')['dates'].max().to_frame().join(out).reset_index()
    print (out)
      regions      dates  values_lag2  values_lag1  values
    0  FRANCE 2024-08-17            7            8       0
    1     USA 2024-08-17            3            4       5