Search code examples
pythonpandasdataframemergemulti-index

How to combine and pivot dataframes with different structures


I edited and reposted based on some advice I got about posting questions

I need some help combining 3 dataframes in a specific way.

So I start with 3 datasets

This is the first:

df_WL = pd.read_csv('Regional_Scale_GWL_data.csv')
df_WL

              Date           SiteNo      WL
0        8/20/1992          6203301      58
1        2/16/1993          6203301      57
2        2/23/1994          6203301      57
3       11/17/1994          6203301      58
4       11/16/1995          6203301      57
...            ...              ...     ...
784760   12/6/2017  334000000000000  258.22
784761   12/6/2017  334000000000000  258.22
784762   3/15/2018  334000000000000  258.43
784763   5/30/2018  334000000000000  258.34
784764         NaN  334000000000000     NaN

[784765 rows x 3 columns]

From that I create this dataframe:

df_WL['Date'] = pd.to_datetime(df_WL['Date'], errors='coerce')
df_WL['WL'] = pd.to_numeric(df_WL['WL'], errors='coerce')
df_WL['SiteNo'] = df_WL['SiteNo'].astype(str)
df_WL = df_WL.dropna(subset=['Date'])
df_WL = df_WL.dropna(subset=['WL'])

df_WL = df_WL.pivot_table(index='Date', columns=["SiteNo"], values=['WL']) \
    .reorder_levels([1, 0], axis=1) \
    .sort_index(axis=1)
df_WL

SiteNo     1021201 1023902  ... SA-0174 SA-0231 SM-0049 
                WL      WL  ...      WL      WL      WL 
Date                        ...                            
1970-01-01     NaN     NaN  ...     NaN     NaN     NaN    
1970-01-03     NaN     NaN  ...     NaN     NaN     NaN    
1970-01-05     NaN     NaN  ...     NaN     NaN     NaN    
1970-01-06     NaN     NaN  ...     NaN     NaN     NaN    
1970-01-07  3692.0     NaN  ...     NaN     NaN     NaN    
...            ...     ...  ...     ...     ...     ...    
2021-02-18     NaN     NaN  ...     NaN     NaN     NaN    
2021-02-19     NaN     NaN  ...     NaN     NaN     NaN    
2021-02-22     NaN     NaN  ...     NaN     NaN     NaN    
2021-02-23     NaN     NaN  ...     NaN     NaN     NaN    
2021-02-24     NaN     NaN  ...     NaN  7209.0     NaN  

[17353 rows x 863 columns]

The other two data sets that I have are:

df_precip = pd.read_csv('Regional_Scale_Precip.csv')
df_precip = df_precip.set_index('Date')
df_precip

            294957000000000  294722000000000  ...  6129203  6414105  
Date                                                                      
1981-01-01            0.000            0.000  ...    0.000    0.000       
1981-01-02            0.000            0.000  ...    0.000    0.000     
1981-01-03            0.000            0.000  ...    0.000    0.000      
1981-01-04            0.000            0.000  ...    0.000    0.000      
2017-05-27            0.000            0.000  ...    0.000    0.000       
...                     ...              ...  ...      ...      ...     
2017-05-22           13.529           15.883  ...   19.788   45.493      
2017-05-23           16.181           28.589  ...   36.448    8.722      
2017-05-24           13.189           16.917  ...   15.643   14.794     
2017-05-25            0.000            0.000  ...    0.000    0.000     
2017-05-26            0.000            0.000  ...    0.000    0.000    

[13295 rows x 1331 columns]

and

df_temp = pd.read_csv('Regional_Scale_Temp.csv')

df_temp = df_temp.set_index('Date')
df_temp

            6131901  6129203  ...  6414105  6155707
Date                                                                                 
1981-01-01    8.965    8.733  ...    9.117    9.118   
1981-01-02    6.654    6.614  ...    7.834    7.195   
1981-01-03    4.794    4.796  ...    4.826    4.880   
1981-01-04    7.582    7.752  ...    8.380    8.018   
2009-08-25   22.438   22.129  ...   23.607   22.702   
...             ...      ...  ...      ...      ...   
2009-08-20   27.354   27.177  ...   28.498   28.055   
2009-08-21   26.706   26.397  ...   28.671   27.479   
2009-08-22   25.126   24.778  ...   26.644   25.600   
2009-08-23   22.001   21.835  ...   23.803   22.543   
2009-08-24   21.626   21.422  ...   23.257   22.160  

[10463 rows x 1331 columns]

my goal is to create a dataframe that looks like this (I made this in excel with arbitrary values to illustrate my goal):

enter image description here

I've attempted to unstack the two last dataframes and combine them with the first one to create a one with 'Date', 'SiteNo', 'WL', 'Temp', and 'Precip' columns. Then pivot them to get what I am aiming for but that was a huge mess.

Any help would be appreciated. Thanks!


Solution

    1. All the DataFrames created from the files should be converted to a standard long form, with a 'Date', 'Site', 'Values', and 'Type' column.
    2. Use pandas.concat to combine all of the DataFrames
    3. Use pandas.DataFrame.pivot to achieve the desired form

    Load and Clean

    import pandas as pd
    
    # load the data
    wl = pd.read_csv('wl.csv', parse_dates=['Date'])
    pre = pd.read_csv('precip.csv', index_col='Date', parse_dates=['Date'])
    temp = pd.read_csv('temp.csv', index_col='Date', parse_dates=['Date'])
    
    # wl is already in a long form so clean the column names
    wl.rename({'SiteNo': 'Site', 'WL': 'Values'}, axis=1, inplace=True)
    
    # stack the other two dataframes into a long form
    pre = pre.stack().reset_index(name='Values').rename({'level_1': 'Site'}, axis=1)
    temp = temp.stack().reset_index(name='Values').rename({'level_1': 'Site'}, axis=1)
    
    # add a Type column
    wl['Type'] = 'WL'
    pre['Type'] = 'Precip'
    temp['Type'] = 'Temp'
    
    # sample of wl
            Date     Site  Values Type
    0 1992-08-20  6203301    58.0   WL
    1 1993-02-16  6203301    57.0   WL
    2 1994-02-23  6203301    57.0   WL
    3 1994-11-17  6203301    58.0   WL
    4 1995-11-16  6203301    57.0   WL
    

    Combine and Pivot

    • The provided sample data doesn't align well, so all 'Site' columns don't have all 'Type' columns in the pivoted DataFrame
    # combine the DataFrames
    df = pd.concat([wl, pre, temp])
    
    # drop duplicate rows - there shouldn't be any, but the sample data did
    df.drop_duplicates(inplace=True)
    
    # sort the values - not strictly necessary
    df = df.sort_values(['Date', 'Site', 'Type']).reset_index(drop=True)
    
    # dropna
    df.dropna(subset=['Date'], inplace=True)
    
    # pivot
    dfp = df.pivot(index='Date', columns=['Site', 'Type'], values='Values')
    
    # display(dfp.head())
    Site       294722000000000 294957000000000 6129203        6131901 6155707 6414105        6203301 334000000000000
    Type                Precip          Precip  Precip   Temp    Temp    Temp  Precip   Temp      WL              WL
    Date                                                                                                            
    1981-01-01             0.0             0.0     0.0  8.733   8.965   9.118     0.0  9.117     NaN             NaN
    1981-01-02             0.0             0.0     0.0  6.614   6.654   7.195     0.0  7.834     NaN             NaN
    1981-01-03             0.0             0.0     0.0  4.796   4.794   4.880     0.0  4.826     NaN             NaN
    1981-01-04             0.0             0.0     0.0  7.752   7.582   8.018     0.0  8.380     NaN             NaN
    1992-08-20             NaN             NaN     NaN    NaN     NaN     NaN     NaN    NaN    58.0             NaN