Search code examples
pythonpandasduplicatesnoaa

Concat dataframes: give unique names to columns and drop duplicates


I am iterating through monthly weather station data. I can concatenate the files as follows:

path = r"D:\NOAA\output\TEST" 
all_files = glob.glob(path + "/*.csv")

for filename in all_files:
    print filename # prints D:\NOAA\output\TEST\189501.tave.conus.csv
    df = (pd.read_csv(f) for f in all_files)
    concatenated_df = pd.concat(df, axis=1, join='inner')

Which results in the following dataframe:

     lat      lon     temp    lat      lon     temp    lat      lon    temp
0  24.5625 -81.8125  21.06  24.5625 -81.8125  17.08  24.5625 -81.8125  22.42
1  24.5625 -81.7708  21.06  24.5625 -81.7708  17.08  24.5625 -81.7708  22.47
2  24.5625 -81.7292  21.06  24.5625 -81.7292  17.08  24.5625 -81.7292  22.47
3  24.5625 -81.6875  21.05  24.5625 -81.6875  17.04  24.5625 -81.6875  22.47
4  24.6042 -81.6458  21.06  24.6042 -81.6458  17.08  24.6042 -81.6458  22.45

The lat and lon columns are identical, so I would like to drop those duplicates columns. The temp columns are unique to each monthly CSV file. I want to keep all of them, but also give them meaningful column names, taken from the filename, ie:

        lat      lon   temp185901   temp185902   temp185903
0    24.5625  -81.8125   21.06        17.08        22.42
1    24.5625  -81.7708   21.06        17.08        22.47
2    24.5625  -81.7292   21.06        17.08        22.47
3    24.5625  -81.6875   21.05        17.04        22.47
4    24.6042  -81.6458   21.06        17.08        22.45

I am brand new to Pandas (it seems awesome, but it's a lot to absorb), I would appreciate any help. I think the solutions are in the parameters I use for either the .concat(), .duplicate(), and .loc().

Example data: ftp://ftp.commissions.leg.state.mn.us/pub/gis/Temp/NOAA/


Solution

  • You can merge two columns and set suffixes for others:

    temp = df1.merge(df2, on=['lat','lon'], suffixes=('185901','185902'))
    
           lat      lon  temp185901  temp185902
    0  24.5625 -81.8125       21.06       17.08
    1  24.5625 -81.7708       21.06       17.08
    2  24.5625 -81.7292       21.06       17.08
    3  24.5625 -81.6875       21.05       17.04
    4  24.6042 -81.6458       21.06       17.08
    

    Or in a loop

    temp.merge(df3, on=['lat','lon']).rename(columns={'temp':'temp185903'})
    
           lat      lon  temp185901  temp185902  temp185903
    0  24.5625 -81.8125       21.06       17.08       22.42
    1  24.5625 -81.7708       21.06       17.08       22.47
    2  24.5625 -81.7292       21.06       17.08       22.47
    3  24.5625 -81.6875       21.05       17.04       22.47
    4  24.6042 -81.6458       21.06       17.08       22.45
    

    df = []
    for filename in all_files:
        df1 = pd.read_csv(filename)
        # if the first loop
        if not list(df):
           df = df1
        else:
           df = df.merge(df1, on=['lat','lon'])
        df.rename(columns={'temp':'temp'+put_numer_from_filename}, inplace=True)