Search code examples
pythonpandasdataframetime-seriesdata-processing

How can I merge time series data from 2 different csv


I have 2 csv files, the first one contains timestamp of event. I want to add weather details from the 2nd csv file.

File 1 (irregular time stamps)

Time                        UserID_Detected             
2019-01-01 01:00:32          3                     
2019-01-02 06:02:12          2               
2019-01-02 14:45:32          7                    

File 2 (hourly weather)

Time                        Temperature  Humidity
1/1/2019   0:00             32            55        
1/1/2019   1:00             33            60  
1/1/2019   2:00             33            53    
.............
1/2/2019   6:00             35            58
.............
1/2/2019   15:00            38            62

I want to attach the weather variable to the first file. Since the first file contains minutes and second info, I will round to nearest hour. Expected output:

File 1 (final output)

Time                        UserID_Detected      Temperature     Humidity       
2019-01-01 01:00:32          3                     33             60
2019-01-02 06:02:12          2                     35             58
2019-01-02 14:45:32          7                     38             62          

Tried so far:

df_event = pd.read_csv("df1.csv") 
df_weather = pd.read_csv("df2.csv") 


df_event ['Time'] = pd.to_datetime(df_event['Time']) 
df_weather ['Time'] = pd.to_datetime(df_weather['Time'])

df_event = df_event .set_index('Time')
df_weather = df_weather .set_index('Time')

df_event = pd.merge(df_event ,df_weather , how='outer',left_index=True, right_index=True)

Solution

  • Use, pd.merge_asof with direction=nearest to merge the two dataframes on the column Time:

    df_event['Time'] = pd.to_datetime(df_event['Time'])
    df_weather['Time'] = pd.to_datetime(df_weather['Time'])
    
    df_merged = pd.merge_asof(
        df_event.sort_values(by='Time'), df_weather.sort_values(by='Time'),
        on='Time', direction='nearest')
    

    Result:

    # print(df_merged)
    
                     Time  UserID_Detected  Temperature  Humidity
    0 2019-01-01 01:00:32                3           33        60
    1 2019-01-02 06:02:12                2           35        58
    2 2019-01-02 14:45:32                7           38        62