Search code examples
pandasdatetimeappendtransform

Dynamic Timestamp Appending Based on Conditional Column Values with two datasets


This is a similar problem than one I mentioned a few days ago but with some particular caveats. I sincerely apologize in advance. I have two files that I have put into dataframes, df1 and df2 that have different information and respectively, different headers but in both files they share a common 'id' header. Essentially these two files contain different sets of information for the common 'id' that they share. For example:

df1:

id Date another_col
WPA54 2023-08-01 A
WPA54 2023-08-01 B
WPA54 2023-08-01 C
WPA54 2023-08-01 D
IBT675 2023-08-01 E
IBT675 2023-08-01 F

df2

id DateTime
WPA54 2023-08-01 00:02:52.527
WPA54 2023-08-01 00:10:10.640
WPA54 2023-08-01 00:10:12:740
WPA54 2023-08-01 00:10:26.937
IBT675 2023-08-01 00:10:10.640
IBT675 2023-08-01 00:10:11.540
IBT675 2023-08-01 00:10:12:740

For simplicity purposes, I would like to take the DateTime value of df2 and create a new column in df1 called fix_timestamps with the unique times joined by a semi-colon ' ; ' delimiter and paired with the proper id also in df1. The importance of this is that df1 and df2 have two different shapes. df1 is more 'fixed' and contains information that is considered a point in time, wheres I need to append that with information from df2 because df2 is a larger file that contains many different times that need to be appended to the id of df1.

The difference between the previous question and this was that my previous question was based off of one file via a merge that I now realized I was doing wrong as I started to understand my data more.

I cannot take credit for this as mozway was very helpful in my previous question. This works perfectly for one file but when I use two...

import pandas as pd

# Define the file paths for your CSV files
file1_path = 'input1.csv'
file2_path = 'input2.csv'

# Read the CSV files into DataFrames
df1 = pd.read_csv(file1_path)
df2 = pd.read_csv(file2_path)

def append_timestamp(row):
    id_length = len(row['id'])
    timestamps = []

    for i in range(id_length):
        timestamps.append(row['DateTime'])
    return ';'.join(timestamps)


# My thinking of this was to look in my df2 which has the DateTime and group it by the df2 id and create the new column, 'fix_timestamps' in df1 which would have all the DateTime values already appended.

df1['fix_timestamps'] = (df2['DateTime'].astype(str).groupby(df2['id'])
                        .transform(lambda x: ';'.join(x.unique())))


# Save the DataFrame to a CSV file
output_file_path = 'output'
df1.to_csv(output_file_path, index=False)

What I was expecting was this:

id DateTime
WPA54 2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12:740;2023-08-01 00:10:26.937
WPA54 2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12:740;2023-08-01 00:10:26.937
WPA54 2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12:740;2023-08-01 00:10:26.937
WPA54 2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12:740;2023-08-01 00:10:26.937
IBT675 2023-08-01 00:10:10.640;2023-08-01 00:10:11.540;2023-08-01 00:10:12:740
IBT675 2023-08-01 00:10:10.640;2023-08-01 00:10:11.540;2023-08-01 00:10:12:740
IBT675 2023-08-01 00:10:10.640;2023-08-01 00:10:11.540;2023-08-01 00:10:12:740

But what I am getting are the same entire appended time series for WPA54, IBET675 and for every id

id DateTime
WPA54 2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12:740;2023-08-01 00:10:26.937
WPA54 2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12:740;2023-08-01 00:10:26.937
WPA54 2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12:740;2023-08-01 00:10:26.937
WPA54 2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12:740;2023-08-01 00:10:26.937
IBT675 2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12:740;2023-08-01 00:10:26.937
IBT675 2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12:740;2023-08-01 00:10:26.937
IBT675 2023-08-01 00:02:52.527; 2023-08-01 00:10:10.640;2023-08-01 00:10:12:740;2023-08-01 00:10:26.937

Thank you in advance


Solution

  • You can solve it this way:

    import pandas as pd
    
    data_df1 = {
        'id': ['WPA54', 'WPA54', 'WPA54', 'WPA54', 'IBT675', 'IBT675'],
        'Date': ['2023-08-01', '2023-08-01', '2023-08-01', '2023-08-01', '2023-08-01', '2023-08-01'],
        'another_col': ['A', 'B', 'C', 'D', 'E', 'F']
    }
    
    data_df2 = {
        'id': ['WPA54', 'WPA54', 'WPA54', 'WPA54', 'IBT675', 'IBT675', 'IBT675'],
        'DateTime': ['2023-08-01 00:02:52.527', '2023-08-01 00:10:10.640', '2023-08-01 00:10:12.740', 
                     '2023-08-01 00:10:26.937', '2023-08-01 00:10:10.640', '2023-08-01 00:10:11.540', 
                     '2023-08-01 00:10:12.740']
    }
    
    df1 = pd.DataFrame(data_df1)
    df2 = pd.DataFrame(data_df2)
    
    timestamps = df2.groupby('id')['DateTime'].apply(lambda x: ';'.join(x.unique())).reset_index()
    timestamps.columns = ['id', 'fix_timestamps']
    df1_updated = pd.merge(df1, timestamps, on='id', how='left')
    
    df1_updated
    
    timestamps = df2.groupby('id')['DateTime'].apply(lambda x: ';'.join(x.unique())).reset_index()
    timestamps.columns = ['id', 'fix_timestamps']
    
    df1 = pd.merge(df1, timestamps, on='id', how='left')
    
    df1.to_csv(r'C:\Users\s-degossondevarennes/outputdd.csv', index=False)   
    

    which will return

    enter image description here