I have an old dataframe with the following columns and many rows; it looks like this:
>old_df
date/time    Name    detect_ID    category    ID
12/1/2023    XXX     1            B           1400
12/1/2023    XXY     1,3,7        B           1402
12/1/2023    XXY     4            A           1403
12/1/2023    XXY     4            B           1407
.....
I also have some information in new_df, which has similar columns, and based on it I want to update old_df. The new dataframe is:
>new_df
date/time    Name    detect_ID    category    ID
13/1/2023    XXX     1            B           1400
14/1/2023    XXY     1,3,8        B           1402
14/1/2023    XXY     1            B           1405
.....
For the update I want the following conditions:
If an ID in new_df matches any ID in old_df, go through that specific row's detect_ID column value. A few things can then happen:
A. If that detect_ID value of old_df matches the detect_ID value of the corresponding row of new_df, take that row of new_df and replace the matched row of old_df, and give the newly created column identify the value updated. Note that detect_ID can hold multiple values (e.g. 1,3,7); I want to check each of them separately, and some of these values may already be integers, so basically split them on , and convert them to integers (see the small sketch after these conditions).
B. If that detect_ID value of old_df doesn't match the detect_ID value of that row of new_df, take the full row from new_df and add it as a new row of old_df, assigning the value new in column identify.
I want this to iterate through all rows of old_df until every row of new_df has been applied to old_df.
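To illustrate the kind of split I mean (just a small sketch, not tied to any particular row):

# e.g. the string '1,3,7' becomes the integers [1, 3, 7];
# wrapping in str() also covers values that are already plain integers
detect_ids = [int(x) for x in str('1,3,7').split(',')]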
For the given example, I want the output dataframe to look like the following:
>output
date/time    Name    detect_ID    category    ID      identify
13/1/2023    XXX     1            B           1400    updated      [Case A]
14/1/2023    XXY     1            B           1402    updated      [Case A with multiple detect_ID]
14/1/2023    XXY     3            B           1402    updated
12/1/2023    XXY     7            B           1402    unchanged    [ID matches but detect_ID does not]
14/1/2023    XXY     8            B           1402    new          [Case B]
12/1/2023    XXY     4            A           1403    unchanged
12/1/2023    XXY     4            B           1407    unchanged    [ID not found in new_df]
I am using the following code, but it does not work the way I want: it produces a lot of duplicates and also skips many rows of old_df.
import pandas as pd

old_df = pd.read_csv('old.csv')
new_df = pd.read_csv('new.csv')

# Create a set of tuples representing the unique (ID, detect_ID) pairs in the old dataframe
unique_pairs = set()
for _, row in old_df.iterrows():
    detector_ids = [int(x) for x in str(row['detect_ID']).split(',')]
    for detector_id in detector_ids:
        unique_pairs.add((row['ID'], detector_id))

# Iterate over the rows in the new dataframe and check if their (ID, detect_ID) pair is in the set of unique pairs
new_rows = []
updated_rows = []
for _, row in new_df.iterrows():
    detector_ids = [int(x) for x in str(row['detect_ID']).split(',')]
    for detector_id in detector_ids:
        if (row['ID'], detector_id) in unique_pairs:
            old_row = old_df.loc[(old_df['ID'] == row['ID']) & (old_df['detect_ID'].str.contains(str(detector_id)))]
            if not old_row.empty:
                old_row = old_row.iloc[0]
                old_row['date/time'] = row['date/time']
                old_df.loc[(old_df['ID'] == row['ID']) & (old_df['detect_ID'].str.contains(str(detector_id))), 'date/time'] = row['date/time']
                updated_rows.append(old_row)
        else:
            row['identify'] = 'new'
            new_rows.append(row)
            unique_pairs.add((row['ID'], detector_id))

# Append the new rows to the old dataframe and write the updated dataframe to a new file
old_df = old_df.append(new_rows, ignore_index=True)
for row in updated_rows:
    row['identify'] = 'updated'
old_df = old_df.append(updated_rows, ignore_index=True)
old_df.to_csv('updated.csv', index=False)
Relying on df.iterrows nearly always implies a suboptimal approach to manipulations in pandas (see e.g. this SO post). Here's an approach using df.merge for the important part.
Step 1
Make sure that all the values in column detect_ID are strings by applying Series.astype(str). Now, use Series.str.split and df.explode to get entries like 1,3,7 into separate rows. Apply this to both dfs. At this stage, we can change the type (back) to int for all values in detect_ID (assuming that your data indeed consists of numeric characters). Since we only want to check rows that have value B in column category, filter out any non-B values from new_df with Series.eq (though in your current example no such value exists).
old_df['detect_ID'] = old_df['detect_ID'].astype(str).str.split(',')
old_df = old_df.explode('detect_ID', ignore_index=False)
old_df['detect_ID'] = old_df['detect_ID'].astype(int)
new_df['detect_ID'] = new_df['detect_ID'].astype(str).str.split(',')
new_df = new_df.explode('detect_ID', ignore_index=False)
new_df['detect_ID'] = new_df['detect_ID'].astype(int)
new_df = new_df[new_df['category'].eq('B')]
# dfs now as follows

old_df

   date/time Name detect_ID category    ID
0  12/1/2023  XXX         1        B  1400
1  12/1/2023  XXY         1        B  1402
1  12/1/2023  XXY         3        B  1402
1  12/1/2023  XXY         7        B  1402
2  12/1/2023  XXY         4        A  1403
3  12/1/2023  XXY         4        B  1407
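And new_df after the same treatment should come out like this (all rows in this example already have category B, so the filter drops nothing):

new_df

   date/time Name detect_ID category    ID
0  13/1/2023  XXX         1        B  1400
1  14/1/2023  XXY         1        B  1402
1  14/1/2023  XXY         3        B  1402
1  14/1/2023  XXY         8        B  1402
2  14/1/2023  XXY         1        B  1405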
Step 2
Apply df.merge. We want to merge on ['Name', 'detect_ID', 'category', 'ID'], keep all the entries from both sides (so: how='outer'), and also add an indicator column (calling it identify), which will tell us the source of each row. Adding custom suffixes (e.g. '_old' instead of the default '_x') for clarity's sake.

res = old_df.merge(new_df, on=['Name', 'detect_ID', 'category', 'ID'],
                   how='outer', indicator='identify', suffixes=('_old', '_new'))
res

  date/time_old Name detect_ID category    ID date/time_new    identify
0     12/1/2023  XXX         1        B  1400     13/1/2023        both
1     12/1/2023  XXY         1        B  1402     14/1/2023        both
2     12/1/2023  XXY         3        B  1402     14/1/2023        both
3     12/1/2023  XXY         7        B  1402           NaN   left_only
4     12/1/2023  XXY         4        A  1403           NaN   left_only
5     12/1/2023  XXY         4        B  1407           NaN   left_only
6           NaN  XXY         8        B  1402     14/1/2023  right_only
7           NaN  XXY         1        B  1405     14/1/2023  right_only
Step 3
At this stage, we want to decide which value to keep for column date/time. We need the values from _new for all rows (1) that exist in both dfs, and (2) that exist only in new_df. Since left_only entries will have NaN values in column date/time_new, we can rely on Series.where to accomplish this:

res['date/time'] = res['date/time_new'].where(res['date/time_new'].notna(),
                                              res['date/time_old'])
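An equivalent alternative, if you prefer Series.fillna (just a variation, not part of the steps below):

# fills the NaN positions of date/time_new with the aligned values from date/time_old
res['date/time'] = res['date/time_new'].fillna(res['date/time_old'])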
Step 4
The following remains to be done:
- Remap the values in column identify. We can use Series.map for this.
- Get res in the correct order. Let's use the column names from old_df plus identify inside df.loc, and also chain df.sort_values for this. Let's also use df.reset_index here.

mapper = {'both': 'updated',
          'left_only': 'unchanged',
          'right_only': 'new'}
res['identify'] = res['identify'].map(mapper)

res = (res.loc[:, list(old_df.columns) + ['identify']]
       .sort_values(['ID', 'detect_ID'])
       .reset_index(drop=True))
res

   date/time Name detect_ID category    ID   identify
0  13/1/2023  XXX         1        B  1400    updated
1  14/1/2023  XXY         1        B  1402    updated
2  14/1/2023  XXY         3        B  1402    updated
3  12/1/2023  XXY         7        B  1402  unchanged
4  14/1/2023  XXY         8        B  1402        new
5  12/1/2023  XXY         4        A  1403  unchanged
6  14/1/2023  XXY         1        B  1405        new
7  12/1/2023  XXY         4        B  1407  unchanged
N.B. As mentioned by @Ashyam in the comments above, your desired result doesn't have the row for ID 1405, which exists only in new_df. I'm here assuming that you do want this entry in the new df. If not, you can get rid of it as follows:

res = res[res['ID'].isin(old_df['ID'])].reset_index(drop=True)

Of course, this operation could then in fact already be applied to new_df; cf. the filter for value B in column category above.
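A minimal sketch of that early filter, assuming you indeed don't want IDs that are absent from old_df (such as 1405); it would go right after reading new_df, before the explode in Step 1:

# keep only rows of new_df whose ID already exists in old_df
new_df = new_df[new_df['ID'].isin(old_df['ID'])].reset_index(drop=True)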