I have an old dataframe with the following columns and many rows; it looks like this:
>old_df
date/time    Name    detect_ID    category    ID
12/1/2023    XXX     1            B           1400
12/1/2023    XXY     1,3,7        B           1402
12/1/2023    XXY     4            A           1403
12/1/2023    XXY     4            B           1407
.....
I also have some information in new_df, which has similar columns, and based on it I want to update old_df. The new dataframe is:
>new_df
date/time    Name    detect_ID    category    ID
13/1/2023    XXX     1            B           1400
14/1/2023    XXY     1,3,8        B           1402
14/1/2023    XXY     1            B           1405
.....
For the update I want the following conditions:
If an ID in new_df matches any ID in old_df, go through that specific row's detect_ID column value. A few things can then happen:
A. If that detect_ID value of old_df matches the detect_ID value of the corresponding row of new_df, take that row of new_df and replace the matched row of old_df, and give the newly created column identify the value updated. Note that detect_ID can hold multiple values (e.g. 1,3,7); I want to check each of them separately, and some of these values may already be integers, so basically split them on , and convert them to integers (see the small sketch after these conditions).
B. If that detect_ID value of old_df doesn't match the detect_ID value of that row of new_df, take the full row from new_df and add it as a new row of old_df, assigning the value new in column identify.
I want this to iterate through all rows of old_df until every row of new_df has been applied to old_df.
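To illustrate the kind of split I mean (just a small sketch, not tied to any particular row):

# e.g. the string '1,3,7' becomes the integers [1, 3, 7];
# wrapping in str() also covers values that are already plain integers
detect_ids = [int(x) for x in str('1,3,7').split(',')]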
For the given example, I want the output dataframe to look like the following:
>output
date/time    Name    detect_ID    category    ID      identify
13/1/2023    XXX     1            B           1400    updated      [Case A]
14/1/2023    XXY     1            B           1402    updated      [Case A with multiple detect_ID]
14/1/2023    XXY     3            B           1402    updated
12/1/2023    XXY     7            B           1402    unchanged    [ID matches but detect_ID does not]
14/1/2023    XXY     8            B           1402    new          [Case B]
12/1/2023    XXY     4            A           1403    unchanged
12/1/2023    XXY     4            B           1407    unchanged    [ID not found in new_df]
I am using the following code, but it does not work the way I want: it produces a lot of duplicates and also skips many rows of old_df.
import pandas as pd

old_df = pd.read_csv('old.csv')
new_df = pd.read_csv('new.csv')

# Create a set of tuples representing the unique (ID, detect_ID) pairs in the old dataframe
unique_pairs = set()
for _, row in old_df.iterrows():
    detector_ids = [int(x) for x in str(row['detect_ID']).split(',')]
    for detector_id in detector_ids:
        unique_pairs.add((row['ID'], detector_id))

# Iterate over the rows in the new dataframe and check if their (ID, detect_ID) pair is in the set of unique pairs
new_rows = []
updated_rows = []
for _, row in new_df.iterrows():
    detector_ids = [int(x) for x in str(row['detect_ID']).split(',')]
    for detector_id in detector_ids:
        if (row['ID'], detector_id) in unique_pairs:
            old_row = old_df.loc[(old_df['ID'] == row['ID']) & (old_df['detect_ID'].str.contains(str(detector_id)))]
            if not old_row.empty:
                old_row = old_row.iloc[0]
                old_row['date/time'] = row['date/time']
                old_df.loc[(old_df['ID'] == row['ID']) & (old_df['detect_ID'].str.contains(str(detector_id))), 'date/time'] = row['date/time']
                updated_rows.append(old_row)
        else:
            row['identify'] = 'new'
            new_rows.append(row)
            unique_pairs.add((row['ID'], detector_id))

# Append the new rows to the old dataframe and write the updated dataframe to a new file
old_df = old_df.append(new_rows, ignore_index=True)
for row in updated_rows:
    row['identify'] = 'updated'
old_df = old_df.append(updated_rows, ignore_index=True)
old_df.to_csv('updated.csv', index=False)
Relying on df.iterrows nearly always implies a suboptimal approach to manipulations in pandas (see e.g. this SO post). Here's an approach using df.merge for the important part.
Step 1
Make sure that all the values in column detect_ID are strings by applying Series.astype(str). Now, use Series.str.split and df.explode to get entries like 1,3,7 into separate rows. Apply this to both dfs. At this stage, we can change the type (back) to int for all values in detect_ID (assuming that your data indeed consists of numeric characters). Since we only want to check rows that have value B in column category, filter out any non-B values from new_df with Series.eq (though in your current example no such value exists).
old_df['detect_ID'] = old_df['detect_ID'].astype(str).str.split(',')
old_df = old_df.explode('detect_ID', ignore_index=False)
old_df['detect_ID'] = old_df['detect_ID'].astype(int)
new_df['detect_ID'] = new_df['detect_ID'].astype(str).str.split(',')
new_df = new_df.explode('detect_ID', ignore_index=False)
new_df['detect_ID'] = new_df['detect_ID'].astype(int)
new_df = new_df[new_df['category'].eq('B')]
# dfs now as follows

old_df

   date/time Name detect_ID category    ID
0  12/1/2023  XXX         1        B  1400
1  12/1/2023  XXY         1        B  1402
1  12/1/2023  XXY         3        B  1402
1  12/1/2023  XXY         7        B  1402
2  12/1/2023  XXY         4        A  1403
3  12/1/2023  XXY         4        B  1407
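And new_df after the same treatment should come out like this (all rows in this example already have category B, so the filter drops nothing):

new_df

   date/time Name detect_ID category    ID
0  13/1/2023  XXX         1        B  1400
1  14/1/2023  XXY         1        B  1402
1  14/1/2023  XXY         3        B  1402
1  14/1/2023  XXY         8        B  1402
2  14/1/2023  XXY         1        B  1405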
Step 2
Apply df.merge. We want to merge on ['Name', 'detect_ID', 'category', 'ID'], keep all the entries from both sides (so: how='outer'), and also add an indicator column (calling it identify), which will tell us the source of each row. Adding custom suffixes (e.g. '_old' instead of the default '_x') for clarity's sake.

res = old_df.merge(new_df, on=['Name', 'detect_ID', 'category', 'ID'],
                   how='outer', indicator='identify', suffixes=('_old', '_new'))
res

  date/time_old Name detect_ID category    ID date/time_new    identify
0     12/1/2023  XXX         1        B  1400     13/1/2023        both
1     12/1/2023  XXY         1        B  1402     14/1/2023        both
2     12/1/2023  XXY         3        B  1402     14/1/2023        both
3     12/1/2023  XXY         7        B  1402           NaN   left_only
4     12/1/2023  XXY         4        A  1403           NaN   left_only
5     12/1/2023  XXY         4        B  1407           NaN   left_only
6           NaN  XXY         8        B  1402     14/1/2023  right_only
7           NaN  XXY         1        B  1405     14/1/2023  right_only
Step 3
At this stage, we want to decide which value to keep for column date/time. We need the values from _new for all rows (1) that exist in both dfs, and (2) that exist only in new_df. Since left_only entries will have NaN values in column date/time_new, we can rely on Series.where to accomplish this:

res['date/time'] = res['date/time_new'].where(res['date/time_new'].notna(),
                                              res['date/time_old'])
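An equivalent alternative, if you prefer Series.fillna (just a variation, not part of the steps below):

# fills the NaN positions of date/time_new with the aligned values from date/time_old
res['date/time'] = res['date/time_new'].fillna(res['date/time_old'])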
Step 4
The following remains to be done:
- Remap the values in column identify. We can use Series.map for this.
- Get res in the correct order. Let's use the column names from old_df plus identify inside df.loc, and also chain df.sort_values for this. Let's also use df.reset_index here.

mapper = {'both': 'updated',
          'left_only': 'unchanged',
          'right_only': 'new'}
res['identify'] = res['identify'].map(mapper)

res = (res.loc[:, list(old_df.columns) + ['identify']]
       .sort_values(['ID', 'detect_ID'])
       .reset_index(drop=True))
res

   date/time Name detect_ID category    ID   identify
0  13/1/2023  XXX         1        B  1400    updated
1  14/1/2023  XXY         1        B  1402    updated
2  14/1/2023  XXY         3        B  1402    updated
3  12/1/2023  XXY         7        B  1402  unchanged
4  14/1/2023  XXY         8        B  1402        new
5  12/1/2023  XXY         4        A  1403  unchanged
6  14/1/2023  XXY         1        B  1405        new
7  12/1/2023  XXY         4        B  1407  unchanged
N.B. As mentioned by @Ashyam in the comments above, your desired result doesn't have the row for ID 1405, which exists only in new_df. I'm here assuming that you do want this entry in the new df. If not, you can get rid of it as follows:

res = res[res['ID'].isin(old_df['ID'])].reset_index(drop=True)

Of course, this operation could then in fact already be applied to new_df; cf. the filter for value B in column category above.
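A minimal sketch of that early filter, assuming you indeed don't want IDs that are absent from old_df (such as 1405); it would go right after reading new_df, before the explode in Step 1:

# keep only rows of new_df whose ID already exists in old_df
new_df = new_df[new_df['ID'].isin(old_df['ID'])].reset_index(drop=True)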