I want to check is a substring from DF1 is in DF2. If it is I want to return a value of a corresponding row.
DF1
Name | ID | Region |
---|---|---|
John | AAA | A |
John | AAA | B |
Pat | CCC | C |
Sandra | CCC | D |
Paul | DD | E |
Sandra | R9D | F |
Mia | dfg4 | G |
Kim | asfdh5 | H |
Louise | 45gh | I |
DF2
Name | ID | Company |
---|---|---|
John | AAAxx1 | Microsoft |
John | AAAxxREG1 | Microsoft |
Michael | BBBER4 | Microsoft |
Pat | CCCERG | Dell |
Pat | CCCERGG | Dell |
Paul | DFHDHF |
Desired Output
Where ID from DF1 is in the ID column of DF2 I want to create a new column in DF1 that matches the company
Name | ID | Region | Company |
---|---|---|---|
John | AAA | A | Microsoft |
John | AAA | B | Microsoft |
Pat | CCC | C | Dell |
Sandra | CCC | D | |
Paul | DD | E | |
Sandra | R9D | F | |
Mia | dfg4 | G | |
Kim | asfdh5 | H | |
Louise | 45gh | I |
I have the below code that determines if the ID from DF1 is in DF2 however I'm not sure how I can bring in the company name.
DF1['Get company'] = np.in1d(DF1['ID'], DF2['ID'])
Try to find ID
string from df1
into df2
then merge
on this column:
key = df2['ID'].str.extract(fr"({'|'.join(df1['ID'].values)})", expand=False)
df1 = df1.merge(df2['Company'], left_on='ID', right_on=key, how='left').fillna('')
print(df1)
# Output:
Name ID Company
0 John AAA
1 Peter BAB Microsoft
2 Paul CCHF Google
3 Rosie R9D
Details: create a regex from df1['ID']
to extract partial string from df2['ID']
:
# Regex pattern: try to extract the following pattern
>>> fr"({'|'.join(df1['ID'].values)})"
'(AAA|BAB|CCHF|R9D)'
# After extraction
>>> pd.concat([df2['ID'], key], axis=1)
ID ID
0 AEDSV NaN # Nothing was found
1 123BAB BAB # Found partial string BAB
2 CCHF-RB CCHF # Found partial string CCHF
3 YYYY NaN # Nothing was found
Update:
To solve this I wonder is it possible to merge based on 2 columns. e.g merge on Name and ID?
key = df2['ID'].str.extract(fr"({'|'.join(df1['ID'].values)})", expand=False)
df1 = pd.merge(df1, df2[['Name', 'Company']], left_on=['Name', 'ID'],
right_on=['Name', key], how='left').drop_duplicates().fillna('')
print(df1)
# Output:
Name ID Region Company
0 John AAA A Microsoft
2 John AAA B Microsoft
4 Pat CCC C Dell
6 Sandra CCC D
7 Paul DD E
8 Sandra R9D F
9 Mia dfg4 G
10 Kim asfdh5 H
11 Louise 45gh I