I have the following DataFrame where there are duplicates of the same person.
colnames = ["person", "ig_handle", "email", "phone_number"]
data = [
["emma", '@emma', '[email protected]', np.nan],
["emma", '@emma', np.nan, "09-emma-number"],
["jean", '@jean', "[email protected]", np.nan],
["jean", '@jean', np.nan, np.nan],
["kate", '@kate', "[email protected]", np.nan],
["john", '@john1', "[email protected]", "09-john-number"],
["john", '@john1', np.nan, "09-john-number-new"],
["john", np.nan, "[email protected]", np.nan],
["john", '@john2', np.nan, np.nan],
["lily", np.nan, np.nan, np.nan],
]
df = pd.DataFrame(data, columns=colnames)
df
We can identify duplicates by either ig_handle
or email
. Example: There are 2 john's but they have different ig_handle
.
Some rows have the data the other row doesn't have. For example, emma on first row has email
but the second row doesn't.
I want to eliminate duplicate rows by merging the data together.
So the end result would be as below:
person ig_handle email phone_number
emma @emma [email protected] 09-emma-number
jean @jean [email protected] NaN
kate @kate [email protected] NaN
john @john1 [email protected] 09-join-number-new
john @john2 NaN NaN
lily NaN NaN NaN
Try this:
df.groupby(['person',df['ig_handle'].fillna('@' + df['email'].str.split('@').str[0])],as_index=False,dropna=False).last()
Output:
person ig_handle email phone_number
0 emma @emma [email protected] 09-emma-number
1 jean @jean [email protected] None
2 john @john1 [email protected] 09-john-number-new
3 john @john2 None None
4 kate @kate [email protected] None
5 lily None None None