Search code examples
pythonpandasdataframegroup-by

Pandas | Use groupby to find duplicate based on 2 columns which have NaN values


I have the following DataFrame where there are duplicates of the same person.

colnames = ["person", "ig_handle", "email", "phone_number"]
data = [
    ["emma", '@emma', '[email protected]', np.nan],
    ["emma", '@emma', np.nan, "09-emma-number"],
    ["jean", '@jean', "[email protected]", np.nan],
    ["jean", '@jean', np.nan, np.nan],
    ["kate", '@kate', "[email protected]", np.nan],
    ["john", '@john1', "[email protected]", "09-john-number"],
    ["john", '@john1', np.nan, "09-john-number-new"],
    ["john", np.nan, "[email protected]", np.nan],
    ["john", '@john2', np.nan, np.nan],
    ["lily", np.nan, np.nan, np.nan],
]

df = pd.DataFrame(data, columns=colnames)
df

We can identify duplicates by either ig_handle or email. Example: There are 2 john's but they have different ig_handle.

Some rows have the data the other row doesn't have. For example, emma on first row has email but the second row doesn't.

I want to eliminate duplicate rows by merging the data together.

So the end result would be as below:

person   ig_handle  email            phone_number
emma     @emma      [email protected]   09-emma-number
jean     @jean      [email protected]   NaN
kate     @kate      [email protected]   NaN
john     @john1     [email protected]  09-join-number-new
john     @john2     NaN              NaN
lily     NaN        NaN              NaN

Solution

  • Try this:

    df.groupby(['person',df['ig_handle'].fillna('@' + df['email'].str.split('@').str[0])],as_index=False,dropna=False).last()
    

    Output:

      person ig_handle            email        phone_number
    0   emma     @emma   [email protected]      09-emma-number
    1   jean     @jean   [email protected]                None
    2   john    @john1  [email protected]  09-john-number-new
    3   john    @john2             None                None
    4   kate     @kate   [email protected]                None
    5   lily      None             None                None