Search code examples
pythonpandasdataframemergenan

Prevent merge of dataframes from dropping data and replacing with NaN


I have two df I wish to merge. Examples:

dfone

    Centre  Name    open  click
0   12345   Ed      4     7
1   67890   Dave    8     3
2   54321   Mary    5     6
3   09876   Susan   4     3
4   24680   Tom     9     7
5   13579   Will    6     2

dftwo

    Centre  Name    Train  Online Home
0   67890   Dave    Y      Y      N
1   13579   Will    N      Y      Y
2   09876   Susan   Y      Y      Y
3   24680   Tom     N      N      Y
4   54321   Mary    Y      Y      N
5   12345   Ed      N      N      N

I use the following to merge the two dfs:

dfthree = pd.merge(dfone, dftwo, left_on='Centre', right_on='Centre', how='left',
                  suffixes=('','_y')).filter(regex='^(?!.*_y)')

My desired output is:

    Centre  Name    open  click  Train  Online Home
0   12345   Ed      4     7      N      N      N
1   67890   Dave    8     3      Y      Y      N
2   54321   Mary    5     6      Y      Y      N
3   09876   Susan   4     3      Y      Y      Y
4   24680   Tom     9     7      N      N      Y
5   13579   Will    6     2      N      Y      Y

My ACTUAL output is:

    Centre  Name    open  click  Train  Online Home
0   12345   Ed      4     7      N      N      N
1   67890   Dave    8     3      N      N      N
2   54321   Mary    5     6      N      N      N
3   09876   Susan   4     3      Nan    Nan    Nan
4   24680   Tom     9     7      Nan    Nan    Nan
5   13579   Will    6     2      Nan    Nan    Nan

Where am I going wrong please and what should I be doing instead?


Solution

  • Example

    import pandas as pd
    
    data1 = {'Centre': {0: 12345, 1: 67890, 2: 54321, 3: 9876, 4: 24680, 5: 13579},
     'Name': {0: 'Ed', 1: 'Dave', 2: 'Mary', 3: 'Susan', 4: 'Tom', 5: 'Will'},
     'open': {0: 4, 1: 8, 2: 5, 3: 4, 4: 9, 5: 6},
     'click': {0: 7, 1: 3, 2: 6, 3: 3, 4: 7, 5: 2}}
    
    data2 = {'Centre': {0: 67890, 1: 13579, 2: 9876, 3: 24680, 4: 54321, 5: 12345},
     'Name': {0: 'Dave', 1: 'Will', 2: 'Susan', 3: 'Tom', 4: 'Mary', 5: 'Ed'},
     'Train': {0: 'Y', 1: 'N', 2: 'Y', 3: 'N', 4: 'Y', 5: 'N'},
     'Online': {0: 'Y', 1: 'Y', 2: 'Y', 3: 'N', 4: 'Y', 5: 'N'},
     'Home': {0: 'N', 1: 'Y', 2: 'Y', 3: 'Y', 4: 'N', 5: 'N'}}
    
    df1 = pd.DataFrame(data1)
    df2 = pd.DataFrame(data2)
    

    Code

    df3 = df1.merge(df2, how='left')
    

    df3

        Centre  Name    open    click   Train   Online  Home
    0   12345   Ed      4       7       N       N       N
    1   67890   Dave    8       3       Y       Y       N
    2   54321   Mary    5       6       Y       Y       N
    3   9876    Susan   4       3       Y       Y       Y
    4   24680   Tom     9       7       N       N       Y
    5   13579   Will    6       2       N       Y       Y
    

    Applying the code to an well-made example before you apply it to your dataset will reveal the cause of most problems. Your problem is not a problem with code.

    if the merge result is NaN instead of the desired result, the actual data is the same when you see it with your eyes, but in reality it is often different.

    ex) white space, case, data type, etc.

    'abc' ≠ ' abc'

    'abc' ≠ 'Abc'

    1 ≠ '1'


    merge need exact same key, unless return NaN. The ex above is a typical example of what we think is the same key but different.