Search code examples
pythonpandasmerge

Pandas merge creates unwanted duplicate entries


I'm new to Pandas and I want to merge two datasets that have similar columns. The columns are going to each have some unique values compared to the other column, in addition to many identical values. There are some duplicates in each column that I'd like to keep. My desired output is shown below. Adding how='inner' or 'outer' does not yield the desired result.

import pandas as pd

df1 = df2 = pd.DataFrame({'A': [2,2,3,4,5]})

print(pd.merge(df1,df2))

output:
   A
0  2
1  2
2  2
3  2
4  3
5  4
6  5

desired/expected output:
   A
0  2
1  2
2  3
3  4
4  5

Please let me know how/if I can achieve the desired output using merge, thank you!

EDIT To clarify why I'm confused about this behavior, if I simply add another column, it doesn't make four 2's but rather there are only two 2's, so I would expect that in my first example it would also have the two 2's. Why does the behavior seem to change, what's pandas doing?

import pandas as pd
df1 = df2 = pd.DataFrame(
    {'A': [2,2,3,4,5], 'B': ['red','orange','yellow','green','blue']}
)

print(pd.merge(df1,df2))

output:
   A       B
0  2     red
1  2  orange
2  3  yellow
3  4   green
4  5    blue

However, based on the first example I would expect:
   A       B
0  2     red
1  2  orange
2  2     red
3  2  orange
4  3  yellow
5  4   green
6  5    blue

Solution

  • import pandas as pd
    
    dict1 = {'A':[2,2,3,4,5]}
    dict2 = {'A':[2,2,3,4,5]}
    
    df1 = pd.DataFrame(dict1).reset_index()
    df2 = pd.DataFrame(dict2).reset_index()
    
    df = df1.merge(df2, on = 'A')
    df = pd.DataFrame(df[df.index_x==df.index_y]['A'], columns=['A']).reset_index(drop=True)
    
    print(df)
    

    Output:

       A
    0  2
    1  2
    2  3
    3  4
    4  5