Search code examples
pythonpandasvalidationmerge

Validate in merge function pandas


Today I was trying to go a little deeper in the merge() function of pandas, and I found the option validate, which, as reported in the documentation, can be:

validate : string, default None

If specified, checks if merge is of specified type.

“one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets. “one_to_many” or “1:m”: check if merge keys are unique in left dataset. “many_to_one” or “m:1”: check if merge keys are unique in right dataset. "many_to_many” or “m:m”: allowed, but does not result in checks.

I have looked around to find a working example on where and how to use this function, but I couldn't find any. Moreover when I tried to apply it to a group of DataFrame's I was merging, it didn't seem to change the output. Can anyone give me a working example, to make me understand it better?

Thanks in advance,

Mattia


Solution

  • The new validate param will raise a MergeError if the validation fails, example:

    df1 = pd.DataFrame({'a':list('aabc'),'b':np.random.randn(4)})
    df2 = pd.DataFrame({'a':list('aabc'),'b':np.random.randn(4)})
    
    print(df1)
    print(df2)
    
       a         b
    0  a -2.557152
    1  a -0.145969
    2  b -1.629560
    3  c -0.233517
    
       a         b
    0  a -0.352038
    1  a  0.490438
    2  b  0.319452
    3  c -0.599481
    

    Now if we merge on column 'a' without validate:

    In[39]:
    df1.merge(df2, on='a')
    
    Out[39]: 
       a       b_x       b_y
    0  a -2.557152 -0.352038
    1  a -2.557152  0.490438
    2  a -0.145969 -0.352038
    3  a -0.145969  0.490438
    4  b -1.629560  0.319452
    5  c -0.233517 -0.599481
    

    This works but we get more rows for 'a' as column 'b' is different, now we pass validate='1:1', we get an error:

    MergeError: Merge keys are not unique in either
    left or right dataset; not a one-to-one merge
    

    if we pass validate='1:m' we get a different error:

    MergeError: Merge keys are not unique in left
    dataset;not a one-to-many merge
    

    Again this fails the validation, if we pass 'm:m':

    In[42]:
    df1.merge(df2, on='a',validate='m:m')
    
    Out[42]: 
       a       b_x       b_y
    0  a -2.557152 -0.352038
    1  a -2.557152  0.490438
    2  a -0.145969 -0.352038
    3  a -0.145969  0.490438
    4  b -1.629560  0.319452
    5  c -0.233517 -0.599481
    

    no error occurs and we get the same merged df if we had not passed the validate param

    The api docs don't give an example but the what's new section does, the original github enhancement also gives further background information