Search code examples
pythonsqlpandasexcept

What is the analogue of EXCEPT clause in SQL in Pandas?


I have a sample pandas dataframe df:

        col1    col2    col3    col4
      0   a      1.0    2.0      3
      1   b      NaN    NaN      6
      2   c      NaN    8.0      9
      3   d      NaN    11.0    12
      4   e     13.0    14.0    15
      5   f     17.0    18.0    19
      6   g     21.0    22.0    23

and a second one df1:

        col1    col2    col3    col4
      0  a      1.0     2.0      3
      4  e     13.0    14.0     15
      5  f     17.0    18.0     19
      6  g     21.0    22.0     23

I want to get the subset of df that does not overlaps with df1. In effect I am looking for the equivalent of the EXCEPT operand in SQL.

I used the subtract() function -- but this was clearly wrong, as the subtract performs elementwise numerical subtraction. So I got an error message:

       TypeError: unsupported operand type(s) for -: 'str' and 'str'

So the question is: What is the equivalent of EXCEPT in SQL for Pandas?


Solution

  • I think you need set_index of all string columns first:

    df2 = df.set_index('col1').subtract(df1.set_index('col1'), axis='columns')
    print (df2)
          col2  col3  col4
    col1                  
    a      0.0   0.0   0.0
    b      NaN   NaN   NaN
    c      NaN   NaN   NaN
    d      NaN   NaN   NaN
    e      0.0   0.0   0.0
    f      0.0   0.0   0.0
    g      0.0   0.0   0.0
    

    Or:

    df2 = df.set_index('col1').subtract(df1.set_index('col1'), axis='columns', fill_value=0)
    print (df2)
          col2  col3  col4
    col1                  
    a      0.0   0.0   0.0
    b      NaN   NaN   6.0
    c      NaN   8.0   9.0
    d      NaN  11.0  12.0
    e      0.0   0.0   0.0
    f      0.0   0.0   0.0
    g      0.0   0.0   0.0
    

    EDIT by edited question:

    print (df.isin(df1))
        col1   col2   col3   col4
    0   True   True   True   True
    1  False  False  False  False
    2  False  False  False  False
    3  False  False  False  False
    4   True   True   True   True
    5   True   True   True   True
    6   True   True   True   True
    
    print (df.isin(df1).all(axis=1))
    0     True
    1    False
    2    False
    3    False
    4     True
    5     True
    6     True
    dtype: bool
    
    print (~df.isin(df1).all(axis=1))
    0    False
    1     True
    2     True
    3     True
    4    False
    5    False
    6    False
    dtype: bool
    
    print (df[~(df.isin(df1).all(axis=1))])
      col1  col2  col3  col4
    1    b   NaN   NaN     6
    2    c   NaN   8.0     9
    3    d   NaN  11.0    12