Tags: pandas, dataframe, pyspark, aws-glue, aws-glue-spark

DataFrame remove rows existing in another DataFrame


I have two data frames:

df1:

+----------+-------------+-------------+--------------+---------------+
|customerId|     fullName|   telephone1|    telephone2|          email|
+----------+-------------+-------------+--------------+---------------+
|    201534|MARIO JIMENEZ|01722-3500391|+5215553623333|[email protected]|
|    879535|  MARIO LOPEZ|01722-3500377|+5215553623333| [email protected]|
+----------+-------------+-------------+--------------+---------------+

df2:

+----------+-------------+-------------+--------------+---------------+
|customerId|     fullName|   telephone1|    telephone2|          email|
+----------+-------------+-------------+--------------+---------------+
|    201534|MARIO JIMENEZ|01722-3500391|+5215553623333|[email protected]|
|    201536|  ROBERT MITZ|01722-3500377|+5215553623333| [email protected]|
|    201537|     MARY ENG|01722-3500127|+5215553623111|[email protected]|
|    201538|    RICK BURT|01722-3500983|+5215553623324|[email protected]|
|    201539|     JHON DOE|01722-3502547|+5215553621476|[email protected]|
+----------+-------------+-------------+--------------+---------------+

I need to get a third DataFrame with the rows from df1 that do not exist in df2.

Like this:

+----------+-------------+-------------+--------------+---------------+
|customerId|     fullName|   telephone1|    telephone2|          email|
+----------+-------------+-------------+--------------+---------------+
|    879535|  MARIO LOPEZ|01722-3500377|+5215553623333| [email protected]|
+----------+-------------+-------------+--------------+---------------+

What is the correct way of doing this?

I've already tried the following:

diff = df2.join(df1, df2['customerId'] != df1['customerId'],"left")
diff = df1.subtract(df2)
diff = df1[~ df1['customerId'].isin(df2['customerId'])]

But they do not work. Any suggestions?


Solution

  • Using PySpark:

    You can create a list containing the customerId values from df2 with collect():

    from pyspark.sql.functions import col
    id_df2 = [row[0] for row in df2.select('customerId').distinct().collect()]
    

    Then filter df1 on customerId using isin with the negation operator ~:

    diff = df1.where(~col('customerId').isin(id_df2))
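
    For reference, here is a minimal, self-contained sketch of this approach using data shaped like the question's (the obfuscated email addresses are replaced with placeholder values):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()

    columns = ["customerId", "fullName", "telephone1", "telephone2", "email"]

    # Rows mirroring df1 and df2 from the question (emails are placeholders).
    df1 = spark.createDataFrame([
        (201534, "MARIO JIMENEZ", "01722-3500391", "+5215553623333", "jimenez@example.com"),
        (879535, "MARIO LOPEZ",   "01722-3500377", "+5215553623333", "lopez@example.com"),
    ], columns)

    df2 = spark.createDataFrame([
        (201534, "MARIO JIMENEZ", "01722-3500391", "+5215553623333", "jimenez@example.com"),
        (201536, "ROBERT MITZ",   "01722-3500377", "+5215553623333", "mitz@example.com"),
    ], columns)

    # Collect the distinct customerIds present in df2 into a Python list...
    id_df2 = [row[0] for row in df2.select("customerId").distinct().collect()]

    # ...and keep only the df1 rows whose customerId is not in that list.
    diff = df1.where(~col("customerId").isin(id_df2))
    diff.show()  # only the 879535 / MARIO LOPEZ row remains

    Note that collect() pulls the ids to the driver, so this works best when df2's set of ids is reasonably small. For a large df2, a left anti join (df1.join(df2, "customerId", "left_anti")) gives the same result without collecting anything to the driver.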