sql, scala, apache-spark, pyspark, apache-zeppelin

Is there a way to compare all rows in one column of a dataframe against all rows in another column of another dataframe (spark)?


I have two dataframes in Spark, both with an IP column. One has over 800,000 entries while the other has 4,000. What I want to do is check whether the IPs in the smaller dataframe appear in the IP column of the larger dataframe.

At the moment, all I can manage is to compare the columns row by row (first row against first row, second against second, and so on).

Thanks in advance!


Solution

  • How about something like this, straight from the manuals, using EXISTS or NOT EXISTS after creating temp views? See here: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2728434780191932/1483312212640900/6987336228780374/latest.html

    %sql
    SELECT  *
    FROM    t1 A
    WHERE   NOT EXISTS (SELECT  1
                          FROM  t2 B
                         WHERE  B.colx = A.colx)
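
    The same EXISTS / NOT EXISTS logic can also be expressed directly on the DataFrames with semi/anti joins, without creating temp views. A minimal sketch, assuming the two dataframes are called `small_df` and `large_df` and the column is named `ip` (those names are illustrative, not from the original post):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("ip-compare").getOrCreate()

    # Toy stand-ins for the 800,000-row and 4,000-row dataframes.
    large_df = spark.createDataFrame([("10.0.0.1",), ("10.0.0.2",), ("10.0.0.3",)], ["ip"])
    small_df = spark.createDataFrame([("10.0.0.2",), ("192.168.1.9",)], ["ip"])

    # IPs from the small dataframe that DO appear in the large one (EXISTS):
    matched = small_df.join(large_df, on="ip", how="left_semi")

    # IPs from the small dataframe that do NOT appear (NOT EXISTS):
    missing = small_df.join(large_df, on="ip", how="left_anti")

    matched.show()
    missing.show()
    ```

    A semi join keeps only columns from the left side, which is usually what you want here; Spark will typically broadcast the 4,000-row side automatically, so this should scale fine against the 800,000-row table.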