
PySpark: Find if a value is present in another dataframe


I have the dataframes below. I tried the join and isin functions but am not getting the expected output shown below. Not sure what I'm missing. I'd appreciate it if someone could help. Thanks.

DF1:

Name  Grade
Tom   A
Mike  B
John  C

DF2:

Approved_Grades  Description
A                Approved grade

Expected Output:

Name  Grade  Approved
Tom   A      Yes
Mike  B      No
John  C      No

Solution

  • Use a left join in PySpark to combine the two dataframes, then derive the Approved column from whether the join found a match. Check the solution below.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, when
    
    spark = SparkSession.builder.getOrCreate()
    
    data1 = [('Tom', 'A'),
             ('Mike', 'B'),
             ('John', 'C')]
    df1 = spark.createDataFrame(data1, ['Name', 'Grade'])
    
    data2 = [('A', 'Approved grade')]
    df2 = spark.createDataFrame(data2, ['Approved_Grades', 'Description'])
    
    # A left join keeps every row of df1; grades with no match in df2
    # come back with NULL in the df2 columns.
    merged_df = df1.join(df2, df1.Grade == df2.Approved_Grades, 'left') \
        .withColumn('Approved', when(col('Approved_Grades').isNull(), 'No').otherwise('Yes')) \
        .drop('Approved_Grades', 'Description')
    
    merged_df.show()
    

    Try the code above and see if it gives the expected output. Thanks.