Tags: apache-spark, pyspark, apache-spark-sql, sql-update, azure-databricks

How to update two columns in PySpark satisfying the same condition?


I have a table with 4 columns: "ID", "FLAG_A", "FLAG_B", "FLAG_C". Below is the SQL query I want to translate into PySpark; it has two conditions that must both hold before the "FLAG_A" and "FLAG_B" columns are updated. How can I do this in PySpark?

UPDATE STATUS_TABLE SET STATUS_TABLE.[FLAG_A] = "JAVA", 
STATUS_TABLE.FLAG_B = "PYTHON"
WHERE (((STATUS_TABLE.[FLAG_A])="PROFESSIONAL_CODERS") AND 
((STATUS_TABLE.FLAG_C) Is Null)); 

Is it possible to do this in a single PySpark statement, checking both conditions and updating the "FLAG_A" and "FLAG_B" columns together?


Solution

  • I can't think of a way to write this as the kind of single statement you have in mind. I tried running the UPDATE query through Spark SQL, but UPDATE is not supported there:

    : java.lang.UnsupportedOperationException: UPDATE TABLE is not supported temporarily.

    The following does exactly the same as your UPDATE query:

    Input:

    from pyspark.sql import functions as F
    df = spark.createDataFrame(
        [(1, 'PROFESSIONAL_CODERS', 'X', None),
         (2, 'KEEP', 'KEEP', 'KEEP')],
        ['ID', 'FLAG_A', 'FLAG_B', 'FLAG_C'])
    

    Script:

    cond = (F.col('FLAG_A') == 'PROFESSIONAL_CODERS') & F.isnull('FLAG_C')
    # Update FLAG_B first: once FLAG_A has been overwritten with 'JAVA', cond
    # would no longer match, so the order of the two withColumn calls matters.
    df = df.withColumn('FLAG_B', F.when(cond, 'PYTHON').otherwise(F.col('FLAG_B')))
    df = df.withColumn('FLAG_A', F.when(cond, 'JAVA').otherwise(F.col('FLAG_A')))
    
    df.show()
    # +---+------+------+------+
    # | ID|FLAG_A|FLAG_B|FLAG_C|
    # +---+------+------+------+
    # |  1|  JAVA|PYTHON|  null|
    # |  2|  KEEP|  KEEP|  KEEP|
    # +---+------+------+------+