Tags: python, apache-spark, pyspark, azure-databricks

How to write a SQL query in PySpark?


I have a SQL query that I am trying to convert into PySpark. The query looks like this:

UPDATE EMPLOYEE_TABLE SET A_CONVERSION_TEMPLATE.[STATUS] = "OTHER/MENTIONED" WHERE (((EMPLOYEE_TABLE.[HOLIDAY_TAKEN]) Is Null));

Second Query:

UPDATE EMPLOYEE_TABLE SET EMPLOYEE_TABLE.[STATUS] = "OTHER/MENTIONED" WHERE (((EMPLOYEE_TABLE.STATUS) Is Null));

Third query, with two conditions:

UPDATE EMPLOYEE_TABLE SET EMPLOYEE_TABLE.[LANGUAGE] = "ENGLISH"
WHERE (((EMPLOYEE_TABLE.[LANGUAGE])<>"ENGLISH") AND ((EMPLOYEE_TABLE.FLAGYESNO)="Y"));

For the PySpark version, I have combined the first and second queries because both of them update the same column. The query I am using is:

df = df.withColumn('STATUS', F.when((col('STATUS') == ''), 'OTHER/MENTIONED').
     otherwise(when((col('HOLIDAY_TAKEN') == ''), 'OTHER/MENTIONED').
     otherwise('NULL')))

When I run this on my df, it updates everything in the 'STATUS' column to 'NULL', overwriting even the values that were already present.

For the third query, I have no idea how to approach it. Please let me know if you need any other information; I have attached the output for more understanding. Thanks in advance for your responses.


Solution

    1. Create a data frame as follows (a runnable end-to-end sketch that builds this sample data is shown at the end of this answer):

               +----------+----------+-------------+
               |DEPARTMENT|    STATUS|HOLIDAY_TAKEN|
               +----------+----------+-------------+
               |        HR|      null|        TAKEN|
               |   MANAGER|      null|        TAKEN|
               |        HR|   WORKING|      ONLEAVE|
               |  ENGINEER|ONCONTRACT|    SICKLEAVE|
               |        PA|      null|         null|
               |       CEO|   ON VISA|         null|
               |  ENGINEER|      null|         null|
               |   MANAGER|      null|         null|
               +----------+----------+-------------+
      
    2. Then write a single expression that satisfies both conditions in one query (an equivalent DataFrame-API version is sketched after these steps):

       from pyspark.sql import functions as fx

       expression = """CASE
                           WHEN (HOLIDAY_TAKEN IS NULL AND STATUS IS NULL) OR (HOLIDAY_TAKEN IS NOT NULL AND STATUS IS NULL) THEN 'OTHER/MENTIONED'
                           ELSE STATUS
                       END"""

       df2 = df.withColumn('STATUS', fx.expr(expression))
      
       df2.show()
      

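As a side note (this is a suggestion on top of the original answer, not part of it), the same logic can be written with the DataFrame API that the question attempted. Both branches of the CASE above only fire when STATUS is null, so the condition reduces to a simple null check; the key fixes compared with the attempt in the question are using isNull() instead of comparing to an empty string, and keeping the existing STATUS in otherwise() instead of the literal 'NULL':

    from pyspark.sql import functions as F

    # Equivalent to the CASE expression: fill null STATUS values, keep the rest
    df2 = df.withColumn(
        'STATUS',
        F.when(F.col('STATUS').isNull(), F.lit('OTHER/MENTIONED'))
         .otherwise(F.col('STATUS'))
    )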
    For reference: in the resulting df2, every null STATUS is replaced with 'OTHER/MENTIONED', while the existing values (WORKING, ONCONTRACT, ON VISA) are left unchanged.
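To reproduce this end to end, a minimal sketch that builds the sample data from step 1 and applies the expression from step 2 could look like the following (the local SparkSession setup and the app name are assumptions for illustration only):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as fx

    # Local session purely for illustration
    spark = SparkSession.builder.appName("status-update-example").getOrCreate()

    # Step 1: sample rows copied from the table above
    data = [
        ("HR",       None,         "TAKEN"),
        ("MANAGER",  None,         "TAKEN"),
        ("HR",       "WORKING",    "ONLEAVE"),
        ("ENGINEER", "ONCONTRACT", "SICKLEAVE"),
        ("PA",       None,         None),
        ("CEO",      "ON VISA",    None),
        ("ENGINEER", None,         None),
        ("MANAGER",  None,         None),
    ]
    df = spark.createDataFrame(
        data, schema="DEPARTMENT string, STATUS string, HOLIDAY_TAKEN string"
    )

    # Step 2: null STATUS values become 'OTHER/MENTIONED', everything else is kept
    expression = """CASE
                        WHEN (HOLIDAY_TAKEN IS NULL AND STATUS IS NULL)
                          OR (HOLIDAY_TAKEN IS NOT NULL AND STATUS IS NULL) THEN 'OTHER/MENTIONED'
                        ELSE STATUS
                    END"""

    df.withColumn('STATUS', fx.expr(expression)).show()

Running this should print 'OTHER/MENTIONED' for the five rows whose STATUS was null and leave the other three rows (WORKING, ONCONTRACT, ON VISA) unchanged.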