Search code examples
dataframepyspark.when

When condition in Pyspark dataframe


I have the below dataframe

Column_1     Column_2       Column_3
A            1
A            2
A            3
A            4
A            5
B            1
B            4
B            5
C            1
C            2

I have to populate Column_3 based on values in Column_1 and Column_2. If Column_1 in ('A','B') and Column_2 not in ('1','3','5') i have to populate column_3 with X else Y.

Expected Output:

Column_1     Column_2       Column_3
A            1              Y
A            2              X
A            3              Y
A            4              X
A            5              Y
B            1              Y            
B            4              X
B            5              Y
C            1              Y
C            2              Y

What I tried:

I tried with when and otherwise statement, but not sure how to use Not in along with the when statement. Any help on this would be highly appreciated


Solution

  • You can leverage isin and inverse ~:

    import pyspark.sql.functions as F
    
    c = (F.when(df['Column_1'].isin(['A','B']) & 
          (~df['Column_2'].isin([1,3,5])),'X').otherwise('Y'))
    df.withColumn("Column_3",c).show()
    

    Or:

    expr = """CASE 
            WHEN Column_1 IN ('A','B') and Column_2 NOT IN (1,3,5) 
            THEN 'X' ELSE 'Y' 
            END as Column_3"""
    df.selectExpr("*",expr).show()
    

    +--------+--------+--------+
    |Column_1|Column_2|Column_3|
    +--------+--------+--------+
    |       A|       1|       Y|
    |       A|       2|       X|
    |       A|       3|       Y|
    |       A|       4|       X|
    |       A|       5|       Y|
    |       B|       1|       Y|
    |       B|       4|       X|
    |       B|       5|       Y|
    |       C|       1|       Y|
    |       C|       2|       Y|
    +--------+--------+--------+
    

    More details:

    df['Column_1'].isin(['A','B'])
    #Column<b'(Column_1 IN (A, B))'>
    ~df['Column_2'].isin([1,3,5])
    #Column<b'(NOT (Column_2 IN (1, 3, 5)))'>