
Loss of data when using PySpark filter/select with when and otherwise


I have a dataframe df1:

A     B
null  test 1
none  test 3
AAAA  test 2
BBBB  test 4
CCCC  test 5

I want to keep only 3 values for column A: null, AAAA and BBBB. I wrote this filter:

from pyspark.sql import functions as F
from pyspark.sql.functions import col

df2 = df1.select(col('*'), F.when(~col("A").isin(['AAAA', 'BBBB']), "null")
                            .otherwise(col("A")).alias("C"))

I want to get this:

A     C     B
null  null  test 1
none  null  test 3
AAAA  AAAA  test 2
BBBB  BBBB  test 4
CCCC  null  test 5

but what I got is different; I lost all the rows where the value equals AAAA:

A     C     B
null  null  test 1
none  null  test 3
BBBB  BBBB  test 4
CCCC  null  test 5

Do you know what the problem is? Thanks.


Solution

  • This code works for deriving column C from A, keeping the values [AAAA, BBBB] and replacing everything else (including NULL) with NULL:

    from pyspark.sql import functions as F
    
    # Keep A's value when it is AAAA or BBBB; replace everything else with NULL
    df1 = df1.withColumn(
        'C',
        F.when(F.col('A').isin(['AAAA', 'BBBB']), F.col('A'))
         .otherwise(F.lit(None))
    )
    

    You weren't filtering rows at all; you were creating a new column, C, based on the values of A.
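
    If you actually wanted to drop the rows whose A value is neither AAAA, BBBB nor NULL, a row filter would look like the sketch below (df_filtered is a hypothetical name). Note that isin() evaluates to NULL rather than False for NULL inputs, and filter() drops rows whose predicate is not true, so the NULL rows must be matched explicitly with isNull():

    from pyspark.sql import functions as F

    # Keep only the rows where A is AAAA, BBBB, or NULL.
    # isin() alone would silently drop the NULL rows, because
    # NULL IN (...) evaluates to NULL, which filter() treats
    # like False; the isNull() branch keeps those rows.
    df_filtered = df1.filter(
        F.col('A').isin(['AAAA', 'BBBB']) | F.col('A').isNull()
    )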