I have a dataframe `df1`:

| A    | B      |
|------|--------|
| null | test 1 |
| none | test 3 |
| AAAA | test 2 |
| BBBB | test 4 |
| CCCC | test 5 |
I want to keep only 3 values for column A (null, AAAA and BBBB), so I wrote this filter:
```python
from pyspark.sql import functions as F
from pyspark.sql.functions import col

df2 = df1.select(col('*'),
                 F.when(~col("A").isin(['AAAA', 'BBBB']), "null")
                  .otherwise(col("A")).alias("C"))
```
This is what I want to get:

| A    | C    | B      |
|------|------|--------|
| null | null | test 1 |
| none | null | test 3 |
| AAAA | AAAA | test 2 |
| BBBB | BBBB | test 4 |
| CCCC | null | test 5 |
but what I got is different; I lost all the rows with the value AAAA:

| A    | C    | B      |
|------|------|--------|
| null | null | test 1 |
| none | null | test 3 |
| BBBB | BBBB | test 4 |
| CCCC | null | test 5 |
Do you know what the problem is? Thanks.
Here is working code for building column C from column A, keeping the values [AAAA, BBBB] and mapping everything else (including real NULLs) to NULL:

```python
from pyspark.sql import functions as F

df1 = df1.withColumn('C', F.when(F.col('A').isin(['AAAA', 'BBBB']), F.col('A'))
                           .otherwise(F.lit(None)))
```
Note that your code wasn't filtering any rows; it was creating a new column C based on the values of A. Also, your `when` branch produced the literal string `"null"` rather than an actual NULL value; use `F.lit(None)` if you want a real NULL.
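One subtlety worth understanding here is Spark SQL's three-valued logic: `isin` on a NULL value yields NULL (not false), `~NULL` is still NULL, and `when()` only takes its branch when the condition is exactly true, so NULL conditions fall through to `otherwise()`. A minimal plain-Python sketch of those rules (the helper names are made up for illustration; they are not Spark APIs):

```python
# Emulation of Spark SQL's three-valued logic for isin / NOT / when().
# None stands in for SQL NULL; these helpers are illustrative only.

def sql_isin(value, values):
    # NULL.isin(...) yields NULL, never True or False
    if value is None:
        return None
    return value in values

def sql_not(cond):
    # NOT NULL is still NULL
    if cond is None:
        return None
    return not cond

def when_otherwise(cond, then_value, else_value):
    # when() takes its branch only if the condition is exactly True;
    # a NULL condition falls through to otherwise()
    return then_value if cond is True else else_value

# Desired mapping for column C, using a real None instead of the string "null"
for a in [None, 'AAAA', 'BBBB', 'CCCC']:
    c = when_otherwise(sql_not(sql_isin(a, ['AAAA', 'BBBB'])), None, a)
    print(a, '->', c)
```

Running this shows that the NULL row lands in the `otherwise()` branch and keeps its NULL, while CCCC is replaced by NULL and AAAA/BBBB pass through, which matches the expected C column.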