pyspark coalesce

How to replace missing values with values from another column in PySpark?


I want to use the values in t5 to replace the missing values in t4. I searched for code, but it doesn't work for me.

example of target

df is a DataFrame. Code:

pdf = df.toPandas()  

from pyspark.sql.functions import coalesce
pdf.withColumn("t4", coalesce(pdf.t4, pdf.t5))

 Error: 'DataFrame' object has no attribute 'withColumn'

I also tried the following code previously; it didn't work either.

new_pdf=pdf['t4'].fillna(method='bfill', axis="columns")

Error: No axis named columns for object type


Solution

  • As the error indicates, .withColumn() is a method of Spark DataFrames, not pandas DataFrames. Calling .toPandas() converts df into a pandas DataFrame, so pdf no longer has that method. If you want to use .withColumn() with coalesce, skip the conversion and call it on the Spark DataFrame df directly.

    UPDATE: If pdf is a pandas dataframe you can do:

    pdf['t4'] = pdf['t4'].fillna(pdf['t5'])