Tags: loop, pyspark, replace, data-cleaning

pyspark remove empty values from all columns and replace them with null


I am trying to clean my dataset of missing values. The rows contain values like:

ID  A      B
1          324
2   Breda
3   null   34556

I would like to see null in A1 and B2 (and so on) without doing the cleaning column by column. I would like to loop over all columns without specifying the column names.

I have found this code, but the last line returns an error. My table name is custom:

def replaceEmptyCols(columns:Array[String]):Array[Column]={
    columns.map(c => {
      when(col(c)=="" ,null).otherwise(col(c)).alias(c)
    })
  }
custom.select(replaceEmptyCols(custom.columns):_*).show()

The error is :

SyntaxError: invalid syntax (, line 6)
  File "<command-447346330485202>", line 6
    custom.select(replaceEmptyCols(custom.columns):_*).show()
                                                  ^
SyntaxError: invalid syntax

Solution

  • Maybe you are looking for something like this?

    custom = spark.createDataFrame(
        [('1','','324')
        ,('2','Breda','')
        ,('3',None,'34556')
        ],
        ['ID','A','B']
    )
    
    custom.show()
    
    # +---+-----+-----+
    # | ID|    A|    B|
    # +---+-----+-----+
    # |  1|     |  324|
    # |  2|Breda|     |
    # |  3| null|34556|
    # +---+-----+-----+
    
    import pyspark.sql.functions as F


    def replaceEmptyCols(df, columns: list):
        # Replace empty strings (and existing nulls) in the given columns with a real null
        for c in columns:
            df = df.withColumn(
                c,
                F.when((F.col(c) == '') | F.col(c).isNull(), F.lit(None))
                 .otherwise(F.col(c))
            )
        return df
    
    replaceEmptyCols(custom, [c for c in custom.columns if c not in ['ID']]).show()
            
    # +---+-----+-----+
    # | ID|    A|    B|
    # +---+-----+-----+
    # |  1| null|  324|
    # |  2|Breda| null|
    # |  3| null|34556|
    # +---+-----+-----+
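
  • If you prefer the single-select style of the Scala snippet you found, a rough Python
    equivalent is sketched below (assuming the same custom DataFrame as above; the helper
    name emptyToNullCols is just one I'm introducing here). Python unpacks a list of column
    expressions with *, which plays the role of Scala's :_*.

    import pyspark.sql.functions as F

    def emptyToNullCols(columns):
        # Build one expression per column: empty string -> real null, everything else unchanged
        return [F.when(F.col(c) == '', F.lit(None)).otherwise(F.col(c)).alias(c) for c in columns]

    # * unpacks the list of Columns into select(), the Python counterpart of Scala's :_*
    custom.select(*emptyToNullCols(custom.columns)).show()
    # (same output as the withColumn loop above)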