Tags: python-3.x, dataframe, pyspark, azure-databricks

Handling null values in a PySpark dataframe


I have a PySpark dataframe with some data, and I want to take a substring of one of its columns; that column also contains some null values. Here is my dataframe:

+--------------+
|          Name|
+--------------+
| Asia201909284|
|    US20190928|
|Europ201909287|
|          null|
|     something|
|       nothing|
+--------------+

All I want is to strip the Asia, US, and Europ prefixes from the Name column.

Here is the code I have already tried:

from pyspark.sql.functions import col, udf, when

fun_asia = udf(lambda x: x[4:len(x)])
fun_us = udf(lambda x: x[2:len(x)])
fun_europ = udf(lambda x: x[5:len(x)])
df1.withColumn("replace", \
               when(df1.Name.isNull(),df1.Name)\
               .when(df1.Name.like("Asia%"),fun_asia(col('Name')))\
               .when(df1.Name.like("US%"),fun_us(col('Name')))\
               .when(df1.Name.like("Europ%"),fun_europ(col('Name')))
               .otherwise(df1.Name)
              ).show()

It works properly if there are no null values in that column, but when the column contains a null it throws an error because len() cannot be computed on a null value.

Error message:

TypeError: object of type 'NoneType' has no len()

What confuses me is why the UDF is also called for the null value, and how I can overcome this problem and get the result I want. Any help is appreciated.

The actual result that I want:

+--------------+---------+
|          Name|  replace|
+--------------+---------+
| Asia201909284|201909284|
|    US20190928| 20190928|
|Europ201909287|201909287|
|          null|     null|
|     something|something|
|       nothing|  nothing|
+--------------+---------+

Solution

  • One approach is to use a when branch with the isNull() condition to handle the case where the column is null:

    df1.withColumn("replace", \
                   when(df1.Name.like("Asia%"),fun_asia(col('Name')))\
                   .when(df1.Name.like("US%"),fun_us(col('Name')))\
                   .when(df1.Name.like("Europ%"),fun_europ(col('Name')))
                   .when(df1.Name.isNull(), df1.Name)
                   .otherwise(df1.Name)
                  ).show()
    

    EDIT 2:

    You can also change your UDFs to handle nulls themselves. Spark does not guarantee that a Python UDF inside a when branch is evaluated only for rows where that condition holds, which is why the null row still reaches the UDF; making the UDFs null-safe fixes this regardless of the branch order:

    fun_asia = udf(lambda x: x[4:len(x)] if x else None)
    fun_us = udf(lambda x: x[2:len(x)] if x else None)
    fun_europ = udf(lambda x: x[5:len(x)] if x else None)
    df1.withColumn("replace", \
                   when(df1.Name.isNull(),df1.Name)\
                   .when(df1.Name.like("Asia%"),fun_asia(col('Name')))\
                   .when(df1.Name.like("US%"),fun_us(col('Name')))\
                   .when(df1.Name.like("Europ%"),fun_europ(col('Name')))
                   .otherwise(df1.Name)
                  ).show()
    +--------------+---------+
    |          Name|  replace|
    +--------------+---------+
    | Asia201909284|201909284|
    |    US20190928| 20190928|
    |Europ201909287|201909287|
    |          null|     null|
    |     something|something|
    |       nothing|  nothing|
    +--------------+---------+
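
  • A UDF-free alternative (a sketch, not part of the original answer): the built-in regexp_replace function is null-safe, so the prefixes can be stripped without any explicit null handling. Assuming the same df1 and Name column as above, it should produce the same result table:

    from pyspark.sql.functions import regexp_replace

    # regexp_replace returns null for null input rows, so no isNull branch or
    # null-safe UDF is needed; the regex strips a leading Asia/US/Europ prefix.
    df1.withColumn("replace",
                   regexp_replace("Name", r"^(Asia|US|Europ)", "")
                   ).show()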