apache-spark, spark-csv

Spark - handle blank values in CSV file


Let's say I've got a simple pipe-delimited file with missing values:

a|b||d

I read that into a dataframe:

val foo = spark.read.format("csv").option("delimiter", "|").load("/path/to/my/file.txt")

The missing third column, instead of being a null value, has the string null:

+---+---+----+---+
|_c0|_c1| _c2|_c3|
+---+---+----+---+
|  a|  b|null|  d|
+---+---+----+---+

I would like to be able to control what Spark does with that missing value. According to the CSV Documentation, there is a nullValue option:

nullValue: string that indicates a null value; any fields matching this string will be set as nulls in the DataFrame.

But since I don't have any value there, I can't seem to control it that way.
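
For example (the "???" here is an arbitrary placeholder; there's nothing in a blank field for it to match):

val foo = spark.read
  .format("csv")
  .option("delimiter", "|")
  .option("nullValue", "???") // no literal token in the blank field to match
  .load("/path/to/my/file.txt")

As far as I can tell, setting this makes no difference when the field is simply blank.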

Short of testing every single field of every single file I read in, is there a way to control what Spark does with these blanks? For example, replacing them with 'N/A'.

We're using Spark 2.1, if it's relevant.


Solution

  • The nullValue and emptyValue options do the opposite of what you're expecting: they let you specify values that, if encountered in the source data, should be turned into null or "" (respectively) in the resulting DataFrame. For instance, if your input was instead

    a|b|N/A|d

    and you set

    option("nullValue", "N/A")

    you would wind up once again with

    +---+---+----+---+
    |_c0|_c1| _c2|_c3|
    +---+---+----+---+
    |  a|  b|null|  d|
    +---+---+----+---+
    

    More to the point, Spark includes functions for dealing with null values in the class DataFrameNaFunctions. For example, df.na.fill("N/A") will replace all null instances in string columns with "N/A".
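
    Putting that together, here's a minimal sketch (the session setup and the filled/show steps are assumed boilerplate; the path and "N/A" come from your question):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("csv-blanks").getOrCreate()

    // Blank fields arrive as genuine nulls, so fill them after the read.
    val df = spark.read
      .format("csv")
      .option("delimiter", "|")
      .load("/path/to/my/file.txt")

    val filled = df.na.fill("N/A") // fills nulls in string columns only
    filled.show()

    Note that fill("N/A") on its own only touches string columns; fill also has overloads taking a column list or a Map if you need per-column control.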