Search code examples
pythoncsvdataframepyspark

How to remove the double quote when the value is empty in Spark?


I would like to remove the double-quotes "" when the value is empty when registering my CSV in S3 with the method df.write.save() of Spark

Spark Version : 2.4.0

Python Version : 3.6.5

Here is my code in Python for load the csv file:

df = spark.read.load(
    path('in'),
    format = 'csv',
    delimiter = '|',
    encoding = 'utf-8',
    header = 'true'
)

The loaded CSV file:

|id|first_name|last_name|zip_code|
|1 |          |Elsner   |57315   |
|2 |Noelle    |         |        |
|3 |James     |Moser    |48256   |

Here is my code in Python for write the csv file:

df.write.save(
    path('out'),
    format = 'csv',
    delimiter = '|',
    header = 'true'
)

The CSV file written:

|id|first_name|last_name|zip_code|
|1 |""        |Elsner   |57315   |
|2 |Noelle    |""       |""      |
|3 |James     |Moser    |48256   |

How to remove the double quote when registering?

Thank you very much by advance.


Solution

  • You have empty string in your data frame, if you want to write them as nulls, you can replace empty string with null, and then set nullValues=None when saving it:

    df.replace('', None)              # replace empty string with null
      .write.save(
        path('out'), 
        format='csv', 
        delimiter='|', 
        header=True, 
        nullValue=None                # write null value as None
      )
    

    And it will saved as:

    id|first_name|last_name|zip_code
    1||Elsner|57315
    2|Noelle||
    3|James|Moser|48256