Search code examples
pythoncsvpyspark

Read value with hidden carriage return in boolean field as boolean


I'm trying to read (dirty) CSV files from a cloud storage with PySpark which have <boolean>\r values at the end of the line sometimes. This is not always the case, the column also contains correct booleans or even nothing (=null).

I get a schema from another place which specifies the column types (I have 200+ columns, the one in question is the last one). The column type for the last column is boolean - but since PySpark cannot interpret True\r values as True (same for False\r) I get nulls where I don't want to get nulls.

Data example:

id,mybool
1,True\r
2,False
3,
4,False\r
5,True
6,True\r

How can I ensure that PySpark can interpret the values?

I was thinking of editing the schema that I get and change the DataType to String, clean it up and cast it to Boolean, but I was hoping that there is a better way.


Solution

  • I went with the following option for now - replacing the DataType with StringType() and then using regex_replace to clean up the column:

    df = df.withColumn("mybool", regexp_replace("mybool", r"\r", "").cast(BooleanType()))