I'm trying to read (dirty) CSV files from cloud storage with PySpark. Some lines end with a stray carriage return, so the last column sometimes contains values like True\r. This is not always the case: the column also contains correct booleans, or nothing at all (= null).
I get a schema from another place which specifies the column types (I have 200+ columns; the one in question is the last one). The column type for the last column is boolean, but since PySpark cannot interpret True\r as True (same for False\r), I get nulls where I don't want nulls.
Data example:
id,mybool
1,True\r
2,False
3,
4,False\r
5,True
6,True\r
How can I ensure that PySpark can interpret the values?
I was thinking of editing the schema I get: change the DataType to String, clean the values up, and cast the column back to Boolean. But I was hoping there is a better way.
I went with the following option for now: replacing the DataType with StringType() and then using regexp_replace to clean up the column:
from pyspark.sql.functions import regexp_replace
from pyspark.sql.types import BooleanType

df = df.withColumn("mybool", regexp_replace("mybool", r"\r", "").cast(BooleanType()))
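For illustration, the string-level cleanup this performs can be sketched outside Spark with the standard library; clean_bool is a hypothetical helper that mirrors regexp_replace followed by Spark's string-to-boolean cast (case-insensitive "true"/"false", anything else becomes null):

```python
import re

def clean_bool(raw):
    # Strip carriage returns, then apply the same mapping as a
    # string-to-boolean cast: "true"/"false" (case-insensitive)
    # convert, everything else (including "") becomes None.
    if raw is None:
        return None
    cleaned = re.sub(r"\r", "", raw).strip().lower()
    if cleaned == "true":
        return True
    if cleaned == "false":
        return False
    return None

# Values from the data example above:
for raw in ["True\r", "False", "", "False\r", "True", "True\r"]:
    print(clean_bool(raw))
```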