Search code examples
apache-sparkpysparkdatabricksdelimiter

Pyspark reading csv delimiter not parsed for some data


csv_df = spark.read.option("header", "true") 
              .csv(path, sep = '┐') 

A small portion of the data cannot be parsed correctly and ends up all in the first column in format "str┐str┐str┐str┐str┐str┐str┐str" and the other columns are null. The number of delimiters in these rows are the same as those that were parsed correctly. There are also nothing else special about the rows that were not parsed from I could tell. Any idea what might be causing this and how to fix?

An example that failed parsing:

FUNDACAO ESTATAL SAUDE DA FAMILIA FESF┐VIP┐BR┐Creative Cloud All Apps┐PAID┐SMB┐1┐1┐2┐2022-07-29

Solution

  • I'll go through the steps I've taken so far to debug and offer what I think is the most likely solution.

    I created a one row .csv file with '┐' as the delimiting character (and also a header row):

    Header1┐Header2┐Header3┐Header4┐Header5┐Header6┐Header7┐Header8┐Header9┐Header10
    FUNDACAO ESTATAL SAUDE DA FAMILIA FESF┐VIP┐BR┐Creative Cloud All Apps┐PAID┐SMB┐1┐1┐2┐2022-07-29
    

    And when I run the line:

    csv_df = spark.read.option("header", "true").csv(path, sep = '┐') 
    

    The dataframe loads correctly:

    +--------------------+-------+-------+--------------------+-------+-------+-------+-------+-------+----------+
    |             Header1|Header2|Header3|             Header4|Header5|Header6|Header7|Header8|Header9|  Header10|
    +--------------------+-------+-------+--------------------+-------+-------+-------+-------+-------+----------+
    |FUNDACAO ESTATAL ...|    VIP|     BR|Creative Cloud Al...|   PAID|    SMB|      1|      1|      2|2022-07-29|
    +--------------------+-------+-------+--------------------+-------+-------+-------+-------+-------+----------+
    

    However, if I put quotation marks around the first non-header row, then this will escape all of the '┐'delimiter symbols in that row, so they won't be parsed.

    Header1┐Header2┐Header3┐Header4┐Header5┐Header6┐Header7┐Header8┐Header9┐Header10
    "FUNDACAO ESTATAL SAUDE DA FAMILIA FESF┐VIP┐BR┐Creative Cloud All Apps┐PAID┐SMB┐1┐1┐2┐2022-07-29"
    

    This will lead to the behavior you observed when you try to load the csv file:

    +--------------------+-------+-------+-------+-------+-------+-------+-------+-------+--------+
    |             Header1|Header2|Header3|Header4|Header5|Header6|Header7|Header8|Header9|Header10|
    +--------------------+-------+-------+-------+-------+-------+-------+-------+-------+--------+
    |FUNDACAO ESTATAL ...|   null|   null|   null|   null|   null|   null|   null|   null|    null|
    +--------------------+-------+-------+-------+-------+-------+-------+-------+-------+--------+
    

    Therefore, I think your csv file most likely has quotation marks around your row, or that there are one or more characters inside your csv file next to the location of that row that are causing the problem – the problem is probably with the csv file itself rather than the pyspark csv parser.