Tags: apache-spark, pyspark, apache-spark-sql, aws-glue

How to handle Pipe and escape characters while reading pipe delimited files in PySpark


I have a pipe-delimited file. How do I handle backslash and pipe characters inside the quoted content?

Here is the input:

col1|col2|col3|col4   
123|"\|"|""|2020.01.31 12:00:30  
456|"\"|""|2020.01.31 12:00:30  
678|"|"|""|2020.01.31 12:00:30 

I tried with the options below:

data =  spark.read.format("com.databricks.spark.csv")\
.option("inferSchema", "true")\
.option('header','true')\
.option('delimiter','|')\
.option("quote", '"')\
.option("escape","|")\
.option("escape", "\\")\
.option("timestampFormat", "yyyy.mm.dd hh:mm:ss")\
.load('s3://dummybucket/a.csv')

I got this output:

col1|col2|col3|col4   
123|"\\|"|""|2020-01-31T00:00:30.000Z  
456|"\"|\""|2020-01-31T00:00:30.000Z|""  
678|"|"|""|2020-01-31T00:00:30.000Z

Expected output:

col1|col2|col3|col4   
123|"\|"|""|2020-01-31T00:00:30.000Z   
456|"\"|""|2020-01-31T00:00:30.000Z    
678|"|"|""|2020-01-31T00:00:30.000Z   

Solution

  • You have declared escape twice, but the option can only be set once per read — the second call simply overrides the first. Remove one of these:

    .option("escape","|") \
    .option("escape", "\\") \
    
  • By default, the value of escape is \ if not defined (see the Spark CSV data source options).

  • You have defined your quote character as ". An escape character is used to escape a quote character inside a quoted value. For example, with quote " and escape \, "show this \"" would yield show this ".

  • For this particular input, you need to change your escape to a different character such as # — or any character that never appears immediately before your quote character ". You can't use |, because it appears before a quote in data row 1; and you can't use \, because it appears before a quote in data row 2.

  • You also have to set escape explicitly, because its default value of \ would break data row 2 on its own.
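The effect of the escape-character choice can be sanity-checked outside Spark with Python's stdlib csv module, whose escapechar plays a loosely analogous role. Spark's CSV reader uses the univocity parser, so this is an illustration of the principle, not the same engine:

```python
import csv
import io

# The question's input, reproduced verbatim.
raw = (
    'col1|col2|col3|col4\n'
    '123|"\\|"|""|2020.01.31 12:00:30\n'
    '456|"\\"|""|2020.01.31 12:00:30\n'
    '678|"|"|""|2020.01.31 12:00:30\n'
)

def parse(escapechar):
    """Parse the sample with delimiter '|' and quote '"'."""
    return list(csv.reader(io.StringIO(raw), delimiter='|',
                           quotechar='"', escapechar=escapechar))

# With '#' as escape (a character that never appears in the data),
# every quoted field survives intact: '\|', '\' and '|' come back as-is.
good = parse('#')
for row in good[1:]:
    print(row)

# With '\' as escape, the backslash in data row 1 is treated as an
# escape and swallowed: '"\|"' parses as '|' instead of '\|'.
broken = parse('\\')
print(broken[1][1])
```

In the Spark reader, the corresponding fix is a single `.option("escape", "#")` — choosing `#` only on the assumption that it never appears in the real data.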