Tags: apache-spark, pyspark, apache-spark-sql, aws-glue

How to handle Pipe and escape characters while reading pipe delimited files in PySpark


I have a pipe-delimited file. How do I handle backslash and pipe characters inside the quoted content?

Here is the input:

col1|col2|col3|col4   
123|"\|"|""|2020.01.31 12:00:30  
456|"\"|""|2020.01.31 12:00:30  
678|"|"|""|2020.01.31 12:00:30 

I tried with the options below:

data =  spark.read.format("com.databricks.spark.csv")\
.option("inferSchema", "true")\
.option('header','true')\
.option('delimiter','|')\
.option("quote", '"')\
.option("escape","|")\
.option("escape", "\\")\
.option("timestampFormat", "yyyy.mm.dd hh:mm:ss")\
.load('s3://dummybucket/a.csv')

I got this output:

col1|col2|col3|col4   
123|"\\|"|""|2020-01-31T00:00:30.000Z  
456|"\"|\""|2020-01-31T00:00:30.000Z|""  
678|"|"|""|2020-01-31T00:00:30.000Z

Expected output:

col1|col2|col3|col4   
123|"\|"|""|2020-01-31T00:00:30.000Z   
456|"\"|""|2020-01-31T00:00:30.000Z    
678|"|"|""|2020-01-31T00:00:30.000Z   

Solution

  • You have declared escape twice, but the option can only be set once per read — the second call simply overrides the first. Remove one of these:

    .option("escape","|") \
    .option("escape", "\\") \
    
  • By default, the value of escape is \ if not defined (see the Spark CSV data source options).

  • You have defined your quote character as ". An escape character is used to escape a quote character inside a quoted value. For example, with quote " and escape \, "show this \"" would yield show this ".

  • For this particular input, you need to change your escape to a different character such as # — or any character that never appears immediately before your quote character ". You can't use |, because it appears before a quote in data row 1; and you can't use \, because it appears before a quote in data row 2.

  • You also have to set escape explicitly, because its default value of \ would break data row 2 on its own.
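The effect of the escape-character choice can be sanity-checked outside Spark with Python's stdlib csv module, whose escapechar plays a loosely analogous role. Spark's CSV reader uses the univocity parser, so this is an illustration of the principle, not the same engine:

```python
import csv
import io

# The question's input, reproduced verbatim.
raw = (
    'col1|col2|col3|col4\n'
    '123|"\\|"|""|2020.01.31 12:00:30\n'
    '456|"\\"|""|2020.01.31 12:00:30\n'
    '678|"|"|""|2020.01.31 12:00:30\n'
)

def parse(escapechar):
    """Parse the sample with delimiter '|' and quote '"'."""
    return list(csv.reader(io.StringIO(raw), delimiter='|',
                           quotechar='"', escapechar=escapechar))

# With '#' as escape (a character that never appears in the data),
# every quoted field survives intact: '\|', '\' and '|' come back as-is.
good = parse('#')
for row in good[1:]:
    print(row)

# With '\' as escape, the backslash in data row 1 is treated as an
# escape and swallowed: '"\|"' parses as '|' instead of '\|'.
broken = parse('\\')
print(broken[1][1])
```

In the Spark reader, the corresponding fix is a single `.option("escape", "#")` — choosing `#` only on the assumption that it never appears in the real data.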