I have a pipe-delimited file. How do I handle backslashes and pipes inside the content?
Here is the input:
col1|col2|col3|col4
123|"\|"|""|2020.01.31 12:00:30
456|"\"|""|2020.01.31 12:00:30
678|"|"|""|2020.01.31 12:00:30
I tried with the below options:
data = spark.read.format("com.databricks.spark.csv")\
.option("inferSchema", "true")\
.option('header','true')\
.option('delimiter','|')\
.option("quote", '"')\
.option("escape","|")\
.option("escape", "\\")\
.option("timestampFormat", "yyyy.mm.dd hh:mm:ss")\
.load('s3://dummybucket/a.csv')
I got this output:
col1|col2|col3|col4
123|"\\|"|""|2020-01-31T00:00:30.000Z
456|"\"|\""|2020-01-31T00:00:30.000Z|""
678|"|"|""|2020-01-31T00:00:30.000Z
Expected output
col1|col2|col3|col4
123|"\|"|""|2020-01-31T00:00:30.000Z
456|"\"|""|2020-01-31T00:00:30.000Z
678|"|"|""|2020-01-31T00:00:30.000Z
You have declared escape twice; however, the option can be defined only once for a dataset, so you will need to define it only once:
.option("escape", "|") \
.option("escape", "\\") \
By default, the value for escape is \ if not defined (link here).
You have defined your quote character as ". An escape character is used to escape a quote character: for example, "show this \"" would yield show this " if the quote character was " and the escape was \.
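This quote/escape interaction can be seen with Python's stdlib csv module (an illustration only, not Spark, but the reader's quote and escape options follow the same semantics):

```python
import csv
import io

# One record whose quoted field contains an escaped quote: "show this \""
# With quotechar '"' and escapechar '\', the \" yields a literal " in the field.
line = '123|"show this \\""|x\n'
row = next(csv.reader(io.StringIO(line),
                      delimiter='|', quotechar='"', escapechar='\\'))
print(row)  # ['123', 'show this "', 'x']
```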
For this particular example, you will need to change your escape either to a control character such as #, or to any value which does not appear before your quote character ". Hence you can't use |, as this appears before a quote on line 1; nor can you use \, as that appears before a quote on line 2.
You do have to specify a value for escape, though, since the default value is \, which would break line 2 for you.
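A minimal sketch of the fix, shown with Python's stdlib csv module so it runs without a Spark cluster (in Spark this corresponds to a single .option("escape", "#") alongside .option("quote", '"')): with # as the escape character, which never occurs in the data, neither \ nor | gets misread and all three sample rows parse as expected.

```python
import csv
import io

# The three data rows from the question.
raw = (
    '123|"\\|"|""|2020.01.31 12:00:30\n'
    '456|"\\"|""|2020.01.31 12:00:30\n'
    '678|"|"|""|2020.01.31 12:00:30\n'
)

# '#' is the escape character and never appears in the data, so backslash
# and pipe inside quoted fields are taken literally.
rows = list(csv.reader(io.StringIO(raw),
                       delimiter='|', quotechar='"', escapechar='#'))
for row in rows:
    print(row)
# ['123', '\\|', '', '2020.01.31 12:00:30']
# ['456', '\\', '', '2020.01.31 12:00:30']
# ['678', '|', '', '2020.01.31 12:00:30']
```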