Tags: python, python-3.x, csv, apache-spark, pyspark

Reading a text file in PySpark with delimiters present within double quotes


I have a text file similar to the example below. I am reading it with the ISO-8859-1 encoding and þ as the separator.

The raw data, in a file named "test.txt", looks like this:

idþnameþroleþ expþ task_descþ comp

1þJohn Doeþ"Senior Developerþ 4þ working on the PySpark project"þ Google

I need the data to look like this:

id  name      role               exp  task_desc                        comp
1   John Doe  "Senior Developer  4    working on the PySpark project"  Google

I am using the code below to read the raw "test.txt" file:

spark_df = spark.read.options(
    multiline='True', quote='"', escape='"',
    encoding='ISO-8859-1', mode='PERMISSIVE',
).csv('test.txt', header=True, sep='þ')

I have also tried the following quote and escape settings:

quote="\"", escape="\""

Is there a solution to this problem in PySpark?


Solution

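  • You need to set quote='' (an empty string) to make it work. An empty quote turns off quote handling in Spark's CSV reader. With quote='"', everything between the two double quotes is read as a single field, so the þ separators inside it are not split and the row ends up with fewer columns than the header. With quoting off, every þ separates a column and the double quotes stay in the data as ordinary characters, for example: spark.read.csv('test.txt', header=True, sep='þ', quote='', encoding='ISO-8859-1')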
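
To see why turning quoting off helps, here is a minimal sketch that parses the sample row from the question twice, once with quote handling and once without. It uses Python's built-in csv module as a stand-in, so it runs without a Spark session; Spark's CSV parser is a different implementation and this only illustrates the mechanism. The helper names parse and read_thorn_file are made up for this example, and read_thorn_file assumes you pass in an existing SparkSession.

```python
import csv
import io

# Sample copied from the question: a header and one data row, þ-delimited.
SAMPLE = (
    'idþnameþroleþ expþ task_descþ comp\n'
    '1þJohn Doeþ"Senior Developerþ 4þ working on the PySpark project"þ Google\n'
)


def parse(text, quoting_enabled):
    """Split þ-delimited text with the csv module (a stand-in, not Spark's parser)."""
    if quoting_enabled:
        # Double quotes group everything between them into one field.
        reader = csv.reader(io.StringIO(text), delimiter='þ', quotechar='"')
    else:
        # Quotes are treated as ordinary characters; every þ splits a field.
        reader = csv.reader(io.StringIO(text), delimiter='þ', quoting=csv.QUOTE_NONE)
    return list(reader)


def read_thorn_file(spark, path):
    """Hypothetical helper: the equivalent PySpark read with quoting turned off.

    Assumes `spark` is an existing SparkSession; not called in this sketch.
    """
    return spark.read.csv(
        path, header=True, sep='þ', quote='', encoding='ISO-8859-1'
    )


quoted = parse(SAMPLE, quoting_enabled=True)
unquoted = parse(SAMPLE, quoting_enabled=False)

print(len(quoted[1]), len(unquoted[1]))  # prints: 4 6
print(unquoted[1])
```

With quote handling on, the data row collapses to four fields against a six-column header; with it off, the row has six fields and the double quotes remain part of the role and task_desc values, which matches the output asked for in the question. The fields also keep their leading spaces (" exp", " 4", " Google"); in Spark, the CSV option ignoreLeadingWhiteSpace may be worth trying if those need to be trimmed.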