I have a CSV file in which one column contains double quotes. While reading and writing I have to remove the outer quotes. Please guide me on how I can do this.
Example-
df:
col1
"xyznm""cxvb"
I want below output-
col1
xyznm""cxvb
I have written below code for this-
df = spark.read.format("com.databricks.spark.csv") \
    .option("delimiter", "|") \
    .option("header", "true") \
    .option("escape", '"') \
    .load("my_path")
df.show()
df.write.format("com.databricks.spark.csv") \
    .mode("overwrite") \
    .save("path", sep="|", escape='"', header="true", nullValue=None)
One possible workaround is to strip the leading and trailing quotes after reading in your CSV.
Let's say you load this df:
df = spark.createDataFrame(["\"xyznm\"\"cxvb\"","1\"1\"","\"13"], "string").toDF("col1")
+-------------+
| col1|
+-------------+
|"xyznm""cxvb"|
| 1"1"|
| "13|
+-------------+
Then you can use the following regex to remove outer quotes:
from pyspark.sql import functions as F

# ^"+ matches a run of quotes at the start, "+$ a run at the end;
# quotes in the middle of the value are left untouched.
df.select(F.regexp_replace('col1', '^"+|"+$', '').alias('col1')).show()
+-----------+
| col1|
+-----------+
|xyznm""cxvb|
| 1"1|
| 13|
+-----------+
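If you want to sanity-check the pattern before running it on a cluster, the same regex behaves identically in plain Python's `re` module (Spark's `regexp_replace` uses Java regex, but `^"+|"+$` is valid in both). A minimal sketch using the sample values from the df above:

```python
import re

# Remove only the leading and trailing runs of double quotes;
# quotes inside the value are preserved.
pattern = re.compile(r'^"+|"+$')

samples = ['"xyznm""cxvb"', '1"1"', '"13']
cleaned = [pattern.sub('', s) for s in samples]
print(cleaned)  # ['xyznm""cxvb', '1"1', '13']
```

Note that on a row like `"xyznm""cxvb"` the inner `""` survives, which matches the output you asked for.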