dataframe, apache-spark, pyspark, pyspark-pandas

How to remove quotes from a column in a PySpark DataFrame?


I have a CSV file in which I am getting double quotes in a column. While reading and writing, I have to remove those quotes. Please guide me on how I can do it.

Example:

df:

col1
"xyznm""cxvb"

I want the below output:

col1
xyznm""cxvb

I have written the below code for this:

df = (spark.read.format("com.databricks.spark.csv")
      .option("delimiter", "|")
      .options(header="true", escape='"')
      .load("my_path"))

df.show()

df.write.format("com.databricks.spark.csv").mode("overwrite").save(
    r"path", sep="|", escape='"', header="True", nullValue=None
)

Solution

  • One possible workaround is to remove the leading and trailing quotes after reading in your CSV (a full read-to-write sketch follows at the end of this answer).

    Let's say you load this df:

    df = spark.createDataFrame(['"xyznm""cxvb"', '1"1"', '"13'], "string").toDF("col1")
    
    +-------------+
    |         col1|
    +-------------+
    |"xyznm""cxvb"|
    |         1"1"|
    |          "13|
    +-------------+
    

    Then you can use the following regex to remove the outer quotes: ^"+ matches a run of quotes at the start of the string, and "+$ matches a run at the end, so only the inner doubled quotes survive:

    from pyspark.sql import functions as F
    
    df.select(F.regexp_replace('col1', '^"+|"+$', '').alias('col1')).show()
    
    +-----------+
    |       col1|
    +-----------+
    |xyznm""cxvb|
    |        1"1|
    |         13|
    +-----------+
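
    Putting the pieces together, here is a minimal end-to-end sketch under the same assumptions as the question (a pipe-delimited file with a header); "my_path" and "out_path" are placeholder paths. Quote parsing is turned off on read (quote set to the empty string) so the raw quote characters reach the regex intact; adjust the options to match your actual file.

    from pyspark.sql import functions as F

    # Read with quote parsing disabled so the raw quotes survive into the column.
    df = (spark.read.format("csv")
          .option("delimiter", "|")
          .option("header", "true")
          .option("quote", "")
          .load("my_path"))

    # Strip only the leading and trailing quotes; inner doubled quotes are kept.
    cleaned = df.withColumn("col1", F.regexp_replace("col1", '^"+|"+$', ""))

    # Write back out; an empty quote string keeps Spark from re-quoting values.
    (cleaned.write.format("csv")
     .mode("overwrite")
     .option("sep", "|")
     .option("header", "true")
     .option("quote", "")
     .save("out_path"))

    If you prefer SQL syntax, Spark's TRIM with a trim string should give the same result for these values:

    cleaned = df.selectExpr("""trim(BOTH '"' FROM col1) AS col1""")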