dataframe, apache-spark, pyspark, pyspark-pandas

How to remove quotes from a column in a PySpark DataFrame?


I have a CSV file in which I am getting double quotes in a column. While reading and writing, I have to remove those quotes. Please guide me on how I can do it.

Example:

df:

col1
"xyznm""cxvb"

I want the below output:

col1
xyznm""cxvb

I have written the below code for this:

df = (spark.read.format("com.databricks.spark.csv")
      .option("delimiter", "|")
      .options(header="true", escape='"')
      .load("my_path"))

df.show()

df.write.format("com.databricks.spark.csv").mode("overwrite").save(
    r"path", sep="|", escape='"', header="True", nullValue=None
)

Solution

  • One possible workaround is to remove the leading and trailing quotes after reading in your CSV (a full read-to-write sketch follows at the end of this answer).

    Let's say you load this df:

    df = spark.createDataFrame(['"xyznm""cxvb"', '1"1"', '"13'], "string").toDF("col1")
    
    +-------------+
    |         col1|
    +-------------+
    |"xyznm""cxvb"|
    |         1"1"|
    |          "13|
    +-------------+
    

    Then you can use the following regex to remove the outer quotes: ^"+ matches a run of quotes at the start of the string, and "+$ matches a run at the end, so only the inner doubled quotes survive:

    from pyspark.sql import functions as F
    
    df.select(F.regexp_replace('col1', '^"+|"+$', '').alias('col1')).show()
    
    +-----------+
    |       col1|
    +-----------+
    |xyznm""cxvb|
    |        1"1|
    |         13|
    +-----------+
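
    Putting the pieces together, here is a minimal end-to-end sketch under the same assumptions as the question (a pipe-delimited file with a header); "my_path" and "out_path" are placeholder paths. Quote parsing is turned off on read (quote set to the empty string) so the raw quote characters reach the regex intact; adjust the options to match your actual file.

    from pyspark.sql import functions as F

    # Read with quote parsing disabled so the raw quotes survive into the column.
    df = (spark.read.format("csv")
          .option("delimiter", "|")
          .option("header", "true")
          .option("quote", "")
          .load("my_path"))

    # Strip only the leading and trailing quotes; inner doubled quotes are kept.
    cleaned = df.withColumn("col1", F.regexp_replace("col1", '^"+|"+$', ""))

    # Write back out; an empty quote string keeps Spark from re-quoting values.
    (cleaned.write.format("csv")
     .mode("overwrite")
     .option("sep", "|")
     .option("header", "true")
     .option("quote", "")
     .save("out_path"))

    If you prefer SQL syntax, Spark's TRIM with a trim string should give the same result for these values:

    cleaned = df.selectExpr("""trim(BOTH '"' FROM col1) AS col1""")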