
How to ignore double quotes when reading CSV file in Pyspark?


I have a CSV file like:


ID, Tags
1,"""ApplicationID "": ""9AAG033396"",""Environment "": ""PROD"",""Remarks"": "" EUC Personal Desktop"""
2,"""ApplicationID "": ""9AAG033396"",""Environment "": ""PROD"",""Remarks"": "" EUC Personal Desktop"""
3,"""Remarks"": "" EUC Personal Desktop"",""ApplicationID "": ""9AAG033396"",""Environment "": ""PROD"""

Sample Tags values are given below.

"cluster": "csp-csp-cluster","kubernetes_node": ""

"Project": "S4","Type": "Dev"

"ApplicationID": "BSN0002278","ClusterId": "0803-071102-1n1kbom","ClusterName": "Riskman Quaitification Cluster","Creator": "[email protected]","DatabricksEnvironment": "workerenv-6764199050889752","Environment": "DEV","Remark": "EDP","RequestNo": "RITM0215195","Solution": "Riskman","Vendor": "Databricks","databricks-instance-name": "e82e5f2df9ae479bbb76766c48bdf3cc","x_Environment": "DEV"

"ApplicationID": "APP0005199","Environment": "Dev","Remark": "Buy@energy","RequestNo": "RITM0270765"

"ApplicationID": "","Environment": "","Remark": "","RequestNo": ""

"APM_ID": "empty","ApplicationID": "Empty","Environment": "PROD","Remark": "Logs","RequestNo": "empty"

I want to read it as a DataFrame in Spark, where the value of every field is exactly as written in the CSV but without the quotes, and I need to create a new column called ApplicationID.

Expected output:


ID, Tags,ApplicationID
1,ApplicationID : 9AAG033396,Environment : PROD,Remarks:  EUC Personal Desktop,9AAG033396
2,ApplicationID : 9AAG033396,Environment : PROD,Remarks:  EUC Personal Desktop,9AAG033396
3,Remarks:  EUC Personal Desktop,ApplicationID : 9AAG033396,Environment : PROD,9AAG033396

The output I get:


ID, Tags,ApplicationID
1,"""ApplicationID "": ""9AAG033396"",9AAG033396
2,"""ApplicationID "": ""9AAG033396"",9AAG033396
3,"""Remarks"": "" EUC Personal Desktop"",null

In PySpark I am reading it like this:

df = spark.read\
  .option("header", "true")\
  .option("inferSchema", "true")\
  .option("delimiter", ",")\
  .option("escapeQuotes", "true")\
  .option("multiLine","true")\
  .option('quote','"')\
.csv(f"wasbs://{container}@{storage_account_name}.blob.core.windows.net/onetimeazamortizecostnew/onetimeazamortizecostnew/20230901-20231031/onetimeazamortizecostnew_17fcdeca-81ca-43e7-b181-36bc379e9644.csv")
df2 = df.withColumn("ApplicationID", when(df.Tags.contains("ApplicationID"), substring('Tags', 23, 10))
                                    .otherwise(''))

I wonder if there is a way to accomplish this using the escape option.


Solution

  • escape is exactly the option you are looking for. Set it to " so that Spark treats a doubled double quote ("") inside a quoted field as an escaped quote.

    df = spark.read\
      .option("header", "true")\
      .option("inferSchema", "true")\
      .option("escape", '"')\
      .csv(f"wasbs://{container}@{storage_account_name}.blob.core.windows.net/onetimeazamortizecostnew/onetimeazamortizecostnew/20230901-20231031/onetimeazamortizecostnew_17fcdeca-81ca-43e7-b181-36bc379e9644.csv")
    +---+----------------------------------------------------------------------------------------+
    |ID | Tags                                                                                   |
    +---+----------------------------------------------------------------------------------------+
    |1  |"ApplicationID ": "9AAG033396","Environment ": "PROD","Remarks": " EUC Personal Desktop"|
    |2  |"ApplicationID ": "9AAG033396","Environment ": "PROD","Remarks": " EUC Personal Desktop"|
    |3  |"Remarks": " EUC Personal Desktop","ApplicationID ": "9AAG033396","Environment ": "PROD"|
    +---+----------------------------------------------------------------------------------------+
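
    For intuition, Python's built-in csv module uses the same doubled-quote convention by default, so a quick local check on one raw line copied from the sample file (a sketch, not part of the Spark job) shows exactly what escape='"' makes Spark do:

    ```python
    import csv
    import io

    # One raw line from the sample CSV: quotes inside the quoted Tags field
    # are doubled ("") rather than backslash-escaped.
    raw = '1,"""ApplicationID "": ""9AAG033396"",""Environment "": ""PROD"""\n'

    # Python's csv module treats "" inside a quoted field as an escaped quote,
    # which is the behavior escape='"' enables in Spark's CSV reader.
    row = next(csv.reader(io.StringIO(raw)))
    print(row[1])  # "ApplicationID ": "9AAG033396","Environment ": "PROD"
    ```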
    

    If you wrap the string in curly braces, it can be parsed as JSON, and then you can pull out ApplicationID easily.

    from pyspark.sql.functions import col, concat, from_json, lit
    from pyspark.sql.types import StructType, StructField, StringType

    schema = StructType([StructField(c, StringType()) for c in ["ApplicationID ", "Environment ", "Remarks"]])
    df.withColumn("Tags", from_json(concat(lit("{"), col(" Tags"), lit("}")), schema))\
      .withColumn("ApplicationID", col("Tags.ApplicationID "))
    +---+-----------------------------------------+-------------+
    |ID |Tags                                     |ApplicationID|
    +---+-----------------------------------------+-------------+
    |1  |{9AAG033396, PROD,  EUC Personal Desktop}|9AAG033396   |
    |2  |{9AAG033396, PROD,  EUC Personal Desktop}|9AAG033396   |
    |3  |{9AAG033396, PROD,  EUC Personal Desktop}|9AAG033396   |
    +---+-----------------------------------------+-------------+
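
    The wrap-in-braces trick can be sanity-checked with plain Python's json module on one of the sample Tags values (note the trailing space in the "ApplicationID " key):

    ```python
    import json

    # A Tags value as it appears after reading the CSV with escape='"'.
    tags = '"ApplicationID ": "9AAG033396","Environment ": "PROD","Remarks": " EUC Personal Desktop"'

    # Wrapping it in curly braces makes it a valid JSON object, which is
    # what concat(lit("{"), col(" Tags"), lit("}")) does row by row.
    parsed = json.loads("{" + tags + "}")
    print(parsed["ApplicationID "])  # 9AAG033396
    ```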
    

    Note: you seem to have weird spaces in your CSV (like in "ApplicationID "). Not sure if that's really how it looks, but I wrote my code examples to match it.
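
    If you'd rather normalize those keys, one option (a local sketch, not part of the original answer) is to strip them after parsing; in Spark the same idea could be wrapped in a UDF:

    ```python
    import json

    def parse_tags(tags):
        # Parse the Tags string as JSON and strip the stray spaces from
        # keys like "ApplicationID " so downstream code can use clean names.
        return {k.strip(): v for k, v in json.loads("{" + tags + "}").items()}

    row = parse_tags('"ApplicationID ": "9AAG033396","Environment ": "PROD"')
    print(row["ApplicationID"])  # 9AAG033396
    ```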