I have a CSV file like:
ID, Tags
1,"""ApplicationID "": ""9AAG033396"",""Environment "": ""PROD"",""Remarks"": "" EUC Personal Desktop"""
2,"""ApplicationID "": ""9AAG033396"",""Environment "": ""PROD"",""Remarks"": "" EUC Personal Desktop"""
3,"""Remarks"": "" EUC Personal Desktop"",""ApplicationID "": ""9AAG033396"",""Environment "": ""PROD"""
Sample Tags values are given below.
"cluster": "csp-csp-cluster","kubernetes_node": ""
"Project": "S4","Type": "Dev"
"ApplicationID": "BSN0002278","ClusterId": "0803-071102-1n1kbom","ClusterName": "Riskman Quaitification Cluster","Creator": "[email protected]","DatabricksEnvironment": "workerenv-6764199050889752","Environment": "DEV","Remark": "EDP","RequestNo": "RITM0215195","Solution": "Riskman","Vendor": "Databricks","databricks-instance-name": "e82e5f2df9ae479bbb76766c48bdf3cc","x_Environment": "DEV"
"ApplicationID": "APP0005199","Environment": "Dev","Remark": "Buy@energy","RequestNo": "RITM0270765"
"ApplicationID": "","Environment": "","Remark": "","RequestNo": ""
"APM_ID": "empty","ApplicationID": "Empty","Environment": "PROD","Remark": "Logs","RequestNo": "empty"
I want to read it as a data frame in spark, where the values of every field are exactly as written in the CSV but without the quotes and need to create new column called ApplicationID.
Expected output:
ID, Tags,ApplicationID
1,ApplicationID : 9AAG033396,Environment : PROD,Remarks: EUC Personal Desktop,9AAG033396
2,ApplicationID : 9AAG033396,Environment : PROD,Remarks: EUC Personal Desktop,9AAG033396
3,Remarks: EUC Personal Desktop,ApplicationID : 9AAG033396,Environment : PROD,9AAG033396
The output i get
ID, Tags,ApplicationID
1,"""ApplicationID "": ""9AAG033396"",9AAG033396
2,"""ApplicationID "": ""9AAG033396"",9AAG033396
3,"""Remarks"": "" EUC Personal Desktop"",null
In pyspark I am reading like this:
df = spark.read\
.option("header", "true")\
.option("inferSchema", "true")\
.option("delimiter", ",")\
.option("escapeQuotes", "true")\
.option("multiLine","true")\
.option('quote','"')\
.csv(f"wasbs://{container}@{storage_account_name}.blob.core.windows.net/onetimeazamortizecostnew/onetimeazamortizecostnew/20230901-20231031/onetimeazamortizecostnew_17fcdeca-81ca-43e7-b181-36bc379e9644.csv")
`df2 = df.withColumn("ApplicationID", when(df.Tags.contains("ApplicationID"),substring('Tags', 23,10))
.otherwise(''))`
I wonder if there is a way to accomplish this using the escape option.
escape
is exactly the option you are looking for. You want to set it to "
so that double quote is quoted as ""
.
df = spark.read
.option("header", "true")
.option("inferSchema", "true")
.option("escape",'"')
.csv(f"wasbs://{container}@{storage_account_name}.blob.core.windows.net/onetimeazamortizecostnew/onetimeazamortizecostnew/20230901-20231031/onetimeazamortizecostnew_17fcdeca-81ca-43e7-b181-36bc379e9644.csv")
+---+----------------------------------------------------------------------------------------+
|ID | Tags |
+---+----------------------------------------------------------------------------------------+
|1 |"ApplicationID ": "9AAG033396","Environment ": "PROD","Remarks": " EUC Personal Desktop"|
|2 |"ApplicationID ": "9AAG033396","Environment ": "PROD","Remarks": " EUC Personal Desktop"|
|3 |"Remarks": " EUC Personal Desktop","ApplicationID ": "9AAG033396","Environment ": "PROD"|
+---+----------------------------------------------------------------------------------------+
If you wrap your string in curly braces it can be parsed as JSON and you would be able to pull ApplicationID easily.
schema = StructType([StructField(c, StringType()) for c in ["ApplicationID ", "Environment ", "Remarks"]])
df.withColumn("Tags", from_json(concat(lit("{"), col(" Tags"), lit("}")), schema))\
.withColumn("ApplicationID", col("Tags.ApplicationID "))
+---+-----------------------------------------+-------------+
|ID |Tags |ApplicationID|
+---+-----------------------------------------+-------------+
|1 |{9AAG033396, PROD, EUC Personal Desktop}|9AAG033396 |
|2 |{9AAG033396, PROD, EUC Personal Desktop}|9AAG033396 |
|3 |{9AAG033396, PROD, EUC Personal Desktop}|9AAG033396 |
+---+-----------------------------------------+-------------+
Note: you seem to have weird spaces in your CSV (like in "ApplicationID "
). Not sure if that's how it looks like, but I wrote my code examples to match it.