azure, pyspark, azure-databricks

Extra backslash before every double quote when collecting DataFrame columns into a single column of another DataFrame using the "struct" function


I have a PySpark DataFrame with many columns, such as ID, Col1, Col2, Col3, Col4. Each of these columns holds string data that looks like this:

ID = 123
Col1 = 1
Col2 = {"value": [-123.456, 789.123], "type": "abc"}
Col3 = [["value", [-123.456, 789.123]], ["type", "abc"]]
Col4 = [{"value": {"va1" : -123.456, "va2" : 789.123}, "type": ["abc", "def"]}, {"va1" : -123.456, "va2" : 789.123}, "type": ["abc", "def"]}]

The values in Col2, Col3, and Col4 are of string data type. I want to combine all of these columns into a single column, like a dictionary, where the keys are Col1, Col2, Col3, Col4 and each key maps to the corresponding column's value.

I am doing this:

new_df = old_df.withColumn("TF", struct(col("Col1"), col("Col2"), col("Col3"), col("Col4")))

The new DataFrame comes out like this:

TF: {
Col1 : 1
Col2 : {\"value\": [-123.456, 789.123], \"type\": \"abc\"}
Col3 = [[\"value\", [-123.456, 789.123]], [\"type\", \"abc\"]]
Col4 = [{\"value\": {\"va1\" : -123.456, \"va2\" : 789.123}, \"type\": [\"abc\", \"def\"]}, {\"va1\" : -123.456, \"va2\" : 789.123}, \"type\": [\"abc\", \"def\"]}]
}

I don't want the backslash character. I just want the following:

TF: {
Col1 : 1
Col2 : {"value": [-123.456, 789.123], "type": "abc"}
Col3 = [["value", [-123.456, 789.123]], ["type", "abc"]]
Col4 = [{"value": {"va1" : -123.456, "va2" : 789.123}, "type": ["abc", "def"]}, {"va1" : -123.456, "va2" : 789.123}, "type": ["abc", "def"]}]
}

I have tried regex but it is not working. How do I remove the backslashes from the data? I have to upload this data to Cosmos DB, and the backslashes show up there as well.


Solution

  • I have tried an example using PySpark:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, struct, regexp_replace, from_json
    from pyspark.sql.types import StringType, StructType

    spark = SparkSession.builder.getOrCreate()

    data = [(123, 1, '{"value": [-123.456, 789.123], "type": "abc"}')]
    columns = ['ID', 'Col1', 'Col2']
    old_df = spark.createDataFrame(data, columns)
    print("Old DataFrame:")
    old_df.show(truncate=False)

    # Remove any literal backslashes from the JSON string.
    old_df = old_df.withColumn("Col2", regexp_replace(col("Col2"), "\\\\", ""))

    # Parse the JSON string into a struct; both fields are read as strings here,
    # so the array in "value" is kept as its raw JSON text.
    schema = StructType().add("value", StringType()).add("type", StringType())
    old_df = old_df.withColumn("Col2", from_json(col("Col2"), schema))

    # Pack Col1 and the parsed Col2 into a single struct column.
    new_df = old_df.withColumn("TF", struct(col("Col1"), col("Col2")))
    print("New DataFrame:")
    new_df.show(truncate=False)
    
    

    Results:

    | ID  | Col1 | Col2                                           |
    |-----|------|------------------------------------------------|
    | 123 | 1    | {"value": "[-123.456,789.123]", "type": "abc"} |
    
    
    | ID  | Col1 | Col2                                           | TF                                                                  |
    |-----|------|------------------------------------------------|---------------------------------------------------------------------|
    | 123 | 1    | {"value": "[-123.456,789.123]", "type": "abc"} | {"Col1": 1, "Col2": {"value": "[-123.456,789.123]", "type": "abc"}} |
    
    
    • You can use the regexp_replace function to eliminate the backslashes
      from the string in the ‘Col2’ column: it replaces every occurrence of
      the given regular expression (here, a literal backslash) with an empty
      string, removing all backslashes from the column. A standalone sketch
      of this step follows below.
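
      A minimal sketch of that step in isolation (the column name ‘raw’ and the sample value are hypothetical):

        from pyspark.sql import SparkSession
        from pyspark.sql.functions import col, regexp_replace

        spark = SparkSession.builder.getOrCreate()

        # Hypothetical sample: the string holds a literal backslash before each quote.
        df = spark.createDataFrame([('{\\"type\\": \\"abc\\"}',)], ['raw'])

        # In a Python string, "\\\\" is the two-character regex \\ ,
        # which matches exactly one literal backslash.
        df.withColumn("raw", regexp_replace(col("raw"), "\\\\", "")).show(truncate=False)
        # +---------------+
        # |raw            |
        # +---------------+
        # |{"type": "abc"}|
        # +---------------+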

    • Defining a schema named ‘schema’ with the StructType class specifies the structure of the nested data. It contains two fields, ‘value’ and ‘type’, both of type StringType, which gives Spark a structured blueprint for organizing the fields that the JSON string contains. A richer variant of this schema is sketched below.
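
      If you want the ‘value’ field parsed as a real array of numbers rather than a string, the schema could declare it as such (a sketch; the element type DoubleType is an assumption based on the sample data):

        from pyspark.sql.types import ArrayType, DoubleType, StringType, StructType

        # Assumed richer schema: "value" parsed as array<double> instead of a string.
        array_schema = (
            StructType()
            .add("value", ArrayType(DoubleType()))
            .add("type", StringType())
        )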

    • To parse the JSON string in the ‘Col2’ column according to the predefined schema, use the ‘from_json’ function. It converts the JSON string into a struct that follows the schema. Because ‘Col2’ then holds a real struct rather than a string, its embedded quotes are no longer escaped when the row is serialized, which is what removes the backslashes. You can confirm the result with printSchema, as sketched below.
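
      For example, after the from_json step in the code above, ‘Col2’ is a struct rather than a string, which you can verify directly (expected output shown as comments, based on the schema defined in the example):

        # After from_json, Col2 is a nested struct, not a plain string.
        old_df.select("Col2").printSchema()
        # root
        #  |-- Col2: struct (nullable = true)
        #  |    |-- value: string (nullable = true)
        #  |    |-- type: string (nullable = true)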

    • Finally, you can add a new column named ‘TF’ to the DataFrame ‘old_df’ with the ‘withColumn’ method, building it as a struct that contains ‘Col1’ and the parsed ‘Col2’. This packs both columns into the single nested column ‘TF’, which can then be written to Cosmos DB without escaped quotes. A serialization check is sketched after this list.
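
      As a final check that the escaping is gone, you can serialize the struct column back to JSON with to_json, which approximates what Cosmos DB receives (a sketch; the exact formatting may differ):

        from pyspark.sql.functions import col, to_json

        # Serialize the nested TF column; the quotes are no longer backslash-escaped.
        new_df.select(to_json(col("TF")).alias("TF_json")).show(truncate=False)
        # {"Col1":1,"Col2":{"value":"[-123.456,789.123]","type":"abc"}}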