
How to convert a string representation of an array into an actual array type in PySpark


I have a column with data coming in as a string representation of an array.


I tried to cast it to an array type, but the data gets modified.


I also tried using a regex to remove the extra brackets, but it's not working.

Attaching the code below. This code is meant to convert the string representation of the array into an actual array:

df = df.withColumn("columns", split(df["columns"], ", "))

This is the regex code I tried:

df = df.withColumn(
    'columns',
    expr("transform(split(columns, ','), x -> trim('\"[]', x))")
)

Would really appreciate any help.


Solution

  • I have tried the following approach:

    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import ArrayType, StringType
    data = [("1", "[\"a\", \"b\", \"c\"]"), ("2", "[\"d\", \"e\", \"f\"]")]
    df = spark.createDataFrame(data, ["id", "columns"])
    array_schema = ArrayType(StringType())
    df = df.withColumn("columns", from_json(col("columns"), array_schema))
    df.show()
    

    In the above code, I defined the schema for the array using ArrayType(StringType()), specifying that the array contains strings.

    Using the withColumn method, combined with from_json, I transformed the "columns" column into an array of strings based on the specified schema.

    Results:

    +---+---------+
    | id|  columns|
    +---+---------+
    |  1|[a, b, c]|
    |  2|[d, e, f]|
    +---+---------+