I have a column whose data arrives as a string representation of an array.
I tried to cast it to an array type, but the data gets modified.
I also tried using a regex to remove the extra brackets, but it's not working.
The code below is my attempt to convert the string representation of the array into an actual array:
df = df.withColumn("columns", split(df["columns"], ", "))
This is the regex code I tried:
df = df.withColumn(
'columns',
expr("transform(split(columns, ','), x -> trim('\"[]', x))")
)
I would really appreciate any help.
I have tried the following approach:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import ArrayType, StringType
data = [("1", "[\"a\", \"b\", \"c\"]"), ("2", "[\"d\", \"e\", \"f\"]")]
df = spark.createDataFrame(data, ["id", "columns"])
array_schema = ArrayType(StringType())
df = df.withColumn("columns", from_json(col("columns"), array_schema))
df.show()
In the above code, I defined the schema for the array using ArrayType(StringType()), which specifies that the array contains strings. Using the withColumn method combined with from_json, I transformed the "columns" column into an array of strings based on that schema.
Results:
+---+---------+
| id|  columns|
+---+---------+
|  1|[a, b, c]|
|  2|[d, e, f]|
+---+---------+
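
To see why the plain split in the question leaves brackets and quotes inside the elements, the two operations can be mimicked in plain Python. This is only an analogy for what split and from_json do to each row, not Spark code, and the variable names (raw, naive, parsed) are made up for the illustration; it also assumes the strings are valid JSON, as in the sample data above.

```python
import json

# One value shaped like the strings in the "columns" column (assumed sample).
raw = '["a", "b", "c"]'

# Splitting on ", " only cuts at the separators, so the brackets and
# double quotes stay attached to the elements.
naive = raw.split(", ")

# Parsing the string as a JSON array strips the brackets and quotes.
parsed = json.loads(raw)

print(naive)   # ['["a"', '"b"', '"c"]']
print(parsed)  # ['a', 'b', 'c']
```

If the incoming strings are not valid JSON (for example, single-quoted elements), this parse fails in plain Python, and from_json would likely return null for those rows, so it is worth checking a few raw values first.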