I have a PySpark column that consists of strings and lists, the datatype of this column is a string, string (nullable = true)
, I want to change it to be only made of lists, and convert anything that isn't a list into NULL
.
I've tried this code:
df.select(F.split(F.col("major_options"),",")).alias('major_options').show()
but it will create a list out of everything instead and not change the existing list datatype, I've also tried using the cast
method however I got this error:
AnalysisException: [DATATYPE_MISMATCH.CAST_WITHOUT_SUGGESTION] Cannot resolve "major_options" due to data type mismatch: cannot cast "STRING" to "ARRAY".
Edit: here is how the column looks:
|['Alloy Wheels'] |
|['Light Package', 'Alloy Wheels', 'Bluetooth'] |
|Steering wheel memory |
|Knee airbag |
|['Sunroof/Moonroof', 'Alloy Wheels', 'Bluetooth', 'Parking Sensors'] |
|['Alloy Wheels', 'Bluetooth'] |
and here is how I want it to look:
|['Alloy Wheels'] |
|['Light Package', 'Alloy Wheels', 'Bluetooth'] |
|NULL |
|NULL |
|['Sunroof/Moonroof', 'Alloy Wheels', 'Bluetooth', 'Parking Sensors'] |
|['Alloy Wheels', 'Bluetooth'] |
You can use regexp_extract for checking if the string contains opening and closing square brackets and then split accordingly as follows:
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, when, col, lit, regexp_extract
spark = SparkSession.builder.getOrCreate()
data = [
("['Alloy Wheels']",),
("['Light Package', 'Alloy Wheels', 'Bluetooth']",),
("Steering wheel memory",),
("Knee airbag",),
("['Sunroof/Moonroof', 'Alloy Wheels', 'Bluetooth', 'Parking Sensors']",),
("['Alloy Wheels', 'Bluetooth']",),
]
df = spark.createDataFrame(data, ["major_options"])
print("original schema:")
df.printSchema()
print("original df:")
df.show(truncate=False)
df = df.withColumn(
"major_options",
when(
regexp_extract(col("major_options"), r"^\[(.*)\]$", 1) != "",
split(regexp_extract(col("major_options"), r"^\[(.*)\]$", 1), ","),
).otherwise(lit(None)),
)
print("modified schema:")
df.printSchema()
print("modified df:")
df.show(truncate=False)
Output:
original schema:
root
|-- major_options: string (nullable = true)
original df:
+--------------------------------------------------------------------+
|major_options |
+--------------------------------------------------------------------+
|['Alloy Wheels'] |
|['Light Package', 'Alloy Wheels', 'Bluetooth'] |
|Steering wheel memory |
|Knee airbag |
|['Sunroof/Moonroof', 'Alloy Wheels', 'Bluetooth', 'Parking Sensors']|
|['Alloy Wheels', 'Bluetooth'] |
+--------------------------------------------------------------------+
modified schema:
root
|-- major_options: array (nullable = true)
| |-- element: string (containsNull = false)
modified df:
+-----------------------------------------------------------------------+
|major_options |
+-----------------------------------------------------------------------+
|['Alloy Wheels'] |
|['Light Package', 'Alloy Wheels', 'Bluetooth'] |
|NULL |
|NULL |
|['Sunroof/Moonroof', 'Alloy Wheels', 'Bluetooth', 'Parking Sensors']|
|['Alloy Wheels', 'Bluetooth'] |
+-----------------------------------------------------------------------+