Convert PySpark column from strings to lists

I have a PySpark column that consists of strings and lists, the datatype of this column is a string, string (nullable = true), I want to change it to be only made of lists, and convert anything that isn't a list into NULL.

I've tried this code:

df.select(F.split(F.col("major_options"),",")).alias('major_options').show()

but it will create a list out of everything instead and not change the existing list datatype, I've also tried using the cast method however I got this error:

AnalysisException: [DATATYPE_MISMATCH.CAST_WITHOUT_SUGGESTION] Cannot resolve "major_options" due to data type mismatch: cannot cast "STRING" to "ARRAY".

Edit: here is how the column looks:

|['Alloy Wheels']                                                        |
|['Light Package', 'Alloy Wheels', 'Bluetooth']                          |
|Steering wheel memory                                                   |
|Knee airbag                                                             |
|['Sunroof/Moonroof', 'Alloy Wheels', 'Bluetooth', 'Parking Sensors']    |
|['Alloy Wheels', 'Bluetooth']                                           |

and here is how I want it to look:

|['Alloy Wheels']                                                        |
|['Light Package', 'Alloy Wheels', 'Bluetooth']                          |
|NULL                                                                    |
|NULL                                                                    |
|['Sunroof/Moonroof', 'Alloy Wheels', 'Bluetooth', 'Parking Sensors']    |
|['Alloy Wheels', 'Bluetooth']                                           |

Solution

You can use regexp_extract for checking if the string contains opening and closing square brackets and then split accordingly as follows:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, when, col, lit, regexp_extract

spark = SparkSession.builder.getOrCreate()

data = [
    ("['Alloy Wheels']",),
    ("['Light Package', 'Alloy Wheels', 'Bluetooth']",),
    ("Steering wheel memory",),
    ("Knee airbag",),
    ("['Sunroof/Moonroof', 'Alloy Wheels', 'Bluetooth', 'Parking Sensors']",),
    ("['Alloy Wheels', 'Bluetooth']",),
]

df = spark.createDataFrame(data, ["major_options"])

print("original schema:")
df.printSchema()

print("original df:")
df.show(truncate=False)

df = df.withColumn(
    "major_options",
    when(
        regexp_extract(col("major_options"), r"^\[(.*)\]$", 1) != "",
        split(regexp_extract(col("major_options"), r"^\[(.*)\]$", 1), ","),
    ).otherwise(lit(None)),
)

print("modified schema:")
df.printSchema()

print("modified df:")
df.show(truncate=False)

Output:

original schema:
root
 |-- major_options: string (nullable = true)

original df:
+--------------------------------------------------------------------+
|major_options                                                       |
+--------------------------------------------------------------------+
|['Alloy Wheels']                                                    |
|['Light Package', 'Alloy Wheels', 'Bluetooth']                      |
|Steering wheel memory                                               |
|Knee airbag                                                         |
|['Sunroof/Moonroof', 'Alloy Wheels', 'Bluetooth', 'Parking Sensors']|
|['Alloy Wheels', 'Bluetooth']                                       |
+--------------------------------------------------------------------+

modified schema:
root
 |-- major_options: array (nullable = true)
 |    |-- element: string (containsNull = false)

modified df:
+-----------------------------------------------------------------------+
|major_options                                                          |
+-----------------------------------------------------------------------+
|['Alloy Wheels']                                                       |
|['Light Package',  'Alloy Wheels',  'Bluetooth']                       |
|NULL                                                                   |
|NULL                                                                   |
|['Sunroof/Moonroof',  'Alloy Wheels',  'Bluetooth',  'Parking Sensors']|
|['Alloy Wheels',  'Bluetooth']                                         |
+-----------------------------------------------------------------------+