Search code examples
pyspark

Convert PySpark column from strings to lists


I have a PySpark column that consists of strings and lists, the datatype of this column is a string, string (nullable = true), I want to change it to be only made of lists, and convert anything that isn't a list into NULL.

I've tried this code:

df.select(F.split(F.col("major_options"),",")).alias('major_options').show()

but it will create a list out of everything instead and not change the existing list datatype, I've also tried using the cast method however I got this error:

AnalysisException: [DATATYPE_MISMATCH.CAST_WITHOUT_SUGGESTION] Cannot resolve "major_options" due to data type mismatch: cannot cast "STRING" to "ARRAY".

Edit: here is how the column looks:

|['Alloy Wheels']                                                        |
|['Light Package', 'Alloy Wheels', 'Bluetooth']                          |
|Steering wheel memory                                                   |
|Knee airbag                                                             |
|['Sunroof/Moonroof', 'Alloy Wheels', 'Bluetooth', 'Parking Sensors']    |
|['Alloy Wheels', 'Bluetooth']                                           |

and here is how I want it to look:

|['Alloy Wheels']                                                        |
|['Light Package', 'Alloy Wheels', 'Bluetooth']                          |
|NULL                                                                    |
|NULL                                                                    |
|['Sunroof/Moonroof', 'Alloy Wheels', 'Bluetooth', 'Parking Sensors']    |
|['Alloy Wheels', 'Bluetooth']                                           |


Solution

  • You can use regexp_extract for checking if the string contains opening and closing square brackets and then split accordingly as follows:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, when, col, lit, regexp_extract
    
    spark = SparkSession.builder.getOrCreate()
    
    data = [
        ("['Alloy Wheels']",),
        ("['Light Package', 'Alloy Wheels', 'Bluetooth']",),
        ("Steering wheel memory",),
        ("Knee airbag",),
        ("['Sunroof/Moonroof', 'Alloy Wheels', 'Bluetooth', 'Parking Sensors']",),
        ("['Alloy Wheels', 'Bluetooth']",),
    ]
    
    df = spark.createDataFrame(data, ["major_options"])
    
    print("original schema:")
    df.printSchema()
    
    print("original df:")
    df.show(truncate=False)
    
    df = df.withColumn(
        "major_options",
        when(
            regexp_extract(col("major_options"), r"^\[(.*)\]$", 1) != "",
            split(regexp_extract(col("major_options"), r"^\[(.*)\]$", 1), ","),
        ).otherwise(lit(None)),
    )
    
    print("modified schema:")
    df.printSchema()
    
    print("modified df:")
    df.show(truncate=False)
    

    Output:

    original schema:
    root
     |-- major_options: string (nullable = true)
    
    original df:
    +--------------------------------------------------------------------+
    |major_options                                                       |
    +--------------------------------------------------------------------+
    |['Alloy Wheels']                                                    |
    |['Light Package', 'Alloy Wheels', 'Bluetooth']                      |
    |Steering wheel memory                                               |
    |Knee airbag                                                         |
    |['Sunroof/Moonroof', 'Alloy Wheels', 'Bluetooth', 'Parking Sensors']|
    |['Alloy Wheels', 'Bluetooth']                                       |
    +--------------------------------------------------------------------+
    
    modified schema:
    root
     |-- major_options: array (nullable = true)
     |    |-- element: string (containsNull = false)
    
    modified df:
    +-----------------------------------------------------------------------+
    |major_options                                                          |
    +-----------------------------------------------------------------------+
    |['Alloy Wheels']                                                       |
    |['Light Package',  'Alloy Wheels',  'Bluetooth']                       |
    |NULL                                                                   |
    |NULL                                                                   |
    |['Sunroof/Moonroof',  'Alloy Wheels',  'Bluetooth',  'Parking Sensors']|
    |['Alloy Wheels',  'Bluetooth']                                         |
    +-----------------------------------------------------------------------+