Tags: apache-spark, pyspark, azure-databricks

PySpark - Converting String to Array


I have a DataFrame that has string values in a column where I should have an array.


from pyspark.sql import functions as F

alg_mappings = {
    ('Full Cover', 40): [['base,permitted_usage'], ['si_mv'], ['suburb']]  # Add more values as needed
}

default_value = None

def get_alg_value(sub_class, version_number):
    return alg_mappings.get((sub_class, version_number), default_value)

get_alg_value_udf = F.udf(get_alg_value)

df_with_alg = df.withColumn("alg", get_alg_value_udf(F.col("sub_class"), F.col("version")))

The alg column comes back as a string, but I want it to be an array with the exact format of

[['base,permitted_usage'],['si_mv'],['suburb']]

I will be adding more elements to it, so it could grow to 25+ entries, and I will be adding more keys as well. Hence, I need the most efficient way to convert it into an array.
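
For reference, this is roughly what I observe at the moment (a minimal sketch assuming a SparkSession and a hypothetical sample DataFrame with sub_class and version columns, reusing the UDF defined above): the alg column ends up typed as a plain string.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample rows just to illustrate the problem
df = spark.createDataFrame([("Full Cover", 40)], ["sub_class", "version"])

df_with_alg = df.withColumn("alg", get_alg_value_udf(F.col("sub_class"), F.col("version")))
df_with_alg.printSchema()
# root
#  |-- sub_class: string (nullable = true)
#  |-- version: long (nullable = true)
#  |-- alg: string (nullable = true)   <-- string, not an array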


Solution

  • I suggest you use a decorator to specify the output data type of the UDF. The default return type is string, so you get a string representation of the output.


    Output as a list of strings

    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    @udf(ArrayType(StringType()))
    def get_alg_value(sub_class, version_number):
        return alg_mappings.get((sub_class, version_number), default_value)

    Output as a list of lists of strings

    @udf(ArrayType(ArrayType(StringType())))
    def get_alg_value(sub_class, version_number):
        return alg_mappings.get((sub_class, version_number), default_value)
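
    As a quick check, here is a minimal sketch (assuming the same alg_mappings / default_value and a hypothetical sample DataFrame) that applies the nested-array variant; printSchema now reports alg as an array of arrays of strings rather than a string:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    spark = SparkSession.builder.getOrCreate()

    alg_mappings = {
        ('Full Cover', 40): [['base,permitted_usage'], ['si_mv'], ['suburb']]
    }
    default_value = None

    @udf(ArrayType(ArrayType(StringType())))
    def get_alg_value(sub_class, version_number):
        return alg_mappings.get((sub_class, version_number), default_value)

    # Hypothetical sample rows just for illustration
    df = spark.createDataFrame([("Full Cover", 40)], ["sub_class", "version"])

    df_with_alg = df.withColumn("alg", get_alg_value(F.col("sub_class"), F.col("version")))

    df_with_alg.printSchema()
    # root
    #  |-- sub_class: string (nullable = true)
    #  |-- version: long (nullable = true)
    #  |-- alg: array (nullable = true)
    #  |    |-- element: array (containsNull = true)
    #  |    |    |-- element: string (containsNull = true)

    df_with_alg.show(truncate=False)
    # The matched row displays roughly as:
    # |Full Cover|40|[[base,permitted_usage], [si_mv], [suburb]]|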