Tags: python-3.x, pyspark, azure-databricks

splitting array columns


table1:

rf_1           rf_2
['abc','a']    ['abce','cde']
['abc','cde']  ['1','2']

How can I split all columns that start with rf into individual columns?

Final output should look like this:

rf_1a  rf_1b  rf_2a   rf_2b
'abc'  'a'    'abce'  'cde'
'abc'  'cde'  1       2

Solution

  • I have tried the example below in PySpark:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    spark = SparkSession.builder.getOrCreate()

    dilip_sample_data = [("['abc','a']", "['abce','cde']"),
                         ("['abc','cde']", "['1','2']")]
    initial_df = spark.createDataFrame(dilip_sample_data, ["rf_1", "rf_2"])

    # Parse a bracketed string like "['abc','a']" into its elements.
    def extract_elements(s):
        items = s.strip("[]").replace("'", "").split(",")
        result = []
        for x in items:
            x = x.strip()
            if x.isdigit():
                result.append(x)          # keep numeric values unquoted
            else:
                result.append(f"'{x}'")   # re-quote text values
        return result

    extract_elements_udf = udf(extract_elements, ArrayType(StringType()))

    dilip_final_df = initial_df.withColumn("rf_1a", extract_elements_udf("rf_1")[0]) \
        .withColumn("rf_1b", extract_elements_udf("rf_1")[1]) \
        .withColumn("rf_2a", extract_elements_udf("rf_2")[0]) \
        .withColumn("rf_2b", extract_elements_udf("rf_2")[1])

    dilip_final_df.select("rf_1a", "rf_1b", "rf_2a", "rf_2b").show(truncate=False)
    
    +-----+-----+------+-----+
    |rf_1a|rf_1b|rf_2a |rf_2b|
    +-----+-----+------+-----+
    |'abc'|'a'  |'abce'|'cde'|
    |'abc'|'cde'|1     |2    |
    +-----+-----+------+-----+
    
    • Define the sample data and create the initial DataFrame

    • Define a function that extracts the elements from the bracketed string

    • Wrap extract_elements as a user-defined function (UDF)

    • Apply the UDF to the DataFrame to create the new columns
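Since the question asks to split every column starting with rf, the per-column withColumn calls above can be generalized. Here is a minimal sketch of the column-expansion plan in plain Python (split_plan is a hypothetical helper name; the commented lines show how it would drive the same UDF):

```python
# Hypothetical helper: given a DataFrame's column names, plan the new
# columns for every source column whose name starts with "rf_".
def split_plan(columns, n_elements=2, prefix="rf_"):
    suffixes = "abcdefghijklmnopqrstuvwxyz"
    plan = []
    for col in columns:
        if col.startswith(prefix):
            for i in range(n_elements):
                # (source column, new column name, element index)
                plan.append((col, f"{col}{suffixes[i]}", i))
    return plan

# The plan can then drive the same UDF shown above, e.g.:
#   for src, new, i in split_plan(initial_df.columns):
#       initial_df = initial_df.withColumn(new, extract_elements_udf(src)[i])

print(split_plan(["rf_1", "rf_2", "other"]))
```

This assumes a fixed number of elements per array; for ragged arrays you would first compute the maximum length per column.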