Search code examples
pythonpysparkdatabricksazure-databricks

Databricks: extract value between " " in an array


I have a column called "trigger_piece", it is organized in below way: ["arc", "erf", "erear"]

I'm looking to create new columns (in this case 3 new columns eg: trigger_1, trigger_2, trigger_3) that would extract "arc", "era", "erear".

I have tried using str.split, however it seems like that could be a pretty manual process since we don't know how many trigger extractions there can be. In above example we have 3 but it's also possible to have 4 or 5.


Solution

  • you can use below code.

    Below are the data having 3,4 and 5 variable lengths of array.

    enter image description here

    Code:

    max_triggers = df.selectExpr("size(trigger_piece)").agg({"size(trigger_piece)": "max"}).collect()[0]["max(size(trigger_piece))"]
    print(max_triggers)
    exprs = [col("trigger_piece")[i].alias(f"trigger_{i+1}") for i in  range(max_triggers)]
    result_df = df.select("*", *exprs)
    display(result_df)
    

    Output:

    enter image description here

    Whatever the size it creates new column with it, and for lesser triggers it assigns nulls.