Tags: arrays, scala, dataframe, apache-spark, apache-spark-2.3

How to convert array of array (string type) to struct - Spark/Scala?


I have a dataframe as follows:

+---------------------------------------------------------------+---+
|family_name                                                    |id |
+---------------------------------------------------------------+---+
|[[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]|id1|
|[[Tom, Riddle, Single, 888-888-8888]]                          |id2|
+---------------------------------------------------------------+---+
root
 |-- family_name: string (nullable = true)
 |-- id: string (nullable = true)

I wish to convert the column family_name to an array of named structs, as

`family_name` array<struct<f_name:string,l_name:string,status:string,ph_no:string>>

I'm able to convert family_name to an array as shown below:

import org.apache.spark.sql.functions.regexp_replace
import org.apache.spark.sql.types.{ArrayType, StringType}
import spark.implicits._ // for the $"..." column syntax

val sch = ArrayType(ArrayType(StringType))

val fam_array = data
  .withColumn("family_name_clean", regexp_replace($"family_name", "\\[\\[", "["))
  .withColumn("family_name_clean_clean1", regexp_replace($"family_name_clean", "\\]\\]", "]"))
  .withColumn("ar", toArray($"family_name_clean_clean1")) // toArray: a UDF (defined elsewhere) that splits the string into array<string>
  //.withColumn("ar1", from_json($"ar", sch))

fam_array.show(false)
fam_array.printSchema()

+---------------------------------------------------------------+---+--------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
|family_name                                                    |id |family_name_clean                                             |family_name_clean_clean1                                     |ar                                                                     |
+---------------------------------------------------------------+---+--------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
|[[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]|id1|[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]|[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]|[[John,  Doe,  Married,  999-999-9999], [Jane,  Doe,  Married, Wife, ]]|
|[[Tom, Riddle, Single, 888-888-8888]]                          |id2|[Tom, Riddle, Single, 888-888-8888]]                          |[Tom, Riddle, Single, 888-888-8888]                          |[[Tom,  Riddle,  Single,  888-888-8888]]                               |
+---------------------------------------------------------------+---+--------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
root
 |-- family_name: string (nullable = true)
 |-- id: string (nullable = true)
 |-- family_name_clean: string (nullable = true)
 |-- family_name_clean_clean1: string (nullable = true)
 |-- ar: array (nullable = true)
 |    |-- element: string (containsNull = true)

 

sch is a schema variable of the desired type.

How do I convert the column ar to array<struct<>>?
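
For reference, the desired type written out as a Spark schema would look like this (targetSchema is just an illustrative name):

import org.apache.spark.sql.types._

// array<struct<f_name:string,l_name:string,status:string,ph_no:string>>
val targetSchema = ArrayType(
  StructType(Seq(
    StructField("f_name", StringType),
    StructField("l_name", StringType),
    StructField("status", StringType),
    StructField("ph_no", StringType)
  ))
)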

EDIT:

I'm using Spark 2.3.2


Solution

  • To create an array of structs from an array of arrays of strings, you can use the struct function to build a struct from a list of columns, combined with the element_at function to extract the element at a given index of an array.

    To solve your specific problem, as you correctly stated, you need to do two things:

    • First, transform your string into an array of arrays of strings
    • Then, use this array of arrays of strings to build your structs

    In Spark 3.0 and greater

    Using Spark 3.0, we can perform all those steps with Spark built-in functions.

    For the first step, I would do as follows:

    • First, remove [[ and ]] from the family_name string using the regexp_replace function
    • Then, create the first array level by splitting this string with the split function
    • Then, create the second array level by splitting each element of the previous array using the transform and split functions (this first step is sketched in isolation below)
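
    A minimal sketch of this first step in isolation (nested is just an illustrative column name), producing an intermediate array<array<string>> column:

    import org.apache.spark.sql.functions.{col, regexp_replace, split, transform}
    
    // String -> array<array<string>>: strip the outer brackets, split on
    // "],[" to get one raw string per person, then split each of those on ","
    val step1 = data.withColumn(
      "nested",
      transform(
        split(regexp_replace(col("family_name"), "\\[\\[|]]", ""), "],\\["),
        x => split(x, ",")
      )
    )
    // step1.printSchema() => nested: array (element: array<string>)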

    And for the second step, use the struct function to build each struct, picking elements from the inner arrays with the element_at function.

    Thus, the complete code for Spark 3.0 and greater would be as follows, with data as the input dataframe:

    import org.apache.spark.sql.functions.{col, element_at, regexp_replace, split, struct, transform}
    
    val result = data
      .withColumn(
        "family_name", 
        transform( 
          split( // first level split
            regexp_replace(col("family_name"), "\\[\\[|]]", ""), // remove [[ and ]]
            "],\\["
          ), 
          x => split(x, ",") // split for each element in first level array
        )
      )
      .withColumn("family_name", transform(col("family_name"), x => struct(
        element_at(x, 1).as("f_name"), // index starts at 1
        element_at(x, 2).as("l_name"),
        element_at(x, 3).as("status"),
        element_at(x, -1).as("ph_no") // get the last element of the array
      )))
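
    Note that splitting on "," keeps the leading spaces from the source string (visible as double spaces in the result below). If you want them removed, one option (not part of the question's requirements) is to trim each field afterwards, for example:

    import org.apache.spark.sql.functions.{col, struct, transform, trim}
    
    // Optional variation: trim whitespace from every field of every struct.
    // resultTrimmed is an illustrative name; field access uses getField.
    val resultTrimmed = result.withColumn("family_name",
      transform(col("family_name"), s => struct(
        trim(s.getField("f_name")).as("f_name"),
        trim(s.getField("l_name")).as("l_name"),
        trim(s.getField("status")).as("status"),
        trim(s.getField("ph_no")).as("ph_no")
      )))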
    

    In Spark 2.X

    Using Spark 2.X, we have to rely on a user-defined function. First, we define a case class that represents our struct:

    case class FamilyName(
      f_name: String, 
      l_name: String, 
      status: String, 
      ph_no: String
    )
    

    Then, we define our user-defined function and apply it to our input dataframe:

    import org.apache.spark.sql.functions.{col, udf}
    
    val extract_array = udf((familyName: String) => familyName
      .replaceAll("\\[\\[|]]", "")         // remove [[ and ]]
      .split("],\\[")                      // one raw string per family member
      .map(member => {
        val fields = member.split(",", -1) // -1 keeps trailing empty strings
        FamilyName(
          f_name = fields(0),
          l_name = fields(1),
          status = fields(2),
          ph_no = fields(fields.length - 1) // the phone number is always the last element
        )
      })
    )
    
    val result = data.withColumn("family_name", extract_array(col("family_name")))
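
    Because the parsing logic here is plain Scala, it can be sanity-checked without a SparkSession; for the first input row the lambda body behaves as follows (a rough illustration):

    // Plain-Scala check of the parsing logic, no Spark needed:
    val sample = "[[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]"
    val parsed = sample
      .replaceAll("\\[\\[|]]", "")
      .split("],\\[")
      .map(_.split(",", -1))
    // parsed(0) => Array("John", " Doe", " Married", " 999-999-9999")
    // parsed(1) => Array("Jane", " Doe", " Married", "Wife", "")
    // ph_no takes the last element, hence "" for the second person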
    

    Result

    If you have the following data dataframe:

    +---------------------------------------------------------------+---+
    |family_name                                                    |id |
    +---------------------------------------------------------------+---+
    |[[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]|id1|
    |[[Tom, Riddle, Single, 888-888-8888]]                          |id2|
    +---------------------------------------------------------------+---+
    

    You get the following result dataframe:

    +-----------------------------------------------------------------+---+
    |family_name                                                      |id |
    +-----------------------------------------------------------------+---+
    |[{John,  Doe,  Married,  999-999-9999}, {Jane,  Doe,  Married, }]|id1|
    |[{Tom,  Riddle,  Single,  888-888-8888}]                         |id2|
    +-----------------------------------------------------------------+---+
    

    having the following schema:

    root
     |-- family_name: array (nullable = true)
     |    |-- element: struct (containsNull = false)
     |    |    |-- f_name: string (nullable = true)
     |    |    |-- l_name: string (nullable = true)
     |    |    |-- status: string (nullable = true)
     |    |    |-- ph_no: string (nullable = true)
     |-- id: string (nullable = true)
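
    Once family_name is a proper array of structs, the individual fields become directly addressable; for example, a possible follow-up flattening step (illustrative, not part of the question):

    import org.apache.spark.sql.functions.{col, explode}
    
    // One row per family member, with named struct fields:
    result
      .select(col("id"), explode(col("family_name")).as("member"))
      .select(col("id"), col("member.f_name"), col("member.ph_no"))
      .show(false)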