
Iterating over each element of a DataFrame loaded from a Hive table, applying an operation, and writing the result to a DataFrame, RDD, or list


I have a DF with input data as below:

+----+--------+
|col1|    col2|
+----+--------+
| abc|2E2J2K2F|
| bcd|    2K3D|
+----+--------+

My expected output is:

+----+----+
|col1|col2|
+----+----+
| abc|  2E|
| abc|  2J|
| abc|  2K|
| abc|  2F|
| bcd|  2K|
| bcd|  3D|
+----+----+

Solution

  • Use a udf() to split the string into an array of two-character pieces, then explode() that array into one row per piece:

    scala>  val df = Seq(("abc","2E2J2K2F"),("bcd","2K3D")).toDF("col1","col2")
    df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
    
    scala> def split2(x:String):Array[String] = x.sliding(2,2).toArray
    split2: (x: String)Array[String]
    
    scala> val myudf_split2 = udf ( split2(_:String):Array[String] )
    myudf_split2: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StringType,true),Some(List(StringType)))
    
    scala> df.withColumn("newcol",explode(myudf_split2('col2))).select("col1","newcol").show
    +----+------+
    |col1|newcol|
    +----+------+
    | abc|    2E|
    | abc|    2J|
    | abc|    2K|
    | abc|    2F|
    | bcd|    2K|
    | bcd|    3D|
    +----+------+
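
Since the question also mentions RDDs and lists, here is a minimal sketch of the same split done with a typed Dataset flatMap instead of a UDF. It is an alternative under stated assumptions, not part of the original answer: Spark is assumed to be on the classpath, and the names pairs, resultDf, resultRdd and resultList are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumes Spark is available; in spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder().master("local[*]").appName("split-pairs").getOrCreate()
import spark.implicits._

val df = Seq(("abc", "2E2J2K2F"), ("bcd", "2K3D")).toDF("col1", "col2")

// View each row as a (col1, col2) tuple, then emit one tuple per two-character piece.
val pairs = df.as[(String, String)]
  .flatMap { case (key, codes) => codes.grouped(2).map(piece => (key, piece)) }

val resultDf = pairs.toDF("col1", "col2") // back to a DataFrame
val resultRdd = pairs.rdd                 // RDD[(String, String)]
val resultList = pairs.collect().toList   // local List; only sensible for small results

resultDf.show()
```

Rows whose col2 is null would fail inside the lambda, so they would need to be filtered out first.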
    
    

    Update:

    split2() simply cuts the string into consecutive two-character pieces and returns them as an array. explode() then emits one output row per array element, repeating the other columns of the original row (here col1) alongside each element.

    scala> def split2(x:String):Array[String] = x.sliding(2,2).toArray
    split2: (x: String)Array[String]
    
    scala> split2("12345678")
    res168: Array[String] = Array(12, 34, 56, 78)
    
    scala> "12345678".sliding(4,4).toArray
    res171: Array[String] = Array(1234, 5678)
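
To make the split-then-explode behaviour concrete without Spark, the sketch below models it on plain Scala collections, where explode corresponds roughly to flatMap. The helper names chunk and explodeRows are hypothetical and exist only for this illustration.

```scala
// Plain-collections model of the UDF + explode pipeline (no Spark needed).
// chunk and explodeRows are hypothetical helper names used only here.
def chunk(s: String, size: Int): List[String] =
  s.grouped(size).toList // yields the same pieces as s.sliding(size, size)

def explodeRows(rows: Seq[(String, String)], size: Int): Seq[(String, String)] =
  rows.flatMap { case (key, value) => chunk(value, size).map(piece => (key, piece)) }

val input = Seq(("abc", "2E2J2K2F"), ("bcd", "2K3D"))
val exploded = explodeRows(input, 2)
// exploded: (abc,2E), (abc,2J), (abc,2K), (abc,2F), (bcd,2K), (bcd,3D)

// An odd-length value leaves a shorter final piece rather than failing.
val ragged = chunk("2E2J2", 2) // List("2E", "2J", "2")
```

If every piece must be exactly two characters long, odd-length values would need to be validated or filtered before splitting.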