I have a DataFrame with the input data below:
+----+--------+
|col1|    col2|
+----+--------+
| abc|2E2J2K2F|
| bcd|    2K3D|
+----+--------+
My expected output is:
+----+----+
|col1|col2|
+----+----+
| abc|  2E|
| abc|  2J|
| abc|  2K|
| abc|  2F|
| bcd|  2K|
| bcd|  3D|
+----+----+
Use a udf() to split the string and then explode the resulting array. Check this out:
scala> val df = Seq(("abc","2E2J2K2F"),("bcd","2K3D")).toDF("col1","col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala> def split2(x:String):Array[String] = x.sliding(2,2).toArray
split2: (x: String)Array[String]
scala> val myudf_split2 = udf ( split2(_:String):Array[String] )
myudf_split2: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StringType,true),Some(List(StringType)))
scala> df.withColumn("newcol",explode(myudf_split2('col2))).select("col1","newcol").show
+----+------+
|col1|newcol|
+----+------+
| abc| 2E|
| abc| 2J|
| abc| 2K|
| abc| 2F|
| bcd| 2K|
| bcd| 3D|
+----+------+
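If you want the same thing outside the REPL, here is a minimal self-contained sketch of the same approach, assuming Spark 2.x+. The object name and SparkSession setup are just for illustration, and the UDF is defined inline instead of through a named method:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, udf}

object SplitAndExplode {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("split-and-explode").getOrCreate()
    import spark.implicits._

    val df = Seq(("abc", "2E2J2K2F"), ("bcd", "2K3D")).toDF("col1", "col2")

    // Same logic as split2 above: chop the string into 2-character chunks.
    val split2Udf = udf((s: String) => s.sliding(2, 2).toArray)

    df.withColumn("newcol", explode(split2Udf($"col2")))
      .select("col1", "newcol")
      .show()

    spark.stop()
  }
}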
Update:
split2() just splits the string into chunks of 2 characters each and returns them as an array. The explode function then duplicates the row once per array element, so each chunk ends up on its own row alongside the original col1 value (a small explode-only sketch follows the sliding examples below).
scala> def split2(x:String):Array[String] = x.sliding(2,2).toArray
split2: (x: String)Array[String]
scala> split2("12345678")
res168: Array[String] = Array(12, 34, 56, 78)
scala> "12345678".sliding(4,4).toArray
res171: Array[String] = Array(1234, 5678)
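To see the explode half on its own, here is a minimal sketch (the demo data and the arr/newcol column names are just for illustration) that builds an array column directly and lets explode fan it out:

// Assumes the implicits from the spark-shell session above are in scope.
import org.apache.spark.sql.functions.explode

val demo = Seq(("abc", Array("2E", "2J", "2K", "2F")),
               ("bcd", Array("2K", "3D"))).toDF("col1", "arr")

// explode emits one output row per array element and repeats the other columns,
// so "abc" comes out four times and "bcd" twice.
demo.withColumn("newcol", explode('arr)).select("col1", "newcol").show()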