I have a DataFrame with the input data below:
+----+--------+
|col1|    col2|
+----+--------+
| abc|2E2J2K2F|
| bcd|    2K3D|
+----+--------+
My expected output is:
+----+----+
|col1|col2|
+----+----+
| abc|  2E|
| abc|  2J|
| abc|  2K|
| abc|  2F|
| bcd|  2K|
| bcd|  3D|
+----+----+
Use a udf() to split the string and then explode the resulting array. Check this out:
scala> val df = Seq(("abc","2E2J2K2F"),("bcd","2K3D")).toDF("col1","col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala> def split2(x:String):Array[String] = x.sliding(2,2).toArray
split2: (x: String)Array[String]
scala> val myudf_split2 = udf ( split2(_:String):Array[String] )
myudf_split2: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StringType,true),Some(List(StringType)))
scala> df.withColumn("newcol",explode(myudf_split2('col2))).select("col1","newcol").show
+----+------+
|col1|newcol|
+----+------+
| abc| 2E|
| abc| 2J|
| abc| 2K|
| abc| 2F|
| bcd| 2K|
| bcd| 3D|
+----+------+
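If you want the same thing outside the REPL, here is a minimal self-contained sketch of the same approach, assuming Spark 2.x+. The object name and SparkSession setup are just for illustration, and the UDF is defined inline instead of through a named method:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, udf}

object SplitAndExplode {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("split-and-explode").getOrCreate()
    import spark.implicits._

    val df = Seq(("abc", "2E2J2K2F"), ("bcd", "2K3D")).toDF("col1", "col2")

    // Same logic as split2 above: chop the string into 2-character chunks.
    val split2Udf = udf((s: String) => s.sliding(2, 2).toArray)

    df.withColumn("newcol", explode(split2Udf($"col2")))
      .select("col1", "newcol")
      .show()

    spark.stop()
  }
}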
Update:
split2() just splits the string into chunks of 2 characters each and returns them as an array. The explode function then duplicates the row once per array element, so each chunk ends up on its own row alongside the original col1 value (a small explode-only sketch follows the sliding examples below).
scala> def split2(x:String):Array[String] = x.sliding(2,2).toArray
split2: (x: String)Array[String]
scala> split2("12345678")
res168: Array[String] = Array(12, 34, 56, 78)
scala> "12345678".sliding(4,4).toArray
res171: Array[String] = Array(1234, 5678)
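To see the explode half on its own, here is a minimal sketch (the demo data and the arr/newcol column names are just for illustration) that builds an array column directly and lets explode fan it out:

// Assumes the implicits from the spark-shell session above are in scope.
import org.apache.spark.sql.functions.explode

val demo = Seq(("abc", Array("2E", "2J", "2K", "2F")),
               ("bcd", Array("2K", "3D"))).toDF("col1", "arr")

// explode emits one output row per array element and repeats the other columns,
// so "abc" comes out four times and "bcd" twice.
demo.withColumn("newcol", explode('arr)).select("col1", "newcol").show()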