I have a dataframe like below.
+-------+------+-------+-------+
| num1 | num2 | x | y |
+-------+------+-------+-------+
| 25 | 10 | a&c | i&j&k |
| 35 | 15 | a&b&d | i&k |
+-------+------+-------+-------+
I have another data frame structure with the headers like,
num1, num2, a, b, c, d, i, j, k
I want to split the column data of x and y from the symbol "&". Then check whether the split data are matching with the headers above, also considering the columns num1 and num2. If it so fill the values with 1 else with 0.
The required output is:
+-------+------+---+---+---+---+---+---+---+
| num1 | num2 | a | b | c | d | i | j | k |
+-------+------+---+---+---+---+---+---+---+
| 25 | 10 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
| 35 | 15 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
+-------+------+---+---+---+---+---+---+---+
I have achieved the above output in a method like following. I created another data frame same like the first data frame but the x and y contains with an array of split data like following.
+------+-------+---------+---------+
| num1 | num2 | x | y |
+------+-------+---------+---------+
| 25 | 10 | [a,c] | [i,j,k] |
| 35 | 15 | [a,b,d] | [i,k] |
+------+-------+---------+---------+
Then followed the solution in this question
Although it gives me the exact solution, it is ineffective when it comes to the case where there are lot of columns like x and y.
So now I want to create a case class and match the header values with the data in x,y columns by splitting them to a list. Is it possible or is there any other solution? Can someone help me?
After trying several methods at last I came up with the following solution. I found my solution by adding some few changes to the answer for this question: Compare rows of an array column with the headers of another data frame using Scala and Spark. It worked for multiple array columns also. This is the code for it.
val df = Seq((25, 10, "a&c", "i&j&k"), (35, 15, "a&b&d", "i&k")
.toDF("num1", "num2", "x", "y")
val dfProcessed = df.withColumn("x", split($"x", "&"))
.withColumn("y", split($"y", "&"))
.select("num1", "num2", "x", "y")
val headers = Seq("a", "b", "c", "d", "i", "j", "k")
val report = dfProcessed.select(Seq("num1", "num2").map(col) ++ headers.map(line => array_contains('x, line)
|| array_contains('y, line) as line) : _*)
report.show()
I think this may help you.