I have a table :
id itemNames coupons 1 item (foo bar) is available, soaps true 2 item (bar) is available false 3 soaps, shampoo false 4 item (foo bar, bar) is available true 5 item (foo bar, bar) is available, (soap, shampoo) true 6 null false
I want to explode this to
id itemNames coupons
1 item (foo bar) is available true
1 soaps true
2 item (bar) is available false
3 soaps false
3 shame false
4 item (foo bar, bar) is available true
5 item (foo bar, bar) is available true
6 (soap, shampoo) true
6 null true
when I do :
df.withColumn("itemNames", explode(split($"itemNames", "[,]")))
I am getting :
itemNames coupons
item (foo bar) is available true
soaps true
item (bar) is available false
soaps false
shampoo false
item (foo bar, true
bar) is available true
(soap, true
shampoo) true
Can someone tell me what I am doing wrong and how can I correct this ? The one pattern common here is the comma appears inside ().
With UDF and inspired by Regex to match only commas not in parentheses? :
val df = List(
("item (foo bar) is available, soaps", true),
("item (bar) is available", false),
("soaps, shampoo", false),
("item (foo bar, bar) is available", true),
("item (foo bar, bar) is available, (soap, shampoo)", true)
).
toDF("itemNames", "coupons")
df.show(false)
val regex = Pattern.compile(
", # Match a comma\n" +
"(?! # only if it's not followed by...\n" +
" [^(]* # any number of characters except opening parens\n" +
" \\) # followed by a closing parens\n" +
") # End of lookahead",
Pattern.COMMENTS)
val customSplit = (value: String) => regex.split(value)
val customSplitUDF = udf(customSplit)
val result = df.withColumn("itemNames", explode(customSplitUDF($"itemNames")))
result.show(false)
Output is:
+--------------------------------+-------+
|itemNames |coupons|
+--------------------------------+-------+
|item (foo bar) is available |true |
| soaps |true |
|item (bar) is available |false |
|soaps |false |
| shampoo |false |
|item (foo bar, bar) is available|true |
|item (foo bar, bar) is available|true |
| (soap, shampoo) |true |
+--------------------------------+-------+
If "trim" is required, can be added to "customSplit" easly.