Tags: scala, apache-spark, explode

How to explode in spark with delimiter


I have a table:

    id  itemNames                                           coupons
    1   item (foo bar) is available, soaps                  true
    2   item (bar) is available                             false
    3   soaps, shampoo                                      false
    4   item (foo bar, bar) is available                    true
    5   item (foo bar, bar) is available, (soap, shampoo)   true
    6   null                                                false

I want to explode this to

    id  itemNames                          coupons
    1   item (foo bar) is available        true
    1   soaps                              true
    2   item (bar) is available            false
    3   soaps                              false
    3   shampoo                            false
    4   item (foo bar, bar) is available   true
    5   item (foo bar, bar) is available   true
    5   (soap, shampoo)                    true
    6   null                               false

When I do:

 df.withColumn("itemNames", explode(split($"itemNames", "[,]")))

I am getting:

itemNames                                          coupons
item (foo bar) is available                        true       
soaps                                              true 
item (bar) is available                            false
soaps                                              false
shampoo                                            false
item (foo bar,                                     true
bar) is available                                  true 
(soap,                                             true    
shampoo)                                           true

Can someone tell me what I am doing wrong and how I can correct it? The one pattern common to all rows is that the commas which should not split the string appear inside parentheses.


Solution

  • Using a UDF, inspired by Regex to match only commas not in parentheses?:

    import java.util.regex.Pattern

    import org.apache.spark.sql.functions.{explode, udf}
    import spark.implicits._

    val df = List(
      ("item (foo bar) is available, soaps", true),
      ("item (bar) is available", false),
      ("soaps, shampoo", false),
      ("item (foo bar, bar) is available", true),
      ("item (foo bar, bar) is available, (soap, shampoo)", true)
    ).toDF("itemNames", "coupons")
    df.show(false)
    
    val regex = Pattern.compile(
      ",         # Match a comma\n" +
        "(?!       # only if it's not followed by...\n" +
        " [^(]*    #   any number of characters except opening parens\n" +
        " \\)      #   followed by a closing parens\n" +
        ")         # End of lookahead",
      Pattern.COMMENTS)
    
    val customSplit = (value: String) => regex.split(value)
    val customSplitUDF = udf(customSplit)
    val result = df.withColumn("itemNames", explode(customSplitUDF($"itemNames")))
    result.show(false)
    

    Output is:

    +--------------------------------+-------+
    |itemNames                       |coupons|
    +--------------------------------+-------+
    |item (foo bar) is available     |true   |
    | soaps                          |true   |
    |item (bar) is available         |false  |
    |soaps                           |false  |
    | shampoo                        |false  |
    |item (foo bar, bar) is available|true   |
    |item (foo bar, bar) is available|true   |
    | (soap, shampoo)                |true   |
    +--------------------------------+-------+
    

    If trimming is required, it can be added to customSplit easily, e.g. regex.split(value).map(_.trim).
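
    As a side note (a sketch, not part of the original answer): Spark's built-in split in org.apache.spark.sql.functions also treats its pattern argument as a Java regex, so the same lookahead can be passed to it directly and the UDF avoided, e.g. explode(split($"itemNames", ",(?![^(]*\\))")). The lookahead itself can be sanity-checked outside Spark:

    ```scala
    import java.util.regex.Pattern

    object LookaheadCheck extends App {
      // Split only on commas that are NOT followed by a ")" without an
      // intervening "(" -- i.e. only on commas outside parentheses.
      val regex = Pattern.compile(",(?![^(]*\\))")

      val parts = regex
        .split("item (foo bar, bar) is available, (soap, shampoo)")
        .map(_.trim)

      parts.foreach(println)
      // item (foo bar, bar) is available
      // (soap, shampoo)
    }
    ```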