
Scala regex UDF to grab query parameter values and transform them into a comma-delimited list


I have data that looks similar to the following:

one=1&two=22222&three=&four=4f4

As you can see, the value for the variable three is missing. I would like to use a Scala regex to grab all the values and return them as a comma-delimited string.

Desired Output:

1,22222,,4f4

Another possible output, which I would prefer:

1,22222,undefined,4f4

This is my current code (I am using Scala with Spark 2.0 on a DataFrame):

def main(args: Array[String]) {
  ...
  val pattern : scala.util.matching.Regex = """[^&?]*?=([^&?]*)""".r
  df.select(transform(pattern)($"data").alias("csvData")).take(100).foreach(println)
}

def transform(pattern: scala.util.matching.Regex) = udf(
 (dataMapping: String) => pattern.findAllIn(dataMapping).toList
)

Which returns:

[WrappedArray(one=1, two=22222, three=, four=4f4)]
[WrappedArray(...)]

I think I can do better with my "transform" UDF, but I am very new to Scala and am unsure how to match only the first capture group of each pair and return the results comma-separated. I would guess I need something like m => m.group(1) in my solution, but I'm not sure. Thank you for your suggestions.


Solution

  • If you have multiple columns you would probably be best off using a UDF:

    scala> val df = Seq(("one=1&two=22222&three=&four=4f4", 1)).toDF("a", "b")
    df: org.apache.spark.sql.DataFrame = [a: string, b: int]
    
    scala> df.show
    +--------------------+---+
    |                   a|  b|
    +--------------------+---+
    |one=1&two=22222&t...|  1|
    +--------------------+---+
    
    scala> val p = """(?:one|two|three|four)=(.+)""".r
    p: scala.util.matching.Regex = (?:one|two|three|four)=(.+)
    
    scala> :pa
    // Entering paste mode (ctrl-D to finish)
    
    val regexUDF = udf( (x: String) =>
        x.split("&").map(p.findFirstMatchIn(_).map(_.group(1)).getOrElse(null))
        )
    
    // Exiting paste mode, now interpreting.
    
    regexUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StringType,true),Some(List(StringType)))
    
    scala> val df2 = df.withColumn("a", regexUDF($"a"))
    df2: org.apache.spark.sql.DataFrame = [a: array<string>, b: int]   
    
    scala> df2.show
    +--------------------+---+
    |                   a|  b|
    +--------------------+---+
    |[1, 22222, null, ...|  1|
    +--------------------+---+
    
    
    scala> df2.collect.foreach{println}
    [WrappedArray(1, 22222, null, 4f4),1]
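The UDF above returns an array with null for the missing value, whereas the question asks for a single comma-delimited string, ideally with undefined in place of empty values. A minimal sketch of that variant follows, using the pattern from the question together with findAllMatchIn and group(1), as the asker guessed. The names QueryValues, valuesCsv and placeholder are hypothetical and not part of the original answer, and the sketch assumes the input string is never null.

```scala
import scala.util.matching.Regex

object QueryValues {
  // Pattern from the question: group 1 captures everything after '='
  // up to the next '&' or '?' (possibly nothing at all).
  val pattern: Regex = """[^&?]*?=([^&?]*)""".r

  // Hypothetical helper (not from the original answer): collects group 1 of
  // every key=value pair and joins the values with commas, substituting
  // `placeholder` for empty values. Assumes `data` is not null.
  def valuesCsv(data: String, placeholder: String = "undefined"): String =
    pattern.findAllMatchIn(data)
      .map(_.group(1))
      .map(v => if (v.isEmpty) placeholder else v)
      .mkString(",")
}
```

Passing an empty placeholder, as in QueryValues.valuesCsv(data, ""), should give the first desired output with the empty slot between commas. In Spark the helper could presumably be wrapped the same way as above, for example udf((s: String) => QueryValues.valuesCsv(s)), and applied with withColumn.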