scala apache-spark apache-spark-sql pattern-matching contains

Text containing exact string from list of strings

I'm trying to find all the texts that contain at least one keyword in the list of keywords given. This is similar to this answer: String Containing Exact Substring from Substring List

However, I need to expand it so that it can work with multiple words, for example matching 'united states' and not simply 'usa'.


import org.apache.spark.sql.functions.udf
import spark.implicits._  // enables the $"country" column syntax

val df = spark.createDataFrame(Seq(
  (1, "usa of america"),
  (2, "usa"),
  (4, "united states of america"),
  (5, "states"),
  (6, "united states")
)).toDF("id", "country")
df.createOrReplaceTempView("df")

val valid_names = Set("usa", "united states")

// True if any single space-separated word of the text is in valid_words
def udf_check_country(valid_words: Set[String]) =
  udf { (words: String) => words.split(" ").exists(valid_words.contains) }

val df2 = df.withColumn("udf_check_country", udf_check_country(valid_names)($"country"))
df2.createOrReplaceTempView("df2")

df2.show()

The new column comes out false for the multi-word cases, both 'united states of america' (id 4) and 'united states' (id 6):


+---+--------------------+-----------------+
| id|             country|udf_check_country|
+---+--------------------+-----------------+
|  1|      usa of america|             true|
|  2|                 usa|             true|
|  4|united states of ...|            false|
|  5|              states|            false|
|  6|       united states|            false|
+---+--------------------+-----------------+

How can I make it work for keywords with multiple words?

Solution

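The UDF splits the text into single words, so a two-word keyword such as 'united states' can never equal any one of them. Depending on your rules, you can simply add another condition that checks each of your valid_names against the whole string, like: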

valid_words.exists(words.contains) || words.split(" ").exists(valid_words.contains)

With that change, ids 4 and 6 also return true.
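
One caveat, depending on how strict "exact" needs to be: String.contains is a plain substring test, so a keyword like 'usa' would also match inside a longer word such as 'thousand'. If keywords should only match on whole-word boundaries, one option is to compare token sequences instead. The following is a minimal sketch under that assumption; KeywordMatch and containsKeyword are hypothetical names, not part of Spark or of the code above, and it splits on any whitespace rather than a single space.

```scala
// Hypothetical helper (not from the original answer): matches keywords
// only on whole-word boundaries by comparing token sequences.
object KeywordMatch {
  // Split a string into non-empty whitespace-separated tokens.
  private def tokens(s: String): List[String] =
    s.split("\\s+").filter(_.nonEmpty).toList

  // True if the tokens of any keyword appear contiguously, in order,
  // among the tokens of the text.
  def containsKeyword(validNames: Set[String])(text: String): Boolean = {
    val textTokens = tokens(text)
    validNames.exists { keyword =>
      val keywordTokens = tokens(keyword)
      keywordTokens.nonEmpty && textTokens.containsSlice(keywordTokens)
    }
  }
}
```

This should be usable from a UDF in the same way as the original lambda, for example udf { (words: String) => KeywordMatch.containsKeyword(valid_names)(words) }. Like the original, it does not handle null input.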