I'm trying to find all the texts that contain at least one keyword in the list of keywords given. This is similar to this answer: String Containing Exact Substring from Substring List
However, I need to extend it to keywords made up of several words, for example matching 'united states' and not only 'usa'.
import org.apache.spark.sql.functions.udf
import spark.implicits._  // needed for $"country"; `spark` is the active SparkSession

val df = spark.createDataFrame(Seq(
  (1, "usa of america"),
  (2, "usa"),
  (4, "united states of america"),
  (5, "states"),
  (6, "united states")
)).toDF("id", "country")
df.registerTempTable("df")
val valid_names = Set("usa", "united states")
// True when any single space-separated token of the input is in valid_words.
def udf_check_country(valid_words: Set[String]) = {
  udf { (words: String) => words.split(" ").exists(valid_words.contains) }
}
var df2 = df.withColumn("udf_check_country", udf_check_country(valid_names)($"country"))
df2.registerTempTable("df2")
df2.show()
The new column comes out false for the rows that contain the multi-word keyword 'united states' (ids 4 and 6):
+---+--------------------+-----------------+
| id|             country|udf_check_country|
+---+--------------------+-----------------+
|  1|      usa of america|             true|
|  2|                 usa|             true|
|  4|united states of ...|            false|
|  5|              states|            false|
|  6|       united states|            false|
+---+--------------------+-----------------+
How can I make it work for keywords with multiple words?
Depending on your rules, you can simply add another condition that checks each entry of valid_names against the whole string, like:
valid_words.exists(words.contains) || words.split(" ").exists(valid_words.contains)
That will make ids 4 and 6 return true as well.
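One thing to keep in mind with the condition above: words.contains is a plain substring test, so 'usa' would also match inside a longer word such as 'usage'. If your rules require whole-word matches, a minimal sketch of an alternative is below. The helper name containsKeyword is hypothetical (it is not from the question), and the sketch assumes non-null input with words separated by whitespace.

```scala
// Hypothetical helper: true when `text` contains any keyword as a whole
// word or as a whole multi-word phrase.
def containsKeyword(validWords: Set[String])(text: String): Boolean = {
  // Normalise runs of whitespace to single spaces and pad both ends,
  // so every word in the text is surrounded by exactly one space.
  val padded = " " + text.trim.split("\\s+").mkString(" ") + " "
  // A keyword matches only if it appears between two spaces,
  // which rules out partial-word hits such as "usa" inside "usage".
  validWords.exists(w => padded.contains(" " + w + " "))
}

val validNames = Set("usa", "united states")
val check = containsKeyword(validNames) _

// The five rows from the question, plus one partial-word case.
Seq("usa of america", "usa", "united states of america",
    "states", "united states", "usage report")
  .foreach(s => println(s"$s -> ${check(s)}"))
```

To use it in the DataFrame from the question, it should be possible to wrap it the same way as the original UDF, for example udf((s: String) => containsKeyword(valid_names)(s)).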