I have 2 Spark dataframes
library(SparkR); library(magrittr)
df1 <- createDataFrame(data.frame(var1 = c("rat", "cat", "bat")))
df2 <- createDataFrame(data.frame(var2 = c("cat3", "bat1", "dog", "toy")))
I need to fuzzy match different lengthed var1 and var2 from different Spark DataFrames df1 and df2 using SparkR's levenshtein function so that the desired output becomes.
desired_df <- createDataFrame(data.frame(var2 = c("cat3", "bat1", "dog", "toy"),
var3 = c("cat", "bat", NA_character_, NA_character_)))
I started with the following code:
df3 <- df2 %>% SparkR::mutate(dist = levenshtein(df2$var2, df1$var1))
but got stuck with errors:
org.apache.spark.sql.AnalysisException: Resolved attribute(s) var1#176 missing from var2#178 in operator !Project [var2#178, levenshtein(var2#178, var1#176) AS dist#181].;;
!Project [var2#178, levenshtein(var2#178, var1#176) AS dist#181]
Please advise.
Your mistake is to reference columns from a table not present in the execution plan.
Adding crossJoin
would fix that:
dist_df <- df1 %>%
crossJoin(df2) %>%
withColumn("dist", levenshtein(df1$var1, df2$var2))
dist_df %>% head()
var1 var2 dist
1 rat cat3 2
2 rat bat1 2
3 rat dog 3
4 rat toy 3
5 cat cat3 1
6 cat bat1 2
From here you could use standard methods (How to select the first row of each group?) to find the closest match, for example:
best_matches <- dist_df %>%
groupBy("var2") %>%
agg(struct(dist_df$dist, dist_df$var1) %>% min() %>% alias("match"))
threshold <- 1 # Maximum match distance to keep
result <- best_matches %>%
select(
best_matches$var2,
when(best_matches$match.dist <= threshold, best_matches$match.var1) %>%
alias("var1"))
result %>% head()
var2 var1
1 dog <NA>
2 bat1 bat
3 cat3 cat
4 toy <NA>
Please keep in mind that this approach is very inefficient. Spark provides better options (Efficient string matching in Apache Spark), but these are not exposed in SparkR yet, and only partially implemented in sparklyr
.
If you want to keep all records, not matter the quality, just remove when
:
best_matches %>% select(best_matches$var2, best_matches$match.var1) %>% head()
var2 var1
1 dog bat
2 bat1 bat
3 cat3 cat
4 toy bat