I am doing some fuzzy text matching to match school names. Here is an example of my data, which is two columns in a tibble:
library(tibble)
data <- tibble(school1 = c("abilene christian", "abilene christian", "abilene christian", "abilene christian"),
               school2 = c("a t still university of health sciences", "abilene christian university", "abraham baldwin agricultural college", "academy for five element acupuncture"))
data
# A tibble: 4 x 2
school1 school2
<chr> <chr>
1 abilene christian a t still university of health sciences
2 abilene christian abilene christian university
3 abilene christian abraham baldwin agricultural college
4 abilene christian academy for five element acupuncture
What I would like to do is use stringdist to run through all the available methods and return a table that looks like this, where my original text remains in addition to a column for each method and the value returned:
# A tibble: 4 x 12
school1 school2 osa lv dl hamming lcs qgram cosine jaccard jw soundex
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 abilene christian a t still … 29.0 29.0 29.0 Inf 36.0 24.0 0.189 0.353 0.442 1.00
2 abilene christian abilene ch… 11.0 11.0 11.0 Inf 11.0 11.0 0.0456 0.200 0.131 0
3 abilene christian abraham ba… 28.0 28.0 28.0 Inf 35.0 25.0 0.274 0.389 0.431 1.00
4 abilene christian academy fo… 28.0 28.0 28.0 Inf 37.0 29.0 0.333 0.550 0.445 1.00
I can get this to work with the following for loop:
method_list <- c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex")
for (i in method_list) {
data[, i] <- stringdist(data$school1, data$school2, method = i)
}
What I would like to do is convert this into the more readable dplyr syntax, but I can't get the loop to work with mutate. Here is what I have:
for (i in method_list) {
ft_result <- data %>%
mutate(i = stringdist(school1, school2, method = i))
}
Running this returns 1 additional column added to my original data called "i" with a value of 1 for every row.
Question 1: Is a for-loop the best way to accomplish what I am trying to get to? I looked at purrr to see if I could use something like map or invoke, but I don't think any of those functions do what I want.
Question 2: If a for-loop is the way to go, how can I make it work with mutate? I tried using mutate_at, but that didn't work either.
This seems like a great place to use purrr::map_dfc. The general idea is to map over the methods, passing each one to stringdist and wrapping the result in a data frame; map_dfc then binds those results together column-wise. purrr::set_names also comes in handy for naming the new columns after their methods.
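library(tidyverse)
library(stringdist)
method_list <- c("osa", "lv", "dl", "hamming", "lcs", "qgram",
                 "cosine", "jaccard", "jw", "soundex")
# Example data: two character columns from dplyr's starwars
tb <- starwars[c("name", "homeworld")]
method_list %>%
  # one single-column tibble per method, bound together column-wise
  map_dfc(function(str_method) {
    tibble(stringdist(tb$name, tb$homeworld, method = str_method))
  }) %>%
  # name each column after its method, then attach the original columns
  set_names(method_list) %>%
  bind_cols(tb, .)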
#> Warning in do_dist(a = b, b = a, method = method, weight = weight, maxDist
#> = maxDist, : Non-printable ascii or non-ascii characters in soundex.
#> Results may be unreliable. See ?printable_ascii.
#> # A tibble: 87 x 12
#> name homeworld osa lv dl hamming lcs qgram
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Luke Skywalker Tatooine 13 13 13 Inf 18 18
#> 2 C-3PO Tatooine 8 8 8 Inf 13 13
#> 3 R2-D2 Naboo 5 5 5 5 10 10
#> 4 Darth Vader Tatooine 8 8 8 Inf 13 13
#> 5 Leia Organa Alderaan 8 8 8 Inf 11 9
#> 6 Owen Lars Tatooine 9 9 9 Inf 15 11
#> 7 Beru Whitesun lars Tatooine 16 16 16 Inf 22 16
#> 8 R5-D4 Tatooine 8 8 8 Inf 13 13
#> 9 Biggs Darklighter Tatooine 14 14 14 Inf 19 17
#> 10 Obi-Wan Kenobi Stewjon 13 13 13 Inf 17 15
#> # ... with 77 more rows, and 4 more variables: cosine <dbl>,
#> # jaccard <dbl>, jw <dbl>, soundex <dbl>
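On Question 2: if you would rather keep the for loop, here is a minimal sketch of one way it could work with mutate, reusing the data from the question. Two things are assumed to be going wrong in the original attempt: mutate(i = ...) takes i literally as the column name, and each pass starts again from data, so earlier columns are not kept. The sketch uses the `!!i :=` form from tidy evaluation to name the column after the value of i, and accumulates into ft_result. The name ft_result comes from the question; everything else is just illustration, not the only way to do it.

```r
library(dplyr)
library(stringdist)

# Data from the question
data <- tibble(
  school1 = rep("abilene christian", 4),
  school2 = c("a t still university of health sciences",
              "abilene christian university",
              "abraham baldwin agricultural college",
              "academy for five element acupuncture")
)

method_list <- c("osa", "lv", "dl", "hamming", "lcs", "qgram",
                 "cosine", "jaccard", "jw", "soundex")

# Start from the original data and add one column per pass,
# so the result of each iteration is carried into the next
ft_result <- data
for (i in method_list) {
  ft_result <- ft_result %>%
    # `!!i :=` uses the value of i (e.g. "osa") as the new column name
    mutate(!!i := stringdist(school1, school2, method = i))
}

ft_result
```

This should give the two original columns followed by one column per method. The map_dfc version above avoids the explicit loop, so which one reads better is mostly a matter of taste.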