Using dplyr::mutate to loop through all available methods in stringdist


I am doing some fuzzy text matching to match school names. Here is an example of my data, which is two columns in a tibble:

data <- tibble(school1 = c("abilene christian", "abilene christian", "abilene christian", "abilene christian"),
               school2 = c("a t still university of health sciences", "abilene christian university", "abraham baldwin agricultural college", "academy for five element acupuncture"))
data
# A tibble: 4 x 2
  school1           school2                                
  <chr>             <chr>                                  
1 abilene christian a t still university of health sciences
2 abilene christian abilene christian university           
3 abilene christian abraham baldwin agricultural college   
4 abilene christian academy for five element acupuncture 

What I would like to do is use stringdist to run through all the available methods and return a table that looks like this, where my original text remains in addition to a column for each method and the value returned:

# A tibble: 4 x 12
  school1           school2       osa    lv    dl hamming   lcs qgram cosine jaccard    jw soundex
  <chr>             <chr>       <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl>   <dbl>
1 abilene christian a t still …  29.0  29.0  29.0     Inf  36.0  24.0 0.189    0.353 0.442    1.00
2 abilene christian abilene ch…  11.0  11.0  11.0     Inf  11.0  11.0 0.0456   0.200 0.131    0   
3 abilene christian abraham ba…  28.0  28.0  28.0     Inf  35.0  25.0 0.274    0.389 0.431    1.00
4 abilene christian academy fo…  28.0  28.0  28.0     Inf  37.0  29.0 0.333    0.550 0.445    1.00

I can get this to work with a for loop:

method_list <- c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex")
for (i in method_list) {
  data[, i] <- stringdist(data$school1, data$school2, method = i)
}

What I would like to do is convert this into the more readable dplyr syntax, but I can't get the loop to work with mutate. Here is what I have:

for (i in method_list) {
  ft_result <- data %>%
    mutate(i = stringdist(school1, school2, method = i))
}

Running this adds a single column to my original data, literally named "i", with a value of 1 for every row.

Question 1: Is a for-loop the best way to accomplish this? I looked at purrr to see if I could use something like map or invoke, but I don't think any of those functions do what I want.

Question 2: If a for-loop is the way to go, how can I make it work with mutate? I tried using mutate_at, but that didn't work either.


Solution

  • This seems like a great place to use purrr::map_dfc.

    The general idea is to map over the methods, pass each one to stringdist, and wrap each result in a data frame. purrr::set_names then names the resulting columns after their methods.


    library(tidyverse)
    library(stringdist)
    
    method_list <- c("osa", "lv", "dl", "hamming", "lcs", "qgram",
                     "cosine", "jaccard", "jw", "soundex")
    
    tb <- starwars[c("name", "homeworld")]
    
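    method_list %>%
      # one single-column tibble of distances per method, bound column-wise
      map_dfc(function(str_method) {
        tibble(stringdist(tb$name, tb$homeworld, method = str_method))
      }) %>%
      # name the distance columns after their methods
      set_names(method_list) %>%
      # put the original columns back in front
      bind_cols(tb, .)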
    #> Warning in do_dist(a = b, b = a, method = method, weight = weight, maxDist
    #> = maxDist, : Non-printable ascii or non-ascii characters in soundex.
    #> Results may be unreliable. See ?printable_ascii.
    #> # A tibble: 87 x 12
    #>                  name homeworld   osa    lv    dl hamming   lcs qgram
    #>                 <chr>     <chr> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>
    #>  1     Luke Skywalker  Tatooine    13    13    13     Inf    18    18
    #>  2              C-3PO  Tatooine     8     8     8     Inf    13    13
    #>  3              R2-D2     Naboo     5     5     5       5    10    10
    #>  4        Darth Vader  Tatooine     8     8     8     Inf    13    13
    #>  5        Leia Organa  Alderaan     8     8     8     Inf    11     9
    #>  6          Owen Lars  Tatooine     9     9     9     Inf    15    11
    #>  7 Beru Whitesun lars  Tatooine    16    16    16     Inf    22    16
    #>  8              R5-D4  Tatooine     8     8     8     Inf    13    13
    #>  9  Biggs Darklighter  Tatooine    14    14    14     Inf    19    17
    #> 10     Obi-Wan Kenobi   Stewjon    13    13    13     Inf    17    15
    #> # ... with 77 more rows, and 4 more variables: cosine <dbl>,
    #> #   jaccard <dbl>, jw <dbl>, soundex <dbl>
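
On Question 2: the loop in the question has two separate problems. `mutate(i = ...)` names the new column literally "i" rather than using the value of `i`, and every iteration starts again from `data`, so only the last method's result survives. Below is a minimal sketch of one way around both, assuming a dplyr version that supports tidy evaluation (`!!` and `:=`). The names `m` and `map_result` are only illustrative; `ft_result` is taken from the question. The second half applies the map_dfc idea to the question's own data, naming the methods up front with set_names() so each column arrives already labelled.

```r
library(dplyr)
library(purrr)
library(stringdist)

# The example data from the question
data <- tibble(
  school1 = rep("abilene christian", 4),
  school2 = c("a t still university of health sciences",
              "abilene christian university",
              "abraham baldwin agricultural college",
              "academy for five element acupuncture")
)

method_list <- c("osa", "lv", "dl", "hamming", "lcs", "qgram",
                 "cosine", "jaccard", "jw", "soundex")

# Loop version: accumulate into ft_result instead of restarting from `data`,
# and use `!!m :=` so the column takes the *value* of m as its name.
ft_result <- data
for (m in method_list) {
  ft_result <- ft_result %>%
    mutate(!!m := stringdist(school1, school2, method = m))
}

# Map version: set_names() with no extra arguments names each element
# after itself, so map_dfc() returns columns already named by method.
map_result <- method_list %>%
  set_names() %>%
  map_dfc(~ stringdist(data$school1, data$school2, method = .x)) %>%
  bind_cols(data, .)

ft_result
```

Both versions should give the same twelve-column table. The loop is arguably easier to read for a handful of methods, while the map version avoids reassigning the data frame on every pass.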