r  function  purrr  nested-function

Object not found in function environment for nested objects


I have a code snippet that I am trying to convert into a function. The function is supposed to look for potential spelling errors in a manual-entry field. The snippet works, and you can try it out like this, using the starwars data that ships with the tidyverse:

require(tidyverse)
require(rlang)            # loaded for {{ to force function arguments as well as the with_env() function
require(RecordLinkage)    # loaded for the jarowinkler() function

starwars_cleaning <- starwars %>%
  add_count(name, name = "Freq_name") %>%   # this keeps track of which spelling is more frequent
  distinct(name, .keep_all = T) %>%         # this prevents duplicated comparisons and self-comparisons
  nest_by(homeworld, .key = ".Nest") %>%
  mutate(Mapped = list(imap_dfr(.x = .Nest$name,
                                .f = ~jarowinkler(str1 = .x,
                                                  str2 = .Nest$name[-.y]) %>% 
                                  list() %>% 
                                  tibble(Score_n = ., Match_n = list(.Nest$name[-.y]),
                                         Freq_n = list(.Nest$Freq_name[-.y]))
                                
  )))

The function should take two kinds of arguments: the variable(s) to nest on (passed through the ellipsis) and the variable to search for potentially misspelled matches. Right now, it looks like this:

string_matching <- function(.df, .string_col, ...){
  .df$.tmp_string <- .df %>% select({{.string_col}})
  .df <- .df %>%
    add_count(.tmp_string, name = "Freq_name") %>%
    distinct(.tmp_string, .keep_all = T) %>% 
    nest_by(..., .key = ".Nest") %>%
    mutate(Mapped_n = list(with_env(env = current_env(),  # same error with or without specifying the execution environment for imap
                                    expr = imap_dfr(.x = .Nest$.tmp_string,
                                                    .f = ~jarowinkler(str1 = .x,
                                                                      str2 = .Nest$.tmp_string[-.y]) %>% 
                                                      list() %>% 
                                                      tibble(Score_n = ., Match_n = list(.Nest$.tmp_string[-.y]),
                                                             Freq_n = list(.Nest$Freq_name[-.y]))
                                                    )
                                    ))
           )
  return(.df)
}
starwars %>% 
  string_matching(name, homeworld)

On the starwars data this clearly isn't very useful, and I cut some features from the code to get a MWE, but that's the idea. When I wrap the code in a function like this, it fails with "invalid argument to unary operator" (apparently caused by the [-.y]). I tried force() after reading this post, since this problem apparently comes up a lot. Given the error and that post, I suspected the function environment was causing imap_dfr() to lose track of the data somehow, so I wrapped the mapping call in with_env() and told it to use the function environment rather than its own. I also tried breaking up the function by assigning an intermediate object to the global environment so that the mapping step could find it:

assign(x = "TEMP", value = .df$.Nest, envir = global_env())

That gave me the same "unary operator" error. I'm not sure what to try next; I seem to be going in circles. Any insight into what is causing this problem and how to fix it would be greatly appreciated.


Solution

  • I don't think the post you linked is relevant here, and I don't think the problem has anything to do with the execution environment. The real issue is how the column is passed into your function. When you create .tmp_string, select() returns a one-column tibble rather than a vector of the column's values. imap_dfr() then iterates over that tibble's columns, so .y is the column name (a string) instead of a position, and -.y fails with "invalid argument to unary operator". Use pull() instead to extract the values as a vector.

    string_matching <- function(.df, .string_col, ...){
      .df$.tmp_string <- .df %>% pull({{.string_col}})
      .df <- .df %>%
        add_count(.tmp_string, name = "Freq_name") %>%
        distinct(.tmp_string, .keep_all = T) %>% 
        nest_by(..., .key = ".Nest") %>%
        mutate(Mapped_n = list(with_env(env = current_env(),  # with_env() is carried over from the question; it is not what fixes the error
                                        expr = imap_dfr(.x = .Nest$.tmp_string,
                                                        .f = ~jarowinkler(str1 = .x,
                                                                          str2 = .Nest$.tmp_string[-.y]) %>% 
                                                          list() %>% 
                                                          tibble(Score_n = ., Match_n = list(.Nest$.tmp_string[-.y]),
                                                                 Freq_n = list(.Nest$Freq_name[-.y]))
                                                        )
                                        ))
               )
      return(.df)
    }
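
    To see why the tibble trips up the mapping step, here is a minimal sketch on a hypothetical three-row toy tibble (the toy data and the object names are assumptions for illustration, not from the question). It compares what select() and pull() return, and what imap() passes as .y in each case:

    ```r
    library(dplyr)
    library(purrr)

    # Hypothetical toy data, not from the original post
    toy <- tibble(name = c("Luke", "Leia", "Han"))

    as_tbl <- toy %>% select(name)  # one-column tibble
    as_vec <- toy %>% pull(name)    # plain character vector

    # imap() passes the names as .y when the input has names,
    # and integer positions when it does not
    idx_tbl <- imap(as_tbl, ~ .y)   # iterates over columns, .y is "name"
    idx_vec <- imap(as_vec, ~ .y)   # iterates over elements, .y is 1, 2, 3

    # Negating a string reproduces the error message from the question
    err <- tryCatch(-idx_tbl[[1]], error = function(e) conditionMessage(e))

    # Negating an integer position drops that element, as intended
    others <- as_vec[-idx_vec[[2]]]
    ```

    With the vector, [-.y] removes the current element so each name is only compared against the others; with the tibble there is no position to negate.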
    

    Or you could write your code to avoid the need for that temp column completely:

    string_matching <- function(.df, .string_col, ...){
      col <- rlang::ensym(.string_col)
      .df <- .df %>%
        add_count(!!col, name = "Freq_name") %>%
        distinct(!!col, .keep_all = T) %>% 
        nest_by(..., .key = ".Nest") %>%
        # !!col is used in every pull() so the captured column symbol is unquoted consistently
        mutate(Mapped_n = list(imap_dfr(.x = .Nest %>% pull(!!col),
                                        .f = ~jarowinkler(str1 = .x,
                                                          str2 = (.Nest %>% pull(!!col))[-.y]) %>% 
                                          list() %>% 
                                          tibble(Score_n = ., Match_n = list((.Nest %>% pull(!!col))[-.y]),
                                                 Freq_n = list(.Nest$Freq_name[-.y]))
                                        ))
        )
      return(.df)
    }
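
    The ensym() and !! pattern can be tried in isolation. The sketch below uses a hypothetical helper, count_by(), and made-up toy data (neither is from the question) to show the captured column symbol being reused across add_count() and distinct():

    ```r
    library(dplyr)

    # Hypothetical helper for illustration only
    count_by <- function(.df, .col) {
      col <- rlang::ensym(.col)                    # capture the bare column name as a symbol
      .df %>%
        add_count(!!col, name = "Freq") %>%        # unquote it wherever a column is expected
        distinct(!!col, .keep_all = TRUE)
    }

    # Made-up data with one repeated name
    toy <- tibble(name      = c("Luke", "Luke", "Leia"),
                  homeworld = c("Tatooine", "Tatooine", "Alderaan"))

    out <- count_by(toy, name)
    ```

    Here out keeps one row per distinct name, with Freq recording how often each name appeared in the input.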