Tags: r, string, data-cleaning

match_df from matchmaker does not work on all columns?


I am cleaning string data using the matchmaker package. I've created my dictionary with from, to, and col columns to define the uncleaned terms, cleaned terms, and column names where the uncleaned terms in the file can be found.

The original data look something like this:

ID <- 1:5
var1 <- c("aaa", "bbb", "ccc", "ddd", NA)
var2 <- c("ccc", "ddd", NA, NA,"aaa")
var3 <- c(NA, NA, "bbb", NA, "aaa")

df <- data.frame(ID, var1, var2, var3)

Here is what the dictionary looks like:

from <- c("aaa", "bbb", "ccc", "ddd", 
          "ccc", "ddd", "aaa", 
          "bbb", "aaa")
to <- c("Aaa", "Bbb", "Ccc", "Ddd", 
        "Ccc", "Ddd", "Aaa", 
        "Bbb", "Aaa")
col <- c("var1", "var1", "var1", "var1", 
         "var2", "var2", "var2", 
         "var3", "var3")

dictionary <- data.frame(from, to, col)

I used the following code:

library(matchmaker)
match_df(df, dictionary = dictionary, 
              from = "from", 
              to = "to", 
              by = "col")

Here is the result I expected:

  ID var1 var2 var3
1  1  Aaa  Ccc <NA>
2  2  Bbb  Ddd <NA>
3  3  Ccc <NA>  Bbb
4  4  Ddd <NA> <NA>
5  5 <NA>  Aaa  Aaa

Here is the result I got:

  ID var1 var2 var3
1  1  aaa  Ccc <NA>
2  2  bbb  Ddd <NA>
3  3  ccc <NA>  Bbb
4  4  ddd <NA> <NA>
5  5 <NA>  Aaa  Aaa

The code works for this example, but it did not work with my real dataset. Does anyone have any idea how to fix this? Thanks in advance.


Solution

  • To anyone who comes across this thread looking for a solution to a similar issue that produces the following message:

    1. NA Each element of '...' must be a named string.

    Check whether your dictionary contains any NA or blank entries. Once you remove the rows containing NA, your match_df() call should work for all columns.
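    A minimal sketch of that check, in case it helps. The toy dictionary, the toy df, and the helper is_bad() are made up for illustration, and the sketch assumes that a row is unusable when any of from, to, or col is NA or blank. If blanks are intentional in your dictionary, adjust the filter accordingly.

```r
# Hypothetical dictionary with two bad rows: an NA and a blank column name
dictionary <- data.frame(
  from = c("aaa", "bbb", "ccc", "ddd"),
  to   = c("Aaa", "Bbb", "Ccc", "Ddd"),
  col  = c("var1", "var1", NA, ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where a value is NA or only whitespace
is_bad <- function(x) is.na(x) | trimws(as.character(x)) == ""

# Flag rows with a missing or blank entry in any of the three columns
bad_rows <- is_bad(dictionary$from) |
  is_bad(dictionary$to) |
  is_bad(dictionary$col)

# Inspect what is about to be dropped before dropping it
print(dictionary[bad_rows, , drop = FALSE])

dictionary_clean <- dictionary[!bad_rows, , drop = FALSE]

# Toy data to recode; run match_df() only if matchmaker is installed
df <- data.frame(ID = 1:2, var1 = c("aaa", "bbb"), stringsAsFactors = FALSE)

if (requireNamespace("matchmaker", quietly = TRUE)) {
  cleaned <- matchmaker::match_df(df,
                                  dictionary = dictionary_clean,
                                  from = "from",
                                  to = "to",
                                  by = "col")
  print(cleaned)
}
```

    Printing the flagged rows first makes it easier to tell whether they are stray empty lines from a spreadsheet export or entries that need to be filled in rather than removed.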