I am cleaning string data using the matchmaker package. I've created my dictionary with from, to, and col columns to define the uncleaned terms, the cleaned terms, and the names of the columns where the uncleaned terms can be found.
The original data look something like this:
ID <- 1:5
var1 <- c("aaa", "bbb", "ccc", "ddd", NA)
var2 <- c("ccc", "ddd", NA, NA,"aaa")
var3 <- c(NA, NA, "bbb", NA, "aaa")
df <- data.frame(ID, var1, var2, var3)
Here is what the dictionary looks like:
from <- c("aaa", "bbb", "ccc", "ddd",
"ccc", "ddd", "aaa",
"bbb", "aaa")
to <- c("Aaa", "Bbb", "Ccc", "Ddd",
"Ccc", "Ddd", "Aaa",
"Bbb", "Aaa")
col <- c("var1", "var1", "var1", "var1",
"var2", "var2", "var2",
"var3", "var3")
dictionary <- data.frame(from, to, col)
I used the following code:
library(matchmaker)
match_df(df, dictionary = dictionary,
from = "from",
to = "to",
by = "col")
Here is the result I expected:
ID var1 var2 var3
1 1 Aaa Ccc <NA>
2 2 Bbb Ddd <NA>
3 3 Ccc <NA> Bbb
4 4 Ddd <NA> <NA>
5 5 <NA> Aaa Aaa
Here is the result I got:
ID var1 var2 var3
1 1 aaa Ccc <NA>
2 2 bbb Ddd <NA>
3 3 ccc <NA> Bbb
4 4 ddd <NA> <NA>
5 5 <NA> Aaa Aaa
The code works for this example, but it does not work with my real dataset. Does anyone have any idea how to fix this? Thanks in advance.
To anyone who comes across this thread looking for a solution to a similar issue that produces the following message:
- NA Each element of '...' must be a named string.
check whether your dictionary contains any NA or blank entries. Once you remove the rows with NA, your match_df() command should work for all columns.
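For anyone who wants to locate those rows before rerunning match_df(), here is a minimal base R sketch. The dictionary below is a made-up example, and the helper is_missing() and the names bad_rows and dictionary_clean are hypothetical. It also assumes that an NA or blank in any of the three columns is worth flagging; which column actually triggers the message in your data may differ.

```r
# Made-up dictionary with NA and blank entries, for illustration only
dictionary <- data.frame(
  from = c("aaa", "bbb", NA, "ddd", ""),
  to   = c("Aaa", "Bbb", "Ccc", NA, "Eee"),
  col  = c("var1", "var1", "var1", "var2", "var2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for entries that are NA or only whitespace
is_missing <- function(x) is.na(x) | trimws(x) == ""

# Flag rows where any of the three dictionary columns is missing
bad_rows <- is_missing(dictionary$from) |
  is_missing(dictionary$to) |
  is_missing(dictionary$col)

# Inspect the offending rows before dropping them
print(dictionary[bad_rows, ])

# Keep only the complete rows
dictionary_clean <- dictionary[!bad_rows, ]
```

You would then pass dictionary_clean to match_df() in place of the original dictionary. Looking at the flagged rows first is worthwhile, since some of them may need a corrected value rather than removal.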