I have the following data frame called result
.
MANUAL.ID | AUTO.ID | loc |
---|---|---|
NA | PYPPYP | L2 |
PIPpip | NA | L1 |
Barbar | NA | L5 |
NA | Pippip | L3 |
NA | Pippip,BerBar | L3 |
I try to replace all the different forms of a same tag by the right one. For example replace all PIPPIP and PIPpip by Pippip or Berbar by Barbar. To do this, I use a mutate
function with case_when
based on a required file called tesaurus
which have column with all the possible case of a same tag (tag_id) and a column with the correct one (tag_ok) which looks like this :
tag_id | tag_ok |
---|---|
PYPPYP | Pippip |
PIPpip | Pippip |
Pippip | Pippip |
Barbar | Barbar |
I use the following code.
library("plyr")
library("dplyr")
library("reshape")
library("data.table")
library("stringr")
library("tidyr")
library("openxlsx")
tesaurus <- read.xlsx("Requested_files/sp_tesaurus.xlsx", sheet = "tesaurus") %>%
select(-bat_sp)
result <- result %>%
mutate_at(
vars(contains("ID")),
list(as.formula(
paste0(
"~ case_when(",
paste(
sprintf(
"str_detect(., pattern = regex('%s', ignore_case = TRUE)) ~ '%s'",
tesaurus$tag_id,
tesaurus$tag_ok
),
collapse = ", "
), ", TRUE ~ .)"
)
))
)
My problem is that when I run it I have this error message :
Error in `mutate()`:
ℹ In argument: `AUTO.ID = (structure(function (..., .x = ..1, .y = ..2, . = ..1) ...`.
Caused by error in `case_when()`:
! Can't combine `..1 (right)` <character> and `..161 (right)` <double>.
---
Backtrace:
▆
1. ├─result %>% ...
2. ├─dplyr::mutate_at(...)
3. │ ├─dplyr::mutate(.tbl, !!!funs)
4. │ └─dplyr:::mutate.data.frame(.tbl, !!!funs)
5. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
6. │ ├─base::withCallingHandlers(...)
7. │ └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
8. │ └─mask$eval_all_mutate(quo)
9. │ └─dplyr (local) eval()
10. ├─`<inln_cl_>`(AUTO.ID)
11. │ └─rlang::eval_bare(`_quo`, base::parent.frame())
12. ├─rlang (local) case_when(...)
13. └─dplyr::case_when(...)
So, what is the problem and how can I fix it ?
I think a join/coalesce method will work here. Updated with your new data (i.e., "PYPPYP"
to "Pippip"
):
result %>%
left_join(tesaurus, by = c(MANUAL.ID = "tag_id")) %>%
mutate(MANUAL.ID = coalesce(tag_ok, MANUAL.ID)) %>%
select(-tag_ok) %>%
left_join(tesaurus, by = c(AUTO.ID = "tag_id")) %>%
mutate(AUTO.ID = coalesce(tag_ok, AUTO.ID)) %>%
select(-tag_ok)
# MANUAL.ID AUTO.ID loc
# 1 <NA> Pippip L2
# 2 Pippip <NA> L1
# 3 Barbar <NA> L5
# 4 <NA> Pippip L3
or we can use match
(though I think it's not better than above):
result %>%
mutate(across(c(MANUAL.ID, AUTO.ID),
~ coalesce(tesaurus$tag_ok[match(., tesaurus$tag_id)], .)))
# MANUAL.ID AUTO.ID loc
# 1 <NA> Pippip L2
# 2 Pippip <NA> L1
# 3 Barbar <NA> L5
# 4 <NA> Pippip L3
Your added example with "Pippip,BerBar"
is going to do nothing since you did not include BerBar
in the tesaurus
lookup dictionary. However, if we add it:
tesaurus <- bind_rows(tesaurus, data.frame(tag_id="BerBar", tag_ok="Barbar"))
tesaurus
# tag_id tag_ok
# 1 PYPPYP Pippip
# 2 PIPpip Pippip
# 3 Pippip Pippip
# 4 Barbar Barbar
# 5 BerBar Barbar
we can use strsplit
and the match
lookup as follows:
result %>%
mutate(across(c(MANUAL.ID, AUTO.ID),
~ sapply(strsplit(., "[, ]+"), function(st) paste(coalesce(tesaurus$tag_ok[match(st, tesaurus$tag_id)], st), collapse = ","))))
# MANUAL.ID AUTO.ID loc
# 1 NA Pippip L2
# 2 Pippip NA L1
# 3 Barbar NA L5
# 4 NA Pippip L3
# 5 NA Pippip,Barbar L3