In a nested list with three levels, I need to reach the third level (i.e., resources) and check whether some special characters exist in the data. I use the code below to do it:
obj1 <- list(resource = list(bodyPart = c("leg", "arm", "knee"),side = c("LEFT", "RIGHT", "LEFT"), device = c("LLI", "LSM", "GHT"), id = c("AA", "BB", "CC")) %>% as.data.frame(), cat = list(lab = c("aa", "bb", "cc"), id = c(23, 24, 25)) %>% as.data.frame())
obj2 <- list(resource = list(bodyPart = c("leg", "arm", "knee"), side = c("LEFT", "LEFT", "LEFT"), device = c("GOM", "LSM", "YYY"), id = c("ZZ", "DD", "FF")) %>% as.data.frame())
x <- list(foo = c(fer = "wdb", obj1), bar = obj2)
library(tibble)
data.frame(has_invalid_character = x |> rapply(f = \(node) grepl("[^\x01-\x7F]", node)),
content = x |> rapply(f = \(node) node)
)
Now I need to add the information included in the id
column to the row names so that I can identify where each case happens in my data. for instance, instead of foo.resource.bodyPart1
, I need to have foo.resource.bodyPart.AA
. Similarly, foo.resource.bodyPart2
, should be replaced by foo.resource.bodyPart.BB
Could you please advise me on it?
sample output:
has_invalid_character content
foo.fer FALSE wdb
foo.resource.bodyPart1.AA FALSE leg
foo.resource.bodyPart2.BB FALSE arm
foo.resource.bodyPart3.CC FALSE knee
foo.resource.side1.AA FALSE LEFT
foo.resource.side2.BB FALSE RIGHT
foo.resource.side3.CC FALSE LEFT
foo.cat.lab1.23 FALSE aa
foo.cat.lab1.24 FALSE bb
foo.cat.lab1.25 FALSE cc
based on your example data:
library(purrr)
library(tibble)
## ------------------ see (1)
modify_tree(x, leaf = \(node) {
if(is.data.frame(node) && !is.null(node$id)){
ids <- node$id
lapply(node, FUN = \(col) setNames(col, ids))
} else {
setNames(node, 'NA')
}
}) |>
## ------------------ see (2)
unlist() |>
data.frame() |>
setNames(nm = 'content') |>
rownames_to_column('parent') |>
rowwise() |>
mutate(frags = strsplit(parent, '\\.'),
parent = paste(rev(rev(frags)[-1]),collapse = '.'),
id = tail(frags, 1)
) |>
select(-frags) |>
## filter out the content of the initial id columns:
filter(!grepl('\\.id', parent)) |>
## do the actual scan for unwanted characters:
mutate(has_invalid_char = grepl('[^\x01-\x7F]', content))
(1) this one operates recursively down the nested lists, checking whether the node at hand is a dataframe and bears an ID-column (named "id"). If so, it digests the dataframe by returning each column as a vector, named with the ID-column. Caution: currently ignores dataframes without ID-column. If the node is not an ID-ed dataframe, it is returned as a vector with 'NA'-character as ID.
(2) routing reshaping and processing, finally running the character check
# A tibble: 22 x 4
# Rowwise:
parent content id has_invalid_char
<chr> <chr> <chr> <lgl>
1 foo.fer wdb NA FALSE
2 foo.resource.bodyPart leg AA FALSE
3 foo.resource.bodyPart arm BB FALSE
4 foo.resource.bodyPart knee CC FALSE
5 foo.resource.side LEFT AA FALSE
6 foo.resource.side RIGHT BB FALSE
7 foo.resource.side LEFT CC FALSE
8 foo.resource.device LLI AA FALSE
9 foo.resource.device LSM BB FALSE
10 foo.resource.device GHT CC FALSE
# i 12 more rows