I have a data.table with 168 variables and 8,278 observations. Variables 69:135 are originally stored as strings. They should become region dummies, and I want to end up with levels 2 (= yes, company operates here) and 1 (= no, company does not operate here). The issue is that there are three different combinations of input in the original variables: 1) "TRUE", "1", "0", "FALSE", 2) "TRUE", "FALSE", and 3) "1", "0". Moreover, approx. 5 variables have only one value, either "0" or "1". An example is given here:
#generating replicable data
library(data.table)
dt <- data.table(
  region1 = c("TRUE", "FALSE", "0", "1", NA),
  region2 = c("1", "1", "0", NA, NA),
  region3 = c(NA, "FALSE", "TRUE", NA, "FALSE"),
  region4 = c(NA, "0", "0", NA, "0")
)
#this gives:
#   region1 region2 region3 region4
#1:    TRUE       1    <NA>    <NA>
#2:   FALSE       1   FALSE       0
#3:       0       0    TRUE       0
#4:       1    <NA>    <NA>    <NA>
#5:    <NA>    <NA>   FALSE       0
I'm searching for a way to replace "TRUE" and "1" by 2 and "FALSE" and "0" by 1 for all the variables at once. So the desired result is:
# region1 region2 region3 region4
#1: 2 2 NA NA
#2: 1 2 1 1
#3: 1 1 2 1
#4: 2 NA NA NA
#5: NA NA 1 1
I've already had a look at "Apply factor levels to multiple columns with missing factor levels" and "Change level of multiple factor variables". However, neither of them solves my problem.
I tried the following, combining fct_collapse() with a nested ifelse():
check <- cbind(dt[1:68], as.data.table(apply(dt[69:135], 2, function(x) {
  ifelse("1" %in% x & "TRUE" %in% x,
         fct_collapse(x,
                      "2" = c("TRUE", "1"),
                      "1" = c("FALSE", "0")),
         ifelse("1" %in% x & !("TRUE" %in% x),
                fct_collapse(x,
                             "2" = "1",
                             "1" = "0"),
                fct_collapse(x,
                             "2" = "TRUE",
                             "1" = "FALSE")))
})), dt[136:168])
But this code does not give me the desired result. It runs, but I get warning messages, and when I check the respective variables they are still stored as strings with their original values.
# examples of warnings
1: Unknown levels in `f`: TRUE, FALSE
2: Unknown levels in `f`: TRUE, FALSE
3: Unknown levels in `f`: TRUE, FALSE
4: Unknown levels in `f`: 0
5: Unknown levels in `f`: TRUE, FALSE
6: Unknown levels in `f`: TRUE, FALSE
7: Unknown levels in `f`: 0
On their own, i.e. when not combined with fct_collapse, the nested ifelse() commands do the job:
#the ifelse statement works
ifelse("TRUE" %in% dt$region1, 2, "FALSE")
ifelse(5 %in% dt$region1, 2, "FALSE")
#also the nested ifelse statement works
ifelse("1" %in% dt$region1 & "TRUE" %in% dt$region1,
ifelse("1" %in% dt$region1 & !("TRUE" %in% dt$region1),
ifelse("1" %in% dt$region2 & "TRUE" %in% dt$region2,
ifelse("1" %in% dt$region2 & !("TRUE" %in% dt$region2),
Does anyone have an idea how to solve this?
Thanks so much for any advice in advance!
Here is a way with set() called in a for loop:
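f <- function(x){
  x <- as.character(x)
  i1 <- x %in% c("TRUE", "1")    # "yes" codes
  i0 <- x %in% c("FALSE", "0")   # "no" codes
  x[which(i1)] <- "2"
  x[which(i0)] <- "1"
  as.integer(x)   # numeric return is inferred from the printed NA (not <NA>) below
}

# replace every column by reference
for (j in seq_along(dt)) set(dt, j = j, value = f(dt[[j]]))
dt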
# region1 region2 region3 region4
#1: 2 2 NA NA
#2: 1 2 1 1
#3: 1 1 2 1
#4: 2 NA NA NA
#5: NA NA 1 1
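In the full table only variables 69:135 should be recoded, so the loop can be restricted to those positions. The following is a minimal sketch under some assumptions: the `id` column, the `cols` vector, and the helper name `recode_region` are hypothetical and only stand in for the real data; the helper follows the same recoding idea as `f` and returns integers.

```r
library(data.table)

# Toy data: `id` stands in for the columns that must stay untouched (hypothetical)
dt <- data.table(
  id      = 1:5,
  region1 = c("TRUE", "FALSE", "0", "1", NA),
  region2 = c("1", "1", "0", NA, NA)
)

# Same recoding idea as f(); returning integers is an assumption
recode_region <- function(x) {
  x <- as.character(x)
  out <- rep(NA_integer_, length(x))   # everything not matched below stays NA
  out[x %in% c("TRUE", "1")]  <- 2L
  out[x %in% c("FALSE", "0")] <- 1L
  out
}

# Positions of the columns to recode (hypothetical; presumably 69:135 in the full table)
cols <- 2:3

# set() replaces each selected column by reference; other columns are not touched
for (j in cols) set(dt, j = j, value = recode_region(dt[[j]]))
```

Because `set()` works by reference, there is no need to split the table into three parts and `cbind()` them back together.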
Thanks to jangorecki's comment a much simpler way is
dt[, names(dt) := lapply(dt, f)]
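If actual factor levels 1 and 2 are wanted, and only some columns should be changed, the same one-liner can be combined with `.SDcols`. This is only a sketch under assumptions: `to_dummy` and `cols` are hypothetical names, and fixing the levels to `c("1", "2")` is one possible way to handle variables that contain only "0" or only "1".

```r
library(data.table)

dt <- data.table(
  region3 = c(NA, "FALSE", "TRUE", NA, "FALSE"),
  region4 = c(NA, "0", "0", NA, "0")   # only one distinct value besides NA
)

# Hypothetical helper: map the strings to a factor with fixed levels "1" and "2"
to_dummy <- function(x) {
  code <- ifelse(x %in% c("TRUE", "1"), "2",
          ifelse(x %in% c("FALSE", "0"), "1", NA_character_))
  # fixed levels keep both categories even if only one occurs in the column
  factor(code, levels = c("1", "2"))
}

# Columns to convert (hypothetical selection; names or positions both work)
cols <- c("region3", "region4")
dt[, (cols) := lapply(.SD, to_dummy), .SDcols = cols]
```

Here `ifelse()` is applied to the whole vector element by element, rather than to a single `%in%` test on the column as a whole, so every row gets its own recoded value.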