r  data-conversion  data-management

Custom data-dependent recoding to logicals in R


I have two data frames, data and meta. Some, but not all, columns in data are logical values, but they are coded in many different ways. The rows in meta describe the columns in data, indicate whether they are to be interpreted as logicals, and if so, what single value codes TRUE and what single value codes FALSE.

I need a procedure that replaces all data values in conceptually logical columns with the appropriate logical values from the codes in the corresponding meta row. Any data values in a conceptually logical column that do not match a value in the corresponding meta row should become NA.

Small toy example for meta:

name                 type     false  true
-----------------------------------------
a.char.var           char     NA     NA
a.logical.var        logical  NA     7
another.logical.var  logical  1      0
another.char.var     char     NA     NA

Small toy example for data:

a.char.var  a.logical.var  another.logical.var  another.char.var
----------------------------------------------------------------
aa          7              0                    ba
ab          NA             1                    bb
ac          7              NA                   bc
ad          4              3                    bd

Small toy example output:

a.char.var  a.logical.var  another.logical.var  another.char.var
----------------------------------------------------------------
aa          TRUE           TRUE                 ba
ab          FALSE          FALSE                bb
ac          TRUE           NA                   bc
ad          NA             NA                   bd

I cannot, for the life of me, find a way to do this in idiomatic R that handles all the corner cases. The data sets are large, so an idiomatic solution would be ideal if possible. I inherited this absolutely insane data management mess and will be grateful to anybody who can help fix it. I am by no means an R guru, but this seems like a deceptively difficult problem.


Solution

  • First we set up the data

    meta <- data.frame(name=c('a.char.var', 'a.logical.var', 'another.logical.var', 'another.char.var'),
                       type=c('char', 'logical', 'logical', 'char'),
                       false=c(NA, NA, 1, NA),
                       true=c(NA, 7, 0, NA), stringsAsFactors = FALSE)

    data <- data.frame(a.char.var=c('aa', 'ab', 'ac', 'ad'),
                       a.logical.var=c(7, NA, 7, 4),
                       another.logical.var=c(0, 1, NA, 3),
                       another.char.var=c('ba', 'bb', 'bc', 'bd'), stringsAsFactors = FALSE)
    

    Then we keep only the rows of meta that describe logical columns. We iterate over these rows, using the name column to select the relevant column in data, and change the values in data_out from an initial NA to TRUE or FALSE according to the matching values in data.

    Note that data[,logical_meta$name[1]] is equivalent to data[,'a.logical.var'] or data$a.logical.var, provided logical_meta$name is a character vector. If it is a factor (e.g. if we didn't specify stringsAsFactors = FALSE on an R version before 4.0.0, where TRUE was the default), we need to convert it to character, at which point we might as well give it a name: colname below.
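The equivalence above can be checked on a small example. This is only a sketch: the one-column data frame `df` and the variables `name_chr` and `name_fac` are made up for illustration and are not part of the answer's code.

```r
# Toy data frame and names are illustrative only.
df <- data.frame(a.logical.var = c(7, NA, 7, 4))

name_chr <- "a.logical.var"
name_fac <- factor("a.logical.var", levels = c("another.var", "a.logical.var"))

# A character name selects the same column as `$`.
same_column <- identical(df[, name_chr], df$a.logical.var)

# A factor stores integer codes internally, so convert it to character
# before using it as a column name.
fac_code <- as.integer(name_fac)
colname <- as.character(name_fac)
same_after_conversion <- identical(df[, colname], df$a.logical.var)
```

Here `fac_code` is the factor's internal position in its levels, not the column name, which is why the loop calls as.character before indexing.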

    Because the data contain NAs, which is useful here: c(0, 1, NA, 3) == 0 returns TRUE, FALSE, NA, FALSE, but which drops the NA and returns only position 1. Subsetting by a logical vector that includes NAs yields NA rows or columns; wrapping the comparison in which avoids this.
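A minimal illustration of that difference, using the same vector as in the paragraph above (the variable names are chosen here for the example):

```r
x <- c(0, 1, NA, 3)

cmp <- x == 0          # TRUE FALSE NA FALSE: the comparison propagates NA
pos <- which(cmp)      # only the positions that are definitely TRUE

by_logical <- x[cmp]   # the NA in the index produces an NA element
by_which   <- x[pos]   # just the matching element
```

`by_logical` has two elements, the second of which is NA, while `by_which` contains only the matching 0.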

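    # Keep only the meta rows that describe logical columns
    logical_meta <- meta[meta$type == 'logical', ]

    data_out <- data  # initialize the output as a copy of the input

    # seq_len() gives an empty loop if there are no logical columns
    for (i in seq_len(nrow(logical_meta))) {
      colname <- as.character(logical_meta$name[i])  # as.character only matters if name is a factor
      data_out[, colname] <- NA  # values matching neither code stay NA
      # false code first; a code of NA means missing values are FALSE
      if (is.na(logical_meta$false[i])) {
        data_out[is.na(data[, colname]), colname] <- FALSE
      } else {
        data_out[which(data[, colname] == logical_meta$false[i]),
                 colname] <- FALSE
      }
      # true code next, handled the same way
      if (is.na(logical_meta$true[i])) {
        data_out[is.na(data[, colname]), colname] <- TRUE
      } else {
        data_out[which(data[, colname] == logical_meta$true[i]),
                 colname] <- TRUE
      }
    }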
    
    data_out
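Since the question mentions large data sets, the same recoding could also be written without the explicit if/else branches. The following is a sketch of an alternative, not the method above: `recode_logical` is a hypothetical helper name, and it relies on %in% treating NA as matching NA, so a code of NA needs no special case. It should behave like the loop on the toy data, including applying the TRUE code last.

```r
meta <- data.frame(
  name  = c("a.char.var", "a.logical.var", "another.logical.var", "another.char.var"),
  type  = c("char", "logical", "logical", "char"),
  false = c(NA, NA, 1, NA),
  true  = c(NA, 7, 0, NA),
  stringsAsFactors = FALSE
)

data <- data.frame(
  a.char.var          = c("aa", "ab", "ac", "ad"),
  a.logical.var       = c(7, NA, 7, 4),
  another.logical.var = c(0, 1, NA, 3),
  another.char.var    = c("ba", "bb", "bc", "bd"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: recode one vector given its TRUE and FALSE codes.
# A code of NA means that missing values in x carry that meaning.
recode_logical <- function(x, true_code, false_code) {
  out <- rep(NA, length(x))        # anything unmatched stays NA
  out[x %in% false_code] <- FALSE  # %in% never returns NA, and NA %in% NA is TRUE
  out[x %in% true_code]  <- TRUE
  out
}

logical_meta <- meta[meta$type == "logical", ]
cols <- as.character(logical_meta$name)

data_out <- data
# Map() walks the selected columns and their codes in parallel.
data_out[cols] <- Map(recode_logical,
                      data[cols],
                      logical_meta$true,
                      logical_meta$false)

data_out
```

Each column is processed in one vectorized pass, so the only loop left is over the logical columns themselves.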