Search code examples
rdataframeindexingdplyrrecode

Replace values in a dataset based off an index of values in another using base R


structure(list(ID = c(123, 5345, 234, 453, 3656, 345), diagnosis_1 = c("B657", 
"B658", "B659", "B660", "B661", "B662"), diagnosis_2 = c("F8827", 
"G432", NA, "B657", NA, "H8940"), diagnosis_3 = c(NA, "B657", 
NA, NA, NA, "G432"), diagnosis_4 = c(NA, NA, NA, NA, NA, "B657"
), diagnosis_5 = c(NA, NA, NA, NA, NA, NA), diagnosis_6 = c(NA, 
NA, NA, NA, NA, NA), diagnosis_7 = c(NA, NA, NA, NA, NA, NA), 
    diagnosis_8 = c(NA, NA, NA, NA, NA, NA), diagnosis_9 = c(NA, 
    NA, NA, NA, NA, NA), diagnosis_10 = c(NA, NA, NA, NA, NA, 
    NA), diagnosis_11 = c(NA, NA, NA, NA, NA, NA), diagnosis_12 = c(NA, 
    NA, NA, NA, NA, NA), diagnosis_13 = c(NA, NA, NA, NA, NA, 
    NA), age = c(54, 65, 23, 22, 33, 77)), row.names = c(NA, 
-6L), class = "data.frame")

I would like to replace the values in the diagnosis columns with the values from this index:

B657    1
B658    2
B659    3
B660    4
B661    5
B662    1
F8827   3
G432    3
H8940   4

In reality the table has thousands of rows and I deal with other tables with a variable number of diagnosis columns so a solution which is agnostic to the number of columns would be ideal. The index is also up to a few hundred entries long..

If the index table was divided like this:

1 B657, B662
2 B658
3 B659, F8827, G432 
4 B660 H8940    
5 B661

Would that make a difference to the way it is coded?

The desired output would look like this:

enter image description here

Many thanks


Solution

  • You can use match to change content using a lookup table.

    i <- startsWith(colnames(x), "diagnosis_")
    x[,i] <- y[match(unlist(x[,i]), y[,1]),2]
    x
    #    ID diagnosis_1 diagnosis_2 diagnosis_3 diagnosis_4 diagnosis_5 diagnosis_6 diagnosis_7 diagnosis_8 diagnosis_9 diagnosis_10 diagnosis_11 diagnosis_12 diagnosis_13 age
    #1  123           1           3          NA          NA          NA          NA          NA          NA          NA           NA           NA           NA           NA  54
    #2 5345           2           3           1          NA          NA          NA          NA          NA          NA           NA           NA           NA           NA  65
    #3  234           3          NA          NA          NA          NA          NA          NA          NA          NA           NA           NA           NA           NA  23
    #4  453           4           1          NA          NA          NA          NA          NA          NA          NA           NA           NA           NA           NA  22
    #5 3656           5          NA          NA          NA          NA          NA          NA          NA          NA           NA           NA           NA           NA  33
    #6  345           1           4           3           1          NA          NA          NA          NA          NA           NA           NA           NA           NA  77
    

    And in case the lookup has a the given different structure:

    zz <- strsplit(z, "[, ]+")
    zz <- setNames(rep(seq_along(zz), lengths(zz)), unlist(zz))
    i <- startsWith(colnames(x), "diagnosis_")
    x[,i] <- zz[unlist(x[,i])]
    

    In case codes are not found and you don't want to set them to NA.

    i <- startsWith(colnames(x), "diagnosis_")
    j <- match(unlist(x[,i]), y[,1])
    k <- !is.na(j)
    tt <- unlist(x[,i])
    tt[k] <- y[j[k],2]
    x[,i] <- tt
    rm(i, j, k, tt)
    

    Data:

    x <- structure(list(ID = c(123, 5345, 234, 453, 3656, 345), diagnosis_1 = c("B657", 
    "B658", "B659", "B660", "B661", "B662"), diagnosis_2 = c("F8827", 
    "G432", NA, "B657", NA, "H8940"), diagnosis_3 = c(NA, "B657", 
    NA, NA, NA, "G432"), diagnosis_4 = c(NA, NA, NA, NA, NA, "B657"
    ), diagnosis_5 = c(NA, NA, NA, NA, NA, NA), diagnosis_6 = c(NA, 
    NA, NA, NA, NA, NA), diagnosis_7 = c(NA, NA, NA, NA, NA, NA), 
        diagnosis_8 = c(NA, NA, NA, NA, NA, NA), diagnosis_9 = c(NA, 
        NA, NA, NA, NA, NA), diagnosis_10 = c(NA, NA, NA, NA, NA, 
        NA), diagnosis_11 = c(NA, NA, NA, NA, NA, NA), diagnosis_12 = c(NA, 
        NA, NA, NA, NA, NA), diagnosis_13 = c(NA, NA, NA, NA, NA, 
        NA), age = c(54, 65, 23, 22, 33, 77)), row.names = c(NA, 
                                                             -6L), class = "data.frame")
    y <- read.table(text="B657    1
    B658    2
    B659    3
    B660    4
    B661    5
    B662    1
    F8827   3
    G432    3
    H8940   4")
    z <- readLines(con=textConnection("B657, B662
    B658
    B659, F8827, G432
    B660 H8940
    B661"))