Search code examples
rreplacelevels

Ignore specific levels when performing lapply in R


I have a data frame (500 obs of 40000 variables) in R where all columns consist of one or two letters interspersed with '1' and '3'. E.g., mydata[45:50,20:25]

45             C             A             3             T             C             C
46             C             G             T             C             C             A
47             C             A             G             T             C             C
48             1             A             T             3             C             3
49             C             A             G             T             C             C
50             T             A             T             C             C             A

I wish to replace the letters only not the numbers. My goal is for the letters to be replaced with '0' or '2' depending on their frequency. The most frequent letter therefore becoming '0' and the least frequent becoming '2'. If there is only one letter, that would become '0'.

I can achieve this without ignoring the interspersed '1' and '3' using:

data.frame(lapply(mydata[45:50,20:25], function(x){as.numeric(factor(x, levels = names(sort(-table(x)))))}))

which yields:

1             1             1             3             1             1             1
2             1             2             1             2             1             2
3             1             1             2             1             1             1
4             2             1             1             3             1             3
5             1             1             2             1             1             1
6             3             1             1             2             1             2

However, I would like to be able to do that while ignoring '1' and '3' in the original data frame.

Any help appreciated. Thank you.


Solution

  • I would work with a matrix here.

    Using grep we make a table of frequencies which we rank on their negative values and subtract one to get zero. Since I'm not sure what you want in case of ties I chose "first" to get an integer (see ?rank for options).

    Then we match the letters on the frequencies. Finally we convert back to data frame using type.convert to get numeric formats.

    m <- as.matrix(d)
    
    ftb <- table(grep("[\\p{Lu}]", m, perl=TRUE, value=TRUE))
    ftb <- rank(-ftb, ties.method="first") - 1
    
    m.res <- apply(m, 1:2, function(x) ifelse(x %in% names(ftb), ftb[match(x, names(ftb))], x))
    d.res <- type.convert(as.data.frame(m.res))
    d.res
    #   V1 V2 V3 V4 V5 V6 V7
    # 1 45  0  1  3  2  0  0
    # 2 46  0  3  2  0  0  1
    # 3 47  0  1  3  2  0  0
    # 4 48  1  1  2  3  0  3
    # 5 49  0  1  3  2  0  0
    # 6 50  2  1  2  0  0  1
    

    Edit

    Since you want to look into the column frequencies, we may use the approach in an lapply (without matrix conversion). We can multiply the rank then by a factor 2.

    f <- 2
    d[-1] <- lapply(d[-1], function(x) {
      ftb <- (rank(-table(grep("[\\p{Lu}]", x, perl=TRUE, value=TRUE)),
                  ties.method="first") - 1)*f
      stopifnot(length(ftb) <= 2)
      x <- ifelse(x %in% names(ftb), ftb[match(x, names(ftb))], x)
      as.numeric(x)
    })
    d
    #   V1 V2 V3 V4 V5 V6 V7
    # 1 45  0  0  3  0  0  0
    # 2 46  0  2  0  2  0  2
    # 3 47  0  0  2  0  0  0
    # 4 48  1  0  0  3  0  3
    # 5 49  0  0  2  0  0  0
    # 6 50  2  0  0  2  0  2
    

    Data:

    d <- structure(list(V1 = 45:50, V2 = c("C", "C", "C", "1", "C", "T"
    ), V3 = c("A", "G", "A", "A", "A", "A"), V4 = c("3", "T", "G", 
    "T", "G", "T"), V5 = c("T", "C", "T", "3", "T", "C"), V6 = c("C", 
    "C", "C", "C", "C", "C"), V7 = c("C", "A", "C", "3", "C", "A"
    )), class = "data.frame", row.names = c(NA, -6L))