
Determine Differences between Items in a List


I have several data frames, each containing a list of gene names without a header. Each file looks roughly like this:

Table 1

SCA-6_Chr1v1_00001
SCA-6_Chr1v1_00002
SCA-6_Chr1v1_00003
SCA-6_Chr1v1_00004
SCA-6_Chr1v1_00005
SCA-6_Chr1v1_00006
SCA-6_Chr1v1_00009
SCA-6_Chr1v1_00010
SCA-6_Chr1v1_00014
SCA-6_Chr1v1_00015
SCA-6_Chr1v1_00017

Table 2

SCA-6_Chr1v1_00001
SCA-6_Chr1v1_00002
SCA-6_Chr1v1_00003
SCA-6_Chr1v1_00007
SCA-6_Chr1v1_20005
SCA-6_Chr1v1_00006
SCA-6_Chr1v1_00009
SCA-6_Chr1v1_00200
SCA-6_Chr1v1_00014
SCA-6_Chr1v1_10075
SCA-6_Chr1v1_00100

Each of these data frames is written to a separate .txt file, and I have read them all into one list like so:

temp = list.files(pattern = "*.txt")
myfiles = lapply(temp, FUN=read.table, header=FALSE)

With the myfiles list, I want to compare every data frame against all the others and find the values that occur only in that one file. The result should be a new list in which each element holds only the gene names not found in any other element (I assume I can do this with lapply). I have tried running the following code, but it does not drop the shared values:

unique.genes = lapply(1:length(myfiles), function(n) setdiff(myfiles[[n]], unlist(myfiles[-n])))

Any help would be greatly appreciated.


Solution

  • Here is a way.

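    • Start by reading in the data with scan. This creates character vectors rather than data.frames; data.frames have a slower access time (about 3 times slower in the benchmark below).
    • Then lapply/setdiff keeps the values unique to each vector. If you keep read.table, extract the single column with [[1]] first, so that setdiff receives a vector and not a data.frame.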
    # Create test data: 10 vectors of unique characters, written one per file.
    set.seed(2022)
    myfiles <- replicate(10, unique(sample(c(LETTERS, 0:9, letters), 10, replace = TRUE)), simplify = FALSE)
    for (i in seq_along(myfiles)) {
      write.table(myfiles[[i]],
                  sprintf("test%02d.txt", i),
                  row.names = FALSE,
                  col.names = FALSE,
                  quote = FALSE)
    }
    
    temp <- list.files(pattern = "\\.txt$")  # pattern is a regular expression, not a glob
    myfiles <- lapply(temp, FUN = read.table, header = FALSE)
    myfiles2 <- lapply(temp, FUN = scan, what = character())
    
    unique.genes <- lapply(1:length(myfiles), function(n) setdiff(myfiles[[n]][[1]], unlist(myfiles[-n])))
    unique.genes2 <- lapply(1:length(myfiles2), function(n) setdiff(myfiles2[[n]], unlist(myfiles2[-n])))
    
    identical(unique.genes, unique.genes2)
    #> [1] TRUE
    
    library(microbenchmark)
    mb <- microbenchmark(
      read.table = lapply(1:length(myfiles), function(n) setdiff(myfiles[[n]][[1]], unlist(myfiles[-n]))),
      scan = lapply(1:length(myfiles2), function(n) setdiff(myfiles2[[n]], unlist(myfiles2[-n])))
    )
    print(mb, order = "median", unit = "relative")
    #> Unit: relative
    #>        expr      min       lq     mean median       uq      max neval cld
    #>        scan 1.000000 1.000000 1.000000  1.000 1.000000 1.000000   100  a 
    #>  read.table 3.048491 2.921598 2.511883  2.945 2.750842 1.002187   100   b
    
    unlink(temp)
    

    Created on 2022-07-28 by the reprex package (v2.0.1)
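
As a quick check on the question's own data, the same lapply/setdiff idea can be applied to Table 1 and Table 2 directly. This is only a sketch: the two data frames are built in memory to mimic what read.table(header = FALSE) would return (a single column named V1), and the names prefix, table1, table2 and genes are introduced here for illustration. The step that matters is pulling out the column with [[1]], so that setdiff compares individual gene names rather than the data frame as a whole.

```r
# Gene lists copied from Table 1 and Table 2 in the question.
prefix <- "SCA-6_Chr1v1_"
table1 <- paste0(prefix, c("00001", "00002", "00003", "00004", "00005", "00006",
                           "00009", "00010", "00014", "00015", "00017"))
table2 <- paste0(prefix, c("00001", "00002", "00003", "00007", "20005", "00006",
                           "00009", "00200", "00014", "10075", "00100"))

# Assumption: mimic read.table(header = FALSE), i.e. one-column data frames.
myfiles <- list(
  data.frame(V1 = table1, stringsAsFactors = FALSE),
  data.frame(V1 = table2, stringsAsFactors = FALSE)
)

# Extract the single column so that setdiff() works on character vectors.
genes <- lapply(myfiles, `[[`, 1)

# For each vector, keep the values that appear in none of the others.
unique.genes <- lapply(seq_along(genes), function(n) {
  setdiff(genes[[n]], unlist(genes[-n]))
})

print(unique.genes)
```

For these two tables, the first element should contain the names ending in 00004, 00005, 00010, 00015 and 00017, and the second those ending in 00007, 20005, 00200, 10075 and 00100. The six names shared by both tables are dropped.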