I have several data frames, each containing a list of gene names without a header. Each file looks roughly like this:
Table 1
SCA-6_Chr1v1_00001
SCA-6_Chr1v1_00002
SCA-6_Chr1v1_00003
SCA-6_Chr1v1_00004
SCA-6_Chr1v1_00005
SCA-6_Chr1v1_00006
SCA-6_Chr1v1_00009
SCA-6_Chr1v1_00010
SCA-6_Chr1v1_00014
SCA-6_Chr1v1_00015
SCA-6_Chr1v1_00017
Table 2
SCA-6_Chr1v1_00001
SCA-6_Chr1v1_00002
SCA-6_Chr1v1_00003
SCA-6_Chr1v1_00007
SCA-6_Chr1v1_20005
SCA-6_Chr1v1_00006
SCA-6_Chr1v1_00009
SCA-6_Chr1v1_00200
SCA-6_Chr1v1_00014
SCA-6_Chr1v1_10075
SCA-6_Chr1v1_00100
Each of these data frames is written to a separate .txt file, and I have read them all into one list like so:
temp = list.files(pattern = "*.txt")
myfiles = lapply(temp, FUN=read.table, header=FALSE)
With the myfiles list, I want to compare all of the data frames against each other and, for each file, find the values that appear in that file and in no other item of the list. The result should be a new list in which each element holds only the values not found in any other element (I assume I can do this with an lapply function). I have tried running the following code, but it is not dropping the shared values:
unique.genes = lapply(1:length(myfiles), function(n) setdiff(myfiles[[n]], unlist(myfiles[-n])))
Any help would be greatly appreciated.
Here is a way.
Read the files with scan. This will create vectors, not data.frames; data.frames have a much slower access time.
The lapply/setdiff call will then keep the unique values in each vector.
set.seed(2022)
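# Simulate 10 files, each holding up to 10 distinct single-character IDs
myfiles <- replicate(10, unique(sample(c(LETTERS, 0:9, letters), 10, replace = TRUE)), simplify = FALSE)
# Write each vector to its own header-less text file (test01.txt, test02.txt, ...)
l <- lapply(seq_along(myfiles), \(i) {
  write.table(myfiles[[i]],
              sprintf("test%02d.txt", i),
              row.names = FALSE,
              col.names = FALSE,
              quote = FALSE)
})
rm(l)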
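temp <- list.files(pattern = "*.txt")
# Read each file as a one-column data.frame (as in the question) ...
myfiles <- lapply(temp, FUN = read.table, header = FALSE)
# ... and as a plain character vector
myfiles2 <- lapply(temp, FUN = scan, what = character())
# For each file, keep the values not present in any other file.
# With read.table, [[1]] extracts the single column so that setdiff compares vectors.
unique.genes <- lapply(1:length(myfiles), function(n) setdiff(myfiles[[n]][[1]], unlist(myfiles[-n])))
unique.genes2 <- lapply(1:length(myfiles2), function(n) setdiff(myfiles2[[n]], unlist(myfiles2[-n])))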
identical(unique.genes, unique.genes2)
#> [1] TRUE
library(microbenchmark)
mb <- microbenchmark(
  read.table = lapply(1:length(myfiles), function(n) setdiff(myfiles[[n]][[1]], unlist(myfiles[-n]))),
  scan = lapply(1:length(myfiles2), function(n) setdiff(myfiles2[[n]], unlist(myfiles2[-n])))
)
print(mb, order = "median", unit = "relative")
#> Unit: relative
#>        expr      min       lq     mean median       uq      max neval cld
#>        scan 1.000000 1.000000 1.000000  1.000 1.000000 1.000000   100  a
#>  read.table 3.048491 2.921598 2.511883  2.945 2.750842 1.002187   100   b
unlink(temp)
Created on 2022-07-28 by the reprex package (v2.0.1)
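To see the same lapply/setdiff idea on the data from the question, here is a minimal sketch that skips the file reading and starts from character vectors, which is what scan would return. The names table1, table2, gene_sets and unique_genes are made up for this illustration; the gene IDs are copied from Table 1 and Table 2 above.

```r
# Gene IDs copied from Table 1 and Table 2 of the question
# (the object names here are hypothetical, chosen only for this sketch)
table1 <- c("SCA-6_Chr1v1_00001", "SCA-6_Chr1v1_00002", "SCA-6_Chr1v1_00003",
            "SCA-6_Chr1v1_00004", "SCA-6_Chr1v1_00005", "SCA-6_Chr1v1_00006",
            "SCA-6_Chr1v1_00009", "SCA-6_Chr1v1_00010", "SCA-6_Chr1v1_00014",
            "SCA-6_Chr1v1_00015", "SCA-6_Chr1v1_00017")
table2 <- c("SCA-6_Chr1v1_00001", "SCA-6_Chr1v1_00002", "SCA-6_Chr1v1_00003",
            "SCA-6_Chr1v1_00007", "SCA-6_Chr1v1_20005", "SCA-6_Chr1v1_00006",
            "SCA-6_Chr1v1_00009", "SCA-6_Chr1v1_00200", "SCA-6_Chr1v1_00014",
            "SCA-6_Chr1v1_10075", "SCA-6_Chr1v1_00100")
gene_sets <- list(table1 = table1, table2 = table2)

# For each vector, drop every value that also occurs in any other vector
unique_genes <- lapply(seq_along(gene_sets), function(n) {
  setdiff(gene_sets[[n]], unlist(gene_sets[-n], use.names = FALSE))
})
names(unique_genes) <- names(gene_sets)

print(unique_genes)
```

With these two tables, the first element should keep the five IDs ending in 00004, 00005, 00010, 00015 and 00017, and the second the five ending in 00007, 20005, 00200, 10075 and 00100. The same call should work unchanged on a longer list, since unlist(gene_sets[-n]) pools the values of all the other elements.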