
Most efficient way to find common values for one column across many data frames


I have many files and I am trying to find the most efficient way of reading the data frames and finding common values in one column.

So far, I read the files into a list:

files = c("test1.txt", "test2.txt", "test3.txt")
my.data <- lapply(files, read.table, header=T)

Each data frame contains columns like these:

df1 = data.frame(id=c("a", "b", "c"), v = c(1:3), c=c(10:12))
df2 = data.frame(id=c("x", "b", "c"), v = c(2:4), c=c(13:15))
df3 = data.frame(id=c("a", "n", "c"), v = c(4:6), c=c(16:18))

my.data = list(df1, df2, df3)

Now I am trying to subset the list of data frames so that each data frame keeps only the rows whose value in the first column, "id", is common to all of them, e.g.

here df1, df2, and df3 would become a list of data frames keeping only the "id" values shared by all files, i.e. only the rows with "c" in this case:
intersect(intersect(df1$id, df2$id), df3$id);
list(df1[3,], df2[3,], df3[3,])

But I can't figure out how to do this across a list of data frames. Maybe this is a longer/more difficult process than reading all the files, merging them first on the common column "id", and then splitting the result back into a list of data frames (a sketch of that alternative is below)? Does anybody have any insight into the most efficient way? Thank you!
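For reference, the merge-based alternative I had in mind looks roughly like this (assuming every data frame has an "id" column; I am not sure it is any more efficient):

# merge() performs an inner join by default, so only shared ids survive
merged <- Reduce(function(x, y) merge(x, y, by = "id"), my.data)
common_ids <- merged$id
lapply(my.data, function(x) x[x$id %in% common_ids, ])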


Solution

  • To find the intersection of the id columns across all list elements, you can use

    common <- Reduce(intersect, Map("[[", my.data, "id"))
    

    Then we can use that to subset the list elements.

    lapply(my.data, function(x) x[x$id %in% common, ])
    # [[1]]
    #   id v  c
    # 3  c 3 12
    #
    # [[2]]
    #   id v  c
    # 3  c 4 15
    #
    # [[3]]
    #   id v  c
    # 3  c 6 18
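
    If you need to do this for several lists, both steps can be wrapped in a small helper function (the name filter_common_ids below is just illustrative):

    filter_common_ids <- function(dfs, col = "id") {
      # intersection of the chosen column across all data frames
      common <- Reduce(intersect, lapply(dfs, "[[", col))
      # keep only the rows whose value in that column is shared by all
      lapply(dfs, function(x) x[x[[col]] %in% common, ])
    }

    filter_common_ids(my.data)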