
Combine, Order, Dedup over Multiple Files in R


I have a large number of CSV files that look like this:

var val1 val2
a 2 1
b 2 2
c 3 3
d 9 2
e 1 1

I would like to:

  1. Read them in
  2. Take the top 3 from each CSV
  3. Make a list of the variable names only (3 x number of files)
  4. Keep only the unique names on the list

I think I have managed to get to point 3 by doing this:

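## pattern is a regular expression, not a glob, so match the extension explicitly
csvList <- list.files(path = "mypath", pattern = "\\.csv$", full.names = TRUE)

## read every file, then keep the 3 rows with the largest val1
bla <- lapply(lapply(csvList, read.csv), function(x) x[order(x$val1, decreasing = TRUE)[1:3], ])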

lapply(bla,"[", , 1, drop=FALSE)

Now, I have a list of the top 3 variables in each CSV. However, I don't know how to convert this list to a string and keep only the unique values.

Any help is welcome.

Thank you!


Solution

  • The issue is in extracting the first column of each element of bla with drop=FALSE. This keeps each result as a one-column data frame (with row names) instead of simplifying it to its lowest dimension, which is a vector. Use drop=TRUE instead, then apply unlist followed by unique, as @Frank suggests:

    unique(unlist(lapply(bla,"[", , 1, drop=TRUE)))
    

    drop=TRUE is the default when a single column is selected this way, so you don't even have to include it.
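
To see the difference between the two settings, here is a minimal sketch that runs without any files. The data frames df1 and df2 and the helper top3 are made up for illustration and stand in for the result of read.csv; only the column names come from the question.

```r
## Two small data frames standing in for the CSV contents (made-up values)
df1 <- data.frame(var  = c("a", "b", "c", "d", "e"),
                  val1 = c(4, 2, 3, 9, 1),
                  val2 = c(1, 2, 3, 2, 1),
                  stringsAsFactors = FALSE)
df2 <- data.frame(var  = c("a", "d", "f", "g", "h"),
                  val1 = c(8, 7, 1, 5, 2),
                  val2 = c(1, 1, 1, 1, 1),
                  stringsAsFactors = FALSE)

## Hypothetical helper: the 3 rows with the largest val1, as in the question
top3 <- function(x) x[order(x$val1, decreasing = TRUE)[1:3], ]
bla <- lapply(list(df1, df2), top3)

## drop = FALSE keeps each element as a one-column data frame
kept <- lapply(bla, "[", , 1, drop = FALSE)

## the default (drop = TRUE) returns a plain character vector per element
dropped <- lapply(bla, "[", , 1)

## flatten the list and keep each name once, in order of first appearance
top_names <- unique(unlist(dropped))
print(top_names)
```

With these made-up values, str(kept[[1]]) reports a data frame while str(dropped[[1]]) reports a character vector, which is the form unlist and unique handle most directly.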


    Update for the new requirements in the comments:

    To keep the first two columns, var and val1, and drop rows with a duplicated var (keeping only the unique vars), do the following:

    ## unlist each column in turn and form a data frame
    res <- data.frame(lapply(c(1,2), function(x) unlist(lapply(bla,"[", , x))))
    colnames(res) <- c("var","val1")    ## restore the two column names
    ## remove duplicates
    res <- res[!duplicated(res[,1]),]
    

    Note that this keeps only the first row encountered for each var; later rows with the same var are dropped, whatever their value in the second column. That is what removing duplicates means here.
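
As an alternative sketch (a different technique from the column-by-column unlist above, not a correction of it), the list can be stacked with do.call(rbind, ...) before removing duplicates. The contents of bla below are made up so that the example runs on its own; it assumes every data frame in the list has the same column names.

```r
## Stand-in for `bla`: top-3 rows from two hypothetical files
bla <- list(
  data.frame(var = c("d", "a", "c"), val1 = c(9, 4, 3), val2 = c(2, 1, 3),
             stringsAsFactors = FALSE),
  data.frame(var = c("a", "d", "g"), val1 = c(8, 7, 5), val2 = c(1, 1, 1),
             stringsAsFactors = FALSE)
)

## stack the data frames row-wise and keep the first two columns by name
res <- do.call(rbind, bla)[, c("var", "val1")]

## duplicated() flags every repeat after the first occurrence of a var
res <- res[!duplicated(res$var), ]
rownames(res) <- NULL    ## tidy the row names after subsetting
print(res)
```

Here "a" and "d" each appear twice, and only their first rows (val1 of 4 and 9) survive. If the row with the largest val1 per var is wanted instead, sorting res by val1 in decreasing order before calling duplicated would be one way to get it.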

    Hope this helps.