
In R, iterate a function over a list of dataframes and store the output in a list, with each output element named after its input dataframe


Hi, I have several dataframes, each representing samples that received one kind of treatment, and I have combined them into a list. I want to test the k-means clustering method on each dataframe in the list.

Say I have 7 dataframes that I bind into a list. Here are 2 of them as sample data: https://drive.google.com/drive/folders/1B8JQY94Z-BHTZEKlV4dvUDocmiyppBDa?usp=sharing

Each dataframe has the same structure: many rows of samples and 107 columns of variables, but the 1st and 2nd columns are just data labels such as the actual drug treatment.

I want to perform Kmeans clustering on each of these dataframes, hoping to find representative samples from them for downstream processing.

So I create an output list called Kmeans.list to store the results. Am I correct to put this inside the loop? Specifically, mylist[[i]][,-c(1:2)] is meant to take the ith dataframe in the list and keep only the numeric variable columns, which are then passed through scale() for k-means clustering.

The reason I haven't successfully tested this is that I am also confused about the output. The kmeans() function returns a list, of which I am particularly interested in the "centers" element. I really just want to store each set of centers in a list so I can iterate over them downstream. Is this possible, or do I have to store the full kmeans output in this list and then somehow extract the centers and bind them?

Either way, I need to store each kmeans output under a unique name so I can tell them apart. How do I make sure each element in the output list is named after its input dataframe? For example, names(Kmeans.list)[1] should be df1, and so on.

mylist <- list(df1,df2,df3...)

#kmeans this in a loop
#store output in a list
Kmeans.list <- list()

for (i in length(mylist)) {

  Kmeans.list[i] <- kmeans(scale(mylist[[i]][,-c(1:2)]),centers =15,nstart=50,iter.max = 100)

}

Solution

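  • Use the tidyverse

    If I understood you correctly, you can map over the list with purrr, pull out the centers, and then use FNN to find the samples nearest to each center: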

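    library(FNN)        # get.knnx() for the nearest-neighbour lookup
    library(tidyverse)  # purrr::map(), map2(), set_names() and %>%

    # Fit k-means on the scaled numeric columns of every dataframe,
    # then name each result after its input (two sample dataframes here).
    Kmeans.list <- map(.x = mylist,
                       .f = ~kmeans(scale(.x[,-c(1:2)]),
                                    centers = 15,
                                    nstart = 50,
                                    iter.max = 100)) %>%
                        purrr::set_names(c("df1", "df2"))

    # Keep only the cluster centers from each fit; names carry over
    Kmeans_centers <- map(Kmeans.list, ~.x$centers)

    # For each dataframe, find the 500 samples nearest to each center
    n500 <- map2(
      .x = mylist,
      .y = Kmeans_centers,
      .f = ~ get.knnx(data = scale(.x[, -c(1:2)]), query = .y, k = 500)) %>%
      purrr::set_names(c("df1", "df2"))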
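
For comparison, here is a rough base R sketch of the same idea that also shows why the loop in the question does not behave as intended: `for (i in length(mylist))` runs only once (for the last index), and `Kmeans.list[i] <- kmeans(...)` tries to put a whole list into a single slot, where `[[i]]` is needed. The data below are synthetic stand-ins, and `make_df`, the 5 numeric columns, and `centers = 3` are assumptions made only to keep the example small and runnable, not values from the question.

```r
# Synthetic stand-ins for the real dataframes (assumption: two label
# columns followed by numeric variables, as described in the question).
set.seed(1)
make_df <- function(n) {
  data.frame(
    id        = seq_len(n),
    treatment = sample(c("drugA", "drugB"), n, replace = TRUE),
    matrix(rnorm(n * 5), nrow = n)  # 5 numeric columns instead of 105
  )
}
df1 <- make_df(60)
df2 <- make_df(80)

# Naming the list up front lets the names carry through to the output.
mylist <- list(df1 = df1, df2 = df2)

# Loop version: pre-allocate, copy the names, iterate over every index
# with seq_along(), and assign with [[i]] so the whole fit is stored.
Kmeans.list <- vector("list", length(mylist))
names(Kmeans.list) <- names(mylist)
for (i in seq_along(mylist)) {
  Kmeans.list[[i]] <- kmeans(scale(mylist[[i]][, -c(1:2)]),
                             centers = 3, nstart = 10, iter.max = 100)
}

# Keep only the centers; lapply() preserves the names of its input list.
Kmeans_centers <- lapply(Kmeans.list, function(fit) fit$centers)

names(Kmeans_centers)    # "df1" "df2"
dim(Kmeans_centers$df1)  # 3 5 (clusters by variables)
```

If the full kmeans output is not needed downstream, the loop can be skipped and `$centers` taken directly inside a single `lapply()` over `mylist`.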