R: iterate a function over the matching elements of two lists


I'm trying to perform k-means clustering on each element (a data frame) of a list. From the output of each k-means run, I took the "centers" that match each data frame and collected all the centers in another list.

Next, I want to use the function get.knnx() so that, for each set of centers generated by the k-means clustering, I can go back to the original data frame and sample the 500 data points closest to each centre, to achieve a good subsampling of the data. (I did not use the cluster membership assigned by kmeans because the data used for the k-means step is only a subsample of the original dataset, taken for training.)

Each dataframe has the same structure: many rows of samples and 107 columns of variables, but the 1st and 2nd columns are just data labels such as the actual drug treatment.

Here is a link to 2 sample data sets: https://drive.google.com/drive/folders/1B8JQY94Z-BHTZEKlV4dvUDocmiyppBDa?usp=sharing

library(tidyverse)
library(purrr)   # already attached by tidyverse; note the spelling "purrr"
library(FNN)     # provides get.knnx()
#take data into list
mylist <- list(df1,df2,df3...)

#perform Kmeans cluster
#scale datainput and drop the data label column
Kmeans.list <- map(.x = mylist,
               .f = ~kmeans(scale(.x[,-c(1:2)]),
                            centers =15,
                            nstart=50,
                            iter.max = 100)) %>% 
                purrr::set_names(c("df1", "df2"))

#Isolate the Centers info to another list
Kmeans_centers <- map(Kmeans.list, ~.x$centers)

#trying to use map2
y <- map2(.x = mylist,.y=Kmeans_centers,
     .f=~get.knnx(scale(.x[,-c(1:2)],.y, 500)))

Thanks to help from people on Stack Overflow, I managed to get the kmeans step working and obtain the list of centers. Now I want to apply the same logic with map2().

The error I get from map2 is "Error in scale.default(.x[, -c(1:2)], .y, 500) : length of 'center' must equal the number of columns of 'x'"

However, both lists have 7 elements, so I don't quite know what went wrong.

An additional question concerns the ~ in the .f = argument. I read that if I pass a function, I don't need to add ~. In this case, however, if I remove the ~, the error says x not found. Why is ~ needed here, and should I always put ~ in front of the function I pass to map()?


Solution

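  • You should apply the scale() function only to the data frame. In your call the closing parenthesis of scale() is misplaced, so .y and 500 are passed to scale() as its center and scale arguments instead of to get.knnx(). That is what triggers the error about the length of 'center'; the lengths of the two lists are not the problem.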

    library(purrr)
    library(FNN)
    
    map2(.x = mylist,.y=Kmeans_centers, .f=~get.knnx(scale(.x[,-c(1:2)]),.y, 500))
    

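    ~ is purrr's formula shorthand for an anonymous function, in which the first argument is referred to as .x and the second as .y. Without the ~, the expression is evaluated immediately instead of being turned into a function, so .x does not exist yet. You only need ~ when you write such an expression; a named function (for example .f = scale) can be passed as-is. The formula is an alternative to an explicit anonymous function, which can be written as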

    map2(.x = mylist,.y=Kmeans_centers, function(a, b) get.knnx(scale(a[,-c(1:2)]),b, 500))
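
To see the pieces of this answer working together, here is a minimal sketch on made-up data. The helper make_df(), the names toy_list, toy_centers and closest_rows, and the small values of centers and k are all assumptions chosen for illustration; they are not part of the original question. It first reproduces the scale() error in isolation, then runs the corrected call in both the formula and the anonymous-function form, and finally uses nn.index to pull rows from the original data frames.

```r
library(purrr)
library(FNN)

set.seed(42)

# Hypothetical stand-in for the real data: two label columns, then numeric variables
make_df <- function(n) {
  data.frame(
    id        = seq_len(n),
    treatment = sample(c("drugA", "drugB"), n, replace = TRUE),
    v1 = rnorm(n), v2 = rnorm(n), v3 = rnorm(n)
  )
}
toy_list <- list(df1 = make_df(200), df2 = make_df(300))

# k-means on the scaled numeric columns; keep only the matrix of centers
toy_centers <- map(toy_list,
                   ~ kmeans(scale(.x[, -c(1:2)]), centers = 3, nstart = 5)$centers)

# Misplaced parenthesis: the centers and k become scale()'s `center` and `scale` arguments
err <- tryCatch(
  scale(toy_list$df1[, -c(1:2)], toy_centers$df1, 10),
  error = function(e) e
)
inherits(err, "error")  # checks that scale() itself raised an error

# Formula shorthand: .x is one data frame, .y is its matrix of centers
nn_formula <- map2(toy_list, toy_centers,
                   ~ get.knnx(scale(.x[, -c(1:2)]), .y, k = 10))

# The same call written as an explicit anonymous function
nn_function <- map2(toy_list, toy_centers,
                    function(a, b) get.knnx(scale(a[, -c(1:2)]), b, k = 10))

identical(nn_formula, nn_function)

# nn.index has one row per center and one column per neighbour;
# use it to pull the matching rows from the original data frame
closest_rows <- map2(toy_list, nn_formula,
                     ~ .x[unique(as.vector(.y$nn.index)), ])
```

Because a point can be among the nearest neighbours of more than one center, unique() is used here so that no row is selected twice; whether that is the desired behaviour for the real subsampling is a design choice left to the reader.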