Putting skip in function that groups files together into nested list


I have a folder with 325 spreadsheets with election results for different election districts of Moscow. I am trying to group together files that belong to the same municipal district (higher level of aggregation) so I can aggregate election results at this level. (See end for dput output of what file names look like).

I have created a function that matches the files correctly by extracting the part of the string before the electoral district number:

library(stringr)

mf.vote.matcher <- function(file, filelist){

  #matches everything in the file name before the word "vote" (i.e. the mf name)
  match_string <- str_extract(file, pattern = ".*(?=vote)")
  matched_files <- grep(filelist, pattern = match_string)

  #listing: return the matched file names wrapped in a list
  list(filelist[matched_files])

}

However, when applied with lapply to the full file list, the function runs once for every file, so the resulting list contains many redundant elements. For example, the first municipal district has 3 election districts, so the output repeats the same 3 file names 3 times.

Is there some way to force the function or lapply to "skip" to the files for the next municipal district based on the length of the returned list?

Here's a sample of the file names:

c("./Vote/Академический vote 1.xls", "./Vote/Академический vote 2.xls", 
"./Vote/Академический vote 3.xls", "./Vote/Алексеевский в городе Москве vote 1.xls", 
"./Vote/Алексеевский в городе Москве vote 2.xls", "./Vote/Алтуфьевский vote 1.xls", 
"./Vote/Алтуфьевский vote 2.xls", "./Vote/Алтуфьевский vote 3.xls", 
"./Vote/Арбат vote 1.xls", "./Vote/Арбат vote 2.xls", "./Vote/Аэропорт vote 1.xls", 
"./Vote/Аэропорт vote 2.xls", "./Vote/Аэропорт vote 3.xls", "./Vote/Бабушкинский vote 1.xls", 
"./Vote/Бабушкинский vote 2.xls", "./Vote/Басманный vote 1.xls", 
"./Vote/Басманный vote 2.xls", "./Vote/Басманный vote 3.xls", 
"./Vote/Беговой vote 1.xls", "./Vote/Беговой vote 2.xls", "./Vote/Бескудниковский vote 1.xls", 
"./Vote/Бескудниковский vote 2.xls", "./Vote/Бибирево vote 1.xls", 
"./Vote/Бибирево vote 2.xls", "./Vote/Бибирево vote 3.xls")

Solution

  • Rather than skipping inside lapply, you could loop over the unique districts instead of over every file.

    E.g.

    library(stringr)
    
    dat <- c("./Vote/Академический vote 1.xls", "./Vote/Академический vote 2.xls", 
                 "./Vote/Академический vote 3.xls", "./Vote/Алексеевский в городе Москве vote 1.xls", 
                 "./Vote/Алексеевский в городе Москве vote 2.xls", "./Vote/Алтуфьевский vote 1.xls", 
                 "./Vote/Алтуфьевский vote 2.xls", "./Vote/Алтуфьевский vote 3.xls", 
                 "./Vote/Арбат vote 1.xls", "./Vote/Арбат vote 2.xls", "./Vote/Аэропорт vote 1.xls", 
                 "./Vote/Аэропорт vote 2.xls", "./Vote/Аэропорт vote 3.xls", "./Vote/Бабушкинский vote 1.xls", 
                 "./Vote/Бабушкинский vote 2.xls", "./Vote/Басманный vote 1.xls", 
                 "./Vote/Басманный vote 2.xls", "./Vote/Басманный vote 3.xls", 
                 "./Vote/Беговой vote 1.xls", "./Vote/Беговой vote 2.xls", "./Vote/Бескудниковский vote 1.xls", 
                 "./Vote/Бескудниковский vote 2.xls", "./Vote/Бибирево vote 1.xls", 
                 "./Vote/Бибирево vote 2.xls", "./Vote/Бибирево vote 3.xls")
    
    
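    # Extract the district prefix (everything before "vote") from each file name,
    # then collect the files that share each unique prefix.
    districts <- unique(str_extract(dat, ".*(?=vote)"))
    out <- lapply(districts, function(x) {
      # fixed = TRUE treats the prefix as a literal string, not a regex
      dat[grepl(x, dat, fixed = TRUE)]
    })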
    
    > out
    [[1]]
    [1] "./Vote/Академический vote 1.xls" "./Vote/Академический vote 2.xls" "./Vote/Академический vote 3.xls"
    
    [[2]]
    [1] "./Vote/Алексеевский в городе Москве vote 1.xls" "./Vote/Алексеевский в городе Москве vote 2.xls" 
    
    ...etc
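
A related option, sketched here as an assumption rather than as part of the original answer, is to compute a district label for every file and let base R's split() do the grouping. The names files, district and by_district are illustrative, and the sample is just the first eight file names from the question. The sketch also assumes every file name follows the "<district> vote <number>.xls" pattern.

```r
# Small sample taken from the question; in practice this vector might come
# from something like list.files("./Vote", full.names = TRUE).
files <- c("./Vote/Академический vote 1.xls", "./Vote/Академический vote 2.xls",
           "./Vote/Академический vote 3.xls",
           "./Vote/Алексеевский в городе Москве vote 1.xls",
           "./Vote/Алексеевский в городе Москве vote 2.xls",
           "./Vote/Алтуфьевский vote 1.xls", "./Vote/Алтуфьевский vote 2.xls",
           "./Vote/Алтуфьевский vote 3.xls")

# District label: drop the directory, then strip the trailing " vote <n>.xls".
district <- sub(" vote [0-9]+\\.xls$", "", basename(files))

# Use a factor with levels in order of first appearance so the groups keep
# the original file order instead of being sorted by locale.
district <- factor(district, levels = unique(district))

# One list element per municipal district, named by district.
by_district <- split(files, district)
```

Because the result is a named list, by_district[["Алтуфьевский"]] returns that district's files directly, and lapply(by_district, ...) visits each municipal district exactly once, so there is nothing to skip.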