I have a folder with 325 spreadsheets containing election results for different election districts of Moscow. I am trying to group together the files that belong to the same municipal district (a higher level of aggregation) so that I can aggregate election results at that level. (See the end for dput output showing what the file names look like.)
I have created a function that matches the files correctly by extracting the part of the file name before the word "vote", i.e. the municipal district name that precedes the electoral district number:
library(stringr)

mf.vote.matcher <- function(file, filelist){
  # extract everything in the file name before the word "vote" (i.e. the mf name)
  match_string <- str_extract(file, pattern = ".*(?=vote)")
  # indices of all files whose names contain that prefix
  matched_files <- grep(filelist, pattern = match_string)
  # return the matching file names wrapped in a list
  list(filelist[matched_files])
}
However, when applied with lapply to the full file list, it runs once for every file, creating a list with many redundant elements. E.g. there are 3 election districts in the first municipal district, so the function output repeats these 3 file names 3 times.
Is there some way to force the function or lapply to "skip" to the files for the next municipal district based on the length of the returned list?
Here's a sample of the file names:
c("./Vote/Академический vote 1.xls", "./Vote/Академический vote 2.xls",
"./Vote/Академический vote 3.xls", "./Vote/Алексеевский в городе Москве vote 1.xls",
"./Vote/Алексеевский в городе Москве vote 2.xls", "./Vote/Алтуфьевский vote 1.xls",
"./Vote/Алтуфьевский vote 2.xls", "./Vote/Алтуфьевский vote 3.xls",
"./Vote/Арбат vote 1.xls", "./Vote/Арбат vote 2.xls", "./Vote/Аэропорт vote 1.xls",
"./Vote/Аэропорт vote 2.xls", "./Vote/Аэропорт vote 3.xls", "./Vote/Бабушкинский vote 1.xls",
"./Vote/Бабушкинский vote 2.xls", "./Vote/Басманный vote 1.xls",
"./Vote/Басманный vote 2.xls", "./Vote/Басманный vote 3.xls",
"./Vote/Беговой vote 1.xls", "./Vote/Беговой vote 2.xls", "./Vote/Бескудниковский vote 1.xls",
"./Vote/Бескудниковский vote 2.xls", "./Vote/Бибирево vote 1.xls",
"./Vote/Бибирево vote 2.xls", "./Vote/Бибирево vote 3.xls")
Alternatively, you could loop over the unique district prefixes instead of over every file, so each municipal district is matched only once.
E.g.
library(stringr)
dat <- c("./Vote/Академический vote 1.xls", "./Vote/Академический vote 2.xls",
"./Vote/Академический vote 3.xls", "./Vote/Алексеевский в городе Москве vote 1.xls",
"./Vote/Алексеевский в городе Москве vote 2.xls", "./Vote/Алтуфьевский vote 1.xls",
"./Vote/Алтуфьевский vote 2.xls", "./Vote/Алтуфьевский vote 3.xls",
"./Vote/Арбат vote 1.xls", "./Vote/Арбат vote 2.xls", "./Vote/Аэропорт vote 1.xls",
"./Vote/Аэропорт vote 2.xls", "./Vote/Аэропорт vote 3.xls", "./Vote/Бабушкинский vote 1.xls",
"./Vote/Бабушкинский vote 2.xls", "./Vote/Басманный vote 1.xls",
"./Vote/Басманный vote 2.xls", "./Vote/Басманный vote 3.xls",
"./Vote/Беговой vote 1.xls", "./Vote/Беговой vote 2.xls", "./Vote/Бескудниковский vote 1.xls",
"./Vote/Бескудниковский vote 2.xls", "./Vote/Бибирево vote 1.xls",
"./Vote/Бибирево vote 2.xls", "./Vote/Бибирево vote 3.xls")
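# extract the district prefix from every file name and keep each prefix once
districts <- unique(str_extract(dat, ".*(?=vote)"))
# for each prefix, collect the files whose names contain it
out <- lapply(districts, function(x) {
  # fixed = TRUE treats the prefix as a literal string rather than a regex
  dat[grepl(x, dat, fixed = TRUE)]
})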
> out
[[1]]
[1] "./Vote/Академический vote 1.xls" "./Vote/Академический vote 2.xls" "./Vote/Академический vote 3.xls"
[[2]]
[1] "./Vote/Алексеевский в городе Москве vote 1.xls" "./Vote/Алексеевский в городе Москве vote 2.xls"
...etc
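Another option, not from the code above, is to skip the matching step and let base R's split() do the grouping in one pass. This is only a minimal sketch: the names files, district and by_district are made up for illustration, the sample is a shortened copy of the file names in the question, and it assumes every file name ends in " vote N.xls".

```r
# Shortened sample of the file names from the question (illustration only)
files <- c("./Vote/Академический vote 1.xls", "./Vote/Академический vote 2.xls",
           "./Vote/Академический vote 3.xls", "./Vote/Арбат vote 1.xls",
           "./Vote/Арбат vote 2.xls")

# District name = file name without the directory and without the
# trailing " vote N.xls" part (assumed naming pattern)
district <- sub(" vote [0-9]+\\.xls$", "", basename(files))

# split() returns a named list with one element per district,
# each holding the file names that belong to that district
by_district <- split(files, district)
```

The result is named by district, so by_district[["Арбат"]] would give the files for that district, and lapply(by_district, ...) can then read and aggregate each group.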