I want to loop through a folder that contains 250 .txt files and merge each of them with a single .txt file outside of this folder, based on a common id column. Each file in the folder has over 6 million rows and is approximately 1 GB on disk, so creating a list of them and reading them all at once is not feasible. The other file, other_dataset, is much smaller: about 2,000 rows, all with unique values in the id column (no duplicates in this column).
user_1 through user_250 are the 250 files in the folder; other_dataset is the single .txt file I would like to merge them with.
Sample data (with reduced columns) is included below:
user_1<- structure(list(ID2 = c(3481890, 3500406, 3507786, 3507978, 3512641, 3528872, 3546395, 3546395, 3572638, 3578447, 3581236, 3581236, 3581236, 3581236, 3599403, 3602306, 3603380, 3604665, 3612597, 3623200, 3623200), country = c("India", "India", "India", "Israel", "India", "India", "India", "India", "Belgium", "Israel",
"India", "India", "India", "India", "India", "India", "United States",
"India", "Bulgaria", "India", "India"), id = c(197273, 197273,
197273, 197273, 197273, 197273, 197273, 197273, 197273, 197273,
197273, 197273, 197273, 197273, 197273, 197273, 197273, 197273,
197273, 197273, 197273)), row.names = 2000000:2000020, class = "data.frame")
.
.
.
user_250<- structure(list(ID2 = c(1707253, 3039610, 3243469, 1613433,
1871123, 1925738, 4313027, 1889109, 4554686, 1090264, 2848825,
1194603, 1489041, 1489041, 1955845, 3401637, 2134398, 2216639,
2216639, 2007909, 2311831), country = c("United States", "United States",
"United States", "Philippines", "United States", "India", "Chile",
"Bahrain", "Ethiopia", "Romania", "Brazil", "Brazil", "Mexico",
"Mexico", "Spain", "Spain", "United Kingdom", "United States",
"United States", "", "Australia"), id = c(92782473, 92782473,
92782473, 92782474, NA_real_, NA_real_, NA_real_, NA_real_, 92782477,
92782479, 92782479, 92782480, 92782481, 92782481, 92782481, 92782481,
92782486, 92782487, 92782487, 92782488, 92782488)), row.names = 6000500:6000520, class = "data.frame")
The following is the dataset I would like to merge the above data with.
other_dataset<- structure(list(id = c(197273, 92782473,
197272, 197274, 197275, 197276, 197277, 708823, 708824, 708825, 92782487, 92782488, 92782489, 2782479, 92782475, 92782483, 92782481, 708822, 708824,
708835), year = c(1951, 1951L, 1951, 1951, 1951,
1951, 1951, 1951, 1951, 1951, 1951, 1951, 1951, 1951,
1951, 1951, 1951, 1951, 1951, 1951)), row.names = c(NA,
20L), class = "data.frame")
There are similar questions, such as this question, but mine involves a large number of files (250) versus the 5 files in that question, and I would like a loop that merges the 250 .txt files inside the folder with the single .txt file outside of it. While I could build a list with the following method, given the very large size of each of the 250 .txt files I would rather not, as it takes a very long time. Is there a way to do an inner_join with other_dataset one file at a time in a loop, keeping only the relevant rows (matching ids)?
my_files <- lapply(Sys.glob("/Users/myname/user_positions/user_*.txt"), read.delim)
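One way to keep only matching rows without ever holding all 250 files in memory is to inner_join each file to other_dataset as it is read, inside the loop. Below is a minimal sketch using hypothetical stand-in files written to a temp directory (it assumes the real files are tab-delimited with a header row; the stand-in data is not the sample data above):

```r
library(dplyr)

# Hypothetical stand-ins for the real folder of 250 large files
td <- file.path(tempdir(), "demo_users")
dir.create(td, showWarnings = FALSE)
write.table(data.frame(ID2 = 1:3, country = c("India", "Chile", "Spain"),
                       id = c(197273, 999999, 92782473)),
            file.path(td, "user_001.txt"), sep = "\t", row.names = FALSE)
write.table(data.frame(ID2 = 4:5, country = c("Brazil", "Mexico"),
                       id = c(92782488, 555555)),
            file.path(td, "user_002.txt"), sep = "\t", row.names = FALSE)
other_dataset <- data.frame(id = c(197273, 92782473, 92782488), year = 1951)

# Join each file as it is read: only one file's contents are in memory
# at a time, and non-matching rows are dropped before results accumulate
files <- Sys.glob(file.path(td, "user_*.txt"))
merged <- lapply(files, function(f) {
  read.delim(f) |>
    inner_join(other_dataset, by = "id")
}) |>
  bind_rows()
```

For the real data, the glob pattern would be "/Users/myname/user_positions/user_*.txt"; merged then contains only rows whose id appears in other_dataset.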
Maybe something like this? [EDIT: changed from csv to txt]
library(dplyr)

# save files to a temp directory, as an example
# These have one unneeded column -- imagine it's lots
td <- tempdir()
readr::write_tsv(user_1 |> mutate(unneeded_col = 1), file.path(td, "user_001.txt"))
readr::write_tsv(user_250 |> mutate(unneeded_col = 1), file.path(td, "user_250.txt"))

# get file names (pattern is a regex, so escape the dot)
my_files <- list.files(td, pattern = "\\.txt$", full.names = TRUE)

# function to read one file and keep just the needed columns
read_and_pare <- \(x) {
  x |>
    readr::read_delim(show_col_types = FALSE) |>
    select(ID2, country, id)
}

# read data files into a named list, stack into one table, join to other
# table; inner_join keeps only rows whose id appears in other_dataset
purrr::map(my_files, read_and_pare) |>
  setNames(my_files) |>
  bind_rows(.id = "user_file") |>
  inner_join(other_dataset,
             join_by(id == id))
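If read speed is also a bottleneck, the same join-as-you-read idea can be sketched with data.table (assuming the data.table package is available): fread's select argument skips unneeded columns at parse time, and merge() on data.tables is an inner join by default. The files below are hypothetical stand-ins, as in the example above:

```r
library(data.table)

# Hypothetical stand-in files with an extra unneeded column
td <- file.path(tempdir(), "demo_users_dt")
dir.create(td, showWarnings = FALSE)
fwrite(data.table(ID2 = 1:3, country = c("India", "Chile", "Spain"),
                  id = c(197273, 999999, 92782473), junk = 0),
       file.path(td, "user_001.txt"), sep = "\t")
fwrite(data.table(ID2 = 4:5, country = c("Brazil", "Mexico"),
                  id = c(92782488, 555555), junk = 0),
       file.path(td, "user_002.txt"), sep = "\t")
other_dt <- data.table(id = c(197273, 92782473, 92782488), year = 1951)

# select reads only the needed columns; merge drops non-matching ids
files <- Sys.glob(file.path(td, "user_*.txt"))
result <- rbindlist(lapply(files, function(f) {
  fread(f, select = c("ID2", "country", "id")) |>
    merge(other_dt, by = "id")
}))
```

As with the dplyr version, each file is reduced to its matching rows before the results are stacked, so peak memory stays near the size of one file plus the accumulated matches.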