Search code examples
rloopsmerge

Loop through a folder with .txt files and merge with a different file R


I want to loop through a folder that contains 250 .txt files and merge them with a single .txt file outside of this folder based on a common id column. Each file in the folder has over 6 million rows and is approximately 1GB on disk, so creating a list with them and reading them all at once is not feasible. The other_dataset is much smaller and is about 2000 rows of all unique values in the id column (no duplicates in this column.)

user_1 through user_250 are the of 250 files in a folder. other_dataset is the other .txt file I would like to merge with.

Sample data (with reduced columns) is included below:

user_1<- structure(list(ID2 = c(3481890, 3500406, 3507786, 3507978, 3512641, 3528872, 3546395, 3546395, 3572638, 3578447, 3581236, 3581236, 3581236, 3581236, 3599403, 3602306, 3603380, 3604665, 3612597, 3623200, 3623200), country = c("India", "India", "India", "Israel", "India", "India", "India", "India", "Belgium", "Israel", 
"India", "India", "India", "India", "India", "India", "United States", 
"India", "Bulgaria", "India", "India"), id = c(197273, 197273, 
197273, 197273, 197273, 197273, 197273, 197273, 197273, 197273, 
197273, 197273, 197273, 197273, 197273, 197273, 197273, 197273, 
197273, 197273, 197273)), row.names = 2000000:2000020, class = "data.frame")
.
.
.
user_250<- structure(list(ID2 = c(1707253, 3039610, 3243469, 1613433, 
1871123, 1925738, 4313027, 1889109, 4554686, 1090264, 2848825, 
1194603, 1489041, 1489041, 1955845, 3401637, 2134398, 2216639, 
2216639, 2007909, 2311831), country = c("United States", "United States", 
"United States", "Philippines", "United States", "India", "Chile", 
"Bahrain", "Ethiopia", "Romania", "Brazil", "Brazil", "Mexico", 
"Mexico", "Spain", "Spain", "United Kingdom", "United States", 
"United States", "", "Australia"), id = c(92782473, 92782473, 
92782473, 92782474, NA_real_, NA_real_, NA_real_, NA_real_, 92782477, 
92782479, 92782479, 92782480, 92782481, 92782481, 92782481, 92782481, 
92782486, 92782487, 92782487, 92782488, 92782488)), row.names = 6000500:6000520, class = "data.frame")

The following is the dataset I would like to merge the above data with.

other_dataset<- structure(list(id = c(197273, 92782473, 
197272, 197274, 197275, 197276, 197277, 708823, 708824, 708825, 92782487, 92782488, 92782489, 2782479, 92782475, 92782483, 92782481, 708822, 708824, 
708835), year = c(1951, 1951L, 1951, 1951, 1951, 
1951, 1951, 1951, 1951, 1951, 1951, 1951, 1951, 1951, 
1951, 1951, 1951, 1951, 1951, 1951)), row.names = c(NA, 
20L), class = "data.frame")

While there are similar questions such as This question as my question includes a large number of files (250) versus the 5 files included in this question. I also would like to create a loop to merge the 250.txt inside this folder with the single .txt file outside of this folder.

While I could make a list with the following method, due to the very large size of each of the 250 .txt files I would rather not do that as it takes a very long time. Is there a way to do an inner_join with the other_dataset one by one in a loop and only keep the relevant rows (matching ids)?

my_files <- lapply(Sys.glob("/Users/myname/user_positions/user_*.txt"), read.delim)

Solution

  • Maybe something like this? [EDIT: changed from csv to txt]

    library(dplyr)
    
    # save files to temp directory, as example
    # These have one unneeded column -- imagine it's lots
    td <- tempdir()
    readr::write_tsv(user_1 |> mutate(unneeded_col = 1), file.path(td, "user_001.txt"))
    readr::write_tsv(user_250 |> mutate(unneeded_col = 1), file.path(td, "user_250.txt"))
    
    # get file names & load into named list
    my_files <- list.files(td, pattern = "*.txt", full.names = TRUE)
    
    # function to read file and just keep certain columns
    read_and_pare <- \(x) {
      x |> 
        read_delim() |>
        select(ID2, country, id)
    }
    # read data files into list, stack into one table, join to other table
    purrr::map(my_files, read_and_pare) |>
      setNames(my_files) |>
      bind_rows(.id = "user_file") |>
      left_join(other_dataset,
                join_by(id == id))