I would like to combine data frames in the global environment according to a pattern in their names, and at the same time add a column recording the name of the file each one originally came from.
My problem is that I start from a zip file with over 20 text files in the main folder and sub-folders, which mainly cover two scenarios: "test" and "train". Hence, I decided to first read ALL of the txt files into R, create two lists of df names matching either the "test" or the "train" pattern, and use those lists to merge the data frames into two main data frames. Now I need to combine the data frames according to the names in each list, but rbind just creates another list of their names. How can I make rbind treat its inputs as the objects named in the list, not as strings?
Moreover, rbind combines the dfs without any way to add a variable holding their names. Is there a solution that combines the dfs and adds the df name as a column at the same time?
What I did so far:
#loading the necessary libraries
library(dplyr)
library(readr)
library(easycsv)
#setting url and directory of the data file
url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"
destination <- "accelerometer_data.zip"
#downloading the file and storing it into computer memory
download.file(url, destfile = destination)
#read all txt files into R
test_folder <- easycsv::fread_zip(file = destination,
                                  extension = "TXT")
#create a list of "test" data frames
list_test <- as.list(
  do.call(cbind, ls(
    grep(pattern = "^UCI+(.*)test",
         x = ls(),
         value = TRUE)
    )
  )
)
#bind dfs as named in list_test
test_df <- lapply(list_test, FUN = function(x) {
  rbind(
    eval(
      parse(text = x)
    )
  )
})
You can use mget to get all the data frames whose names match a specific pattern as a named list, then use dplyr::bind_rows to combine them into one data frame, and use its .id parameter to include the file name as a separate column.
library(dplyr)
test_data <- bind_rows(mget(grep(pattern = "^UCI+(.*)test", x = ls(),
                                 value = TRUE)), .id = 'filename')
train_data <- bind_rows(mget(grep(pattern = "^UCI+(.*)train", x = ls(),
                                  value = TRUE)), .id = 'filename')
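To see what each step does, here is a minimal sketch on toy data. The object names (UCI_demo_test_a and so on) and the scratch environment env are made up for illustration and are not the names fread_zip produces; the pattern is also simplified to "^UCI.*test".

```r
library(dplyr)

# Hypothetical stand-ins for the data frames read from the zip file,
# stored in a scratch environment instead of the global one
env <- new.env()
assign("UCI_demo_test_a",  data.frame(x = 1:2, y = c("a", "b")), envir = env)
assign("UCI_demo_test_b",  data.frame(x = 3:4, y = c("c", "d")), envir = env)
assign("UCI_demo_train_a", data.frame(x = 5:6, y = c("e", "f")), envir = env)

# grep() returns only the matching names, as a character vector
test_names <- grep("^UCI.*test", ls(envir = env), value = TRUE)

# mget() looks those names up and returns a named list of the objects
test_list <- mget(test_names, envir = env)

# bind_rows() stacks the list; .id stores each list name in a new column
test_data <- bind_rows(test_list, .id = "filename")
```

The key point is that mget() turns strings into the objects they name, which is the step that eval(parse(text = x)) was trying to do by hand. With base R only, do.call(rbind, test_list) would stack the same list, but it would not add the name column.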
However, the 'test' and 'train' files contain data frames with different numbers of columns, so after binding you get columns that hold only NAs for the rows coming from some files. Maybe you need to make the pattern stricter so that it matches only data frames with the same structure?
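If a stricter pattern is hard to write, one possible alternative (an assumption on my part, not something the data set requires) is to group the data frames by their number of columns and bind each group separately. The list dfs below uses made-up names and stands in for the result of mget().

```r
library(dplyr)

# Hypothetical stand-ins: two two-column tables and one single-column table
dfs <- list(
  UCI_demo_test_X_a = data.frame(V1 = 1:2, V2 = 3:4),
  UCI_demo_test_X_b = data.frame(V1 = 5:6, V2 = 7:8),
  UCI_demo_test_y   = data.frame(V1 = 1:2)
)

# Group the data frames by their number of columns
by_width <- split(dfs, vapply(dfs, ncol, integer(1)))

# Bind each group on its own, keeping the source name in `filename`
combined <- lapply(by_width, bind_rows, .id = "filename")
```

Here combined is a list with one data frame per column count, so tables of different widths are never padded with NA. Equal column counts do not guarantee equal meaning, so it is still worth checking that the grouped files really belong together.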