I have a folder with over 1000 text files showing pollutant levels from particular air quality stations.
I'm looking to combine all these text files into one CSV in R, so I can analyse the data temporally in one place.
Each text file is organised as below, with the unit name, the start time of that particular set of observations, and then the column headers followed by the data.
Example of the headers of my text file 1:
Unit 12345678
Start time: Wed Jan 29 10:57:58 2020
**dd/mm/yyyy hh:mm:ss, PM1, PM2.5, PM10, TSP, RHpre, Tpre, DPpre, RHpost, Tpost, DPpost**
29/01/2020 10:59:00, 1.39, 4.70, 17.11, 172.64, 36.10, 23.11, 7.17, 12.49, 41.26, 7.09
29/01/2020 11:00:00, 1.21, 3.64, 15.68, 26.39, 36.59, 23.12, 7.32, 12.41, 41.52, 7.17
29/01/2020 11:01:00, 1.20, 3.65, 15.12, 93.69, 36.51, 23.18, 7.43, 12.39, 41.68, 7.31
29/01/2020 11:02:00, 1.29, 4.09, 11.93, 15.31, 36.19, 23.22, 7.42, 12.30, 41.79, 7.37
29/01/2020 11:03:00, 1.30, 3.74, 9.06, 11.90, 36.04, 23.26, 7.33, 12.27, 41.88, 7.27
29/01/2020 11:04:00, 1.33, 4.31, 18.62, 44.38, 35.98, 23.28, 7.33, 12.21, 41.97, 7.34
Example of text file 2:
Unit 12345678
Start time: Wed Jan 29 11:14:46 2020
**dd/mm/yyyy hh:mm:ss, PM1, PM2.5, PM10, TSP, RHpre, Tpre, DPpre, RHpost, Tpost, DPpost**
29/01/2020 11:16:00, 1.29, 4.80, 12.68, 14.96, 36.77, 23.15, 7.69, 14.41, 38.14, 6.58
29/01/2020 11:17:00, 1.24, 3.97, 13.30, 18.04, 37.51, 23.13, 7.58, 14.23, 38.57, 6.76
29/01/2020 11:18:00, 1.13, 3.50, 16.80, 60.72, 37.09, 23.16, 7.80, 14.11, 38.89, 6.84
29/01/2020 11:19:00, 1.33, 4.56, 14.23, 71.32, 38.96, 23.22, 8.25, 14.24, 39.15, 7.04
29/01/2020 11:20:00, 1.23, 3.72, 16.87, 22.36, 38.13, 23.29, 8.47, 14.00, 39.39, 7.27
29/01/2020 11:21:00, 1.17, 4.47, 12.30, 15.60, 37.00, 23.34, 8.36, 13.86, 39.62, 7.24
29/01/2020 11:22:00, 1.28, 4.18, 12.80, 229.03, 36.27, 23.36, 7.54, 13.70, 39.85, 7.37
29/01/2020 11:23:00, 1.34, 4.28, 17.27, 96.94, 36.19, 23.37, 7.50, 13.54, 40.05, 7.30
So for each text file, the first line (station ID) and the third line (column names) will remain the same for a particular station, but the second line will change with each output produced by the monitor.
As mentioned above, I am looking to combine all these text files together under the uniform header of the column names (dd/mm/yyyy hh:mm:ss, PM1, PM2.5, PM10, TSP, RHpre, Tpre, DPpre, RHpost, Tpost, DPpost), because this will be consistent for every monitor I have access to, and therefore the code can be easily reproduced.
I've attempted:
mypath = "C:/Desktop/mytxtfolder/"
txt_files_ls = list.files(path=mypath, pattern="*.txt")
txt_files_df <- lapply(txt_files_ls, function(x) {read.table(file = x,skip =3, header = T, sep =",")})
combined_df <- do.call("rbind", lapply(txt_files_df, as.data.frame))
and consistently get the error:
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
I assume this is because the second-line values (the time of upload) do not match, and I am using the function incorrectly when trying to skip the first two rows and combine from the third row onwards.
First of all, I think that do.call(dplyr::bind_rows, txt_files_df) would already solve the error you are seeing with base::rbind, because bind_rows doesn't crash when the columns of its inputs aren't aligned; it just adds new columns to the result in that case.
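As a quick illustration of that behaviour, here is a minimal sketch with two made-up toy data frames (not your data):

library(dplyr)

a <- data.frame(x = 1:2, y = c(10, 20))
b <- data.frame(x = 3:4, y = c(30, 40), z = c("extra", "column"))

# rbind(a, b)   # fails with "numbers of columns of arguments do not match"
bind_rows(a, b) # works: z is simply NA for the rows that came from 'a'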
Secondly, you can also make your code a little cleaner using purrr's map_dfr, which applies a function over the elements of a list and row-binds the results robustly using dplyr. Like this:
library(dplyr)
library(purrr)
library(readr)

combined_df <- purrr::map_dfr(txt_files_ls, function(x) {
  readr::read_csv(x, skip = 3, trim_ws = TRUE)
})
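One thing to double-check (an assumption on my part about how txt_files_ls is built, going by your attempt): read_csv has to be able to locate the files, so either set the working directory to mypath or build the vector with full paths, for example:

txt_files_ls <- list.files(path = mypath, pattern = "\\.txt$", full.names = TRUE)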
However, given the error you are getting, I would guess that either the headers are not always the same or it is not a constant three rows that you need to skip.
You can test for that by going through your list and checking every loaded data frame on whether its colnames are the same as in the first one. For example:
test <- txt_files_df %>%
  purrr::discard(~ identical(colnames(.), colnames(txt_files_df[[1]])))
I'm using purrr::discard to throw out any entry where the colnames are as expected, so your final result should be empty. But if it isn't, you know you need to check your data, or adjust your code to be more robust if that's not possible.
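To see which files the leftover entries came from, you can label the list elements with their file names first (a sketch, assuming txt_files_df was built from txt_files_ls in the same order):

names(txt_files_df) <- txt_files_ls  # label each loaded data frame with its source file
bad_files <- txt_files_df %>%
  purrr::discard(~ identical(colnames(.), colnames(txt_files_df[[1]]))) %>%
  names()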
I'd suggest adding the filename to the data frames as you read them, so that you can identify which file gave you odd input. Also, in case the leading lines are the culprit, let's explicitly check where the header is and skip lines accordingly:
combined_df <- purrr::map_dfr(txt_files_ls, function(x) {
  first_10_lines <- readLines(x, 10L)
  header_line <- min(which(grepl('**dd/mm/yyyy hh:mm:ss', first_10_lines, fixed = TRUE)))
  df <- readr::read_csv(x, skip = header_line - 1, trim_ws = TRUE)
  df$file_name <- x  # so you know which file this data came from
  df
})
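With the file_name column in place you can also sanity-check the combined result, for example by counting how many rows each file contributed (a quick sketch):

combined_df %>%
  dplyr::count(file_name) %>%  # rows contributed by each source file
  dplyr::arrange(n)            # files with suspiciously few rows sort to the top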
// update, responding to OP's issue with non-matching column types:
I am receiving errors
Error: Can't combine PM1 <double> and PM1 <character>
There are two ways to attack this:
1. Specify the column types when reading the files; anything that can't be parsed as a double will turn into <NA>'s and thus be "lost" (you will receive a warning though):
combined_df <- purrr::map_dfr(txt_files_ls, function(x) {
  first_10_lines <- readLines(x, 10L)
  header_line <- min(which(grepl('**dd/mm/yyyy hh:mm:ss', first_10_lines, fixed = TRUE)))
  df <- readr::read_csv(
    x,
    skip = header_line - 1,
    trim_ws = TRUE,
    col_types = cols(
      `**dd/mm/yyyy hh:mm:ss` = col_datetime(format = "%d/%m/%Y %H:%M:%S"),
      .default = col_double()
    )
  )
  df$file_name <- x  # so you know which file this data came from
  df
})
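If you go this route, it's worth checking how many values were coerced to NA and in which files; a minimal sketch using the file_name column added above:

combined_df %>%
  dplyr::group_by(file_name) %>%
  dplyr::summarise(dplyr::across(where(is.numeric), ~ sum(is.na(.x))))  # NA count per pollutant column, per source file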
2. Read everything in as character and let readr::type_convert guess the types later, after row-binding:
combined_df <- purrr::map_dfr(txt_files_ls, function(x) {
  first_10_lines <- readLines(x, 10L)
  header_line <- min(which(grepl('**dd/mm/yyyy hh:mm:ss', first_10_lines, fixed = TRUE)))
  df <- readr::read_csv(
    x,
    skip = header_line - 1,
    trim_ws = TRUE,
    col_types = cols(
      `**dd/mm/yyyy hh:mm:ss` = col_datetime(format = "%d/%m/%Y %H:%M:%S"),
      .default = col_character()
    )
  )
  df$file_name <- x  # so you know which file this data came from
  df
}) %>%
  readr::type_convert()
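Finally, since the goal was one combined CSV, you can write the result back out once you're happy with it (the output file name here is just a placeholder):

readr::write_csv(combined_df, file.path(mypath, "combined_air_quality.csv"))  # placeholder file name, adjust as needed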