I have a large number of text files, each containing rows of data plus extra information, and I would like to loop through the files and combine the data of interest into a single dataframe.
Each text file contains irrelevant content (rows of sentences, etc.) before and after the actual data, but the number of those rows is highly inconsistent across files, so I cannot use the usual arguments like skip or n_max to specify which rows to read.
The only consistent patterns in the files are:
- the header row begins with "Index";
- the data block is enclosed by dashed lines;
- the first row after the data block begins with "finished".

Examples of the data files are below.

File 1:
i dont care
not important
this row is not important
Header starts on the next row
Index Date Time DP1 Name
--------------------------------------------------
1 07-20-22 17:48:06 3792123 machine 3
2 07-20-22 17:38:06 379211 machine 3
3 07-20-22 19:28:06 machine
4 07-20-22 19:48:06 379245 machine
5 07-20-22 17:58:06 37921 machine 2
--------------------------------------------------
finished blah blah
more rows
File 2:
i dont care about this row and would like to remove it
Header starts on the next row
Index Date Time DP1 Name
--------------------------------------------------
1 07-20-22 17:48:06 machine 4
2 07-20-22 17:38:06 machine 8
3 07-20-22 19:28:06 machine
10 07-20-22 19:48:06 379245 machine
11 07-20-22 17:58:06 37921 machine 10
--------------------------------------------------
finished blah blah
Note the following: the DP1 column can have missing values, and the Name column could be one word or contain a space between a word and a number.

Is there a way to use these consistent patterns to loop through the files and compile the data of interest without having to touch the raw text files? My interest in this is not only speed in manipulating the data, but also avoiding the human-induced error and lack of transparency that could creep in if I edited the raw files by hand.
Here is my attempt at a solution. Let me know if it works or if it needs edits.
The first line reads the file in as a single character string, and the second splits it on newlines. That produces a list, so we unlist() to get a character vector and then convert it to a one-column tibble. From there:
- slice() keeps the rows from the one starting with "index" (optionally preceded by spaces) down to two rows above the one starting with "finished";
- filter() drops the blank or dashed rows that contain no letters or digits;
- we trim the column, replace any very long run of spaces (15 or more) with NA to mark gaps such as missing DP1 values, and replace the pattern letter-space-number with letter_number so we can safely split on whitespace in the next step;
- lastly we split on whitespace, unnest_wider(), promote the first row to column names, and do some cleanup work.
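To make the two str_replace_all() substitutions concrete, here is a minimal standalone sketch (the sample strings and the helper name fix_row() are mine, mimicking the files above):

library(stringr)

fix_row <- function(x) {
  str_replace_all(x, c("\\s{15,}" = " NA ",                    # long gap -> missing value marker
                       "(?<=[:alpha:])\\s{1,2}(?=\\d)" = "_")) # "machine 2" -> "machine_2"
}

fix_row("5   07-20-22   17:58:06   37921   machine 2")
#> [1] "5   07-20-22   17:58:06   37921   machine_2"
fix_row("3   07-20-22   19:28:06                     machine")
#> [1] "3   07-20-22   19:28:06 NA machine"

The full pipeline: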
library(tidyverse)
library(janitor)
library(lubridate)

read_file("/Users/davidcsuka/Desktop/file1.txt") %>%
  str_split("\\n") %>%                  # one element per line
  unlist() %>%
  as_tibble_col("col1") %>%
  # keep rows from the "Index" header down to two rows above "finished"
  slice(which(str_detect(col1, "^\\s*[Ii]ndex"))[1]:
          (which(str_detect(col1, "^\\s*[Ff]inished"))[1] - 2)) %>%
  # drop blank or dashed rows that contain no letters or digits
  filter(str_detect(col1, "[:alnum:]")) %>%
  mutate(col1 = str_trim(col1),
         # long gaps (15+ spaces) become NA; "machine 3" becomes "machine_3"
         col1 = str_replace_all(col1, c("\\s{15,}" = " NA ",
                                        "(?<=[:alpha:])\\s{1,2}(?=\\d)" = "_")),
         col1 = str_split(col1, "\\s+")) %>%
  unnest_wider(col1, names_sep = "_") %>%
  row_to_names(1) %>%                   # first remaining row holds the column names
  unite(col = DateTime, matches("[Dd]ate|[Tt]ime"), sep = " ", na.rm = FALSE) %>%
  type_convert() %>%
  mutate(DateTime = parse_date_time(DateTime, "%m%d%y%H%M%S"))
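For the looping part of the question, a minimal sketch: wrap the pipeline in a function and map it over every file. This assumes all files sit in one folder and share the layout above; the folder path, the file pattern, and the helper name read_data_block() are my own placeholders.

read_data_block <- function(path) {
  read_file(path) %>%
    str_split("\\n") %>%
    unlist() %>%
    as_tibble_col("col1") %>%
    slice(which(str_detect(col1, "^\\s*[Ii]ndex"))[1]:
            (which(str_detect(col1, "^\\s*[Ff]inished"))[1] - 2)) %>%
    filter(str_detect(col1, "[:alnum:]")) %>%
    mutate(col1 = str_trim(col1),
           col1 = str_replace_all(col1, c("\\s{15,}" = " NA ",
                                          "(?<=[:alpha:])\\s{1,2}(?=\\d)" = "_")),
           col1 = str_split(col1, "\\s+")) %>%
    unnest_wider(col1, names_sep = "_") %>%
    row_to_names(1) %>%
    unite(col = DateTime, matches("[Dd]ate|[Tt]ime"), sep = " ", na.rm = FALSE) %>%
    type_convert() %>%
    mutate(DateTime = parse_date_time(DateTime, "%m%d%y%H%M%S"))
}

files <- list.files("/Users/davidcsuka/Desktop",
                    pattern = "^file\\d+\\.txt$", full.names = TRUE)

combined <- files %>%
  set_names() %>%                       # so map_dfr() can record each path
  map_dfr(read_data_block, .id = "source_file")

set_names() with no extra arguments names the vector with its own values, so map_dfr() stores each file's path in a source_file column; keeping the provenance of every row visible also helps with the transparency goal.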