Search code examples
rdataframetextimporttext-files

Reading text file regions using patterns


I have a large number of text files containing rows of data and extra information. I would like to loop through the files and combine the data of interest into a single dataframe.

Each text file contains random information (rows of sentences, ect..) that i dont care about before and after actual data, but the exact number of rows before and after the data are highly inconsistent across text files. Thus, I cannot use typical arguments like skip or n_max to specify the rows I wish to read.

The only consistent patterns in the files are:

  • before the data starts, there is a row containing the column headers for the data, and a row containing series of dashes
  • When the data ends, there is a blank row, followed by a row that starts with the word "finished", and another row of dashes

examples of the data files are below: File 1:

i dont care
not important
this row is not important
Header starts on the next row 
Index   Date      Time        DP1     Name
--------------------------------------------------
     1  07-20-22  17:48:06    3792123           machine 3
     2  07-20-22  17:38:06    379211            machine 3
     3  07-20-22  19:28:06                      machine
     4  07-20-22  19:48:06    379245            machine 
     5  07-20-22  17:58:06    37921             machine 2

--------------------------------------------------
finished blah blah
more rows

File2:

i dont care about this row and would like to remove it
Header starts on the next row 
Index   Date      Time        DP1     Name
--------------------------------------------------
     1  07-20-22  17:48:06                      machine 4
     2  07-20-22  17:38:06                      machine 8
     3  07-20-22  19:28:06                      machine
     10 07-20-22  19:48:06    379245            machine 
     11 07-20-22  17:58:06    37921             machine 10

--------------------------------------------------
finished blah blah

Note the following:

  • possible blanks in the fourth column DP1
  • inconsistent spacing between datapoints
  • unpredictable lengths of words and sentences above and below the "data"
  • the Name column could be one word or contain a space between a word and a number

Is there a way to use consistent patterns to loop through these files and compile the data of interest without having to touch the raw text files? My interest in this is not only for speed in manipulating the data, but to remove human-induced error and lack of transparency that could occur if I manipulate the raw files by hand.


Solution

  • Here is my attempt at a solution. Let me know if it works. If it requires edits, please provide many diverse examples to illustrate your expectations.

    The first line reads the file in as a character string, the second splits on newlines. However that produces a listed output, so we have to unlist to obtain a vector. We convert the vector to a tibble column, then we slice from "index" (optionally preceded by spaces) to two rows above finished. The filter steps removes blank or dashed rows with no numbers or letters. We trim the column, and if there are any very long spaces (15 or longer) we put an NA there to fill in gaps. We also replace the pattern letter space number with letter_number so we can split on spaces in the next step. Lastly, we unnest wider then do some cleanup work.

    library(tidyverse)
    library(janitor)
    library(lubridate)
    
    read_file("/Users/davidcsuka/Desktop/file1.txt") %>%
      str_split("\\n") %>%
      unlist() %>%
      as_tibble_col("col1") %>%
      slice((which(str_detect(col1, "^\\s*[Ii]ndex"))[1]):(which(str_detect(col1, "^\\s*[Ff]inished"))[1]-2)) %>%
      filter(str_detect(col1, "[:alnum:]")) %>%
      mutate(col1 = str_trim(col1),
             col1 = str_replace_all(col1, c("\\s{15,}" = " NA ",
                                            "(?<=[:alpha:])\\s{1,2}(?=\\d)" = "_")),
             col1 = str_split(col1, "\\s+")) %>%
      unnest_wider(col1, names_sep = "_") %>%
      row_to_names(1) %>%
      unite(col = DateTime, matches("[Dd]ate|[Tt]ime"), sep = " ", na.rm = FALSE) %>%
      type_convert() %>%
      mutate(DateTime = parse_date_time(DateTime, "%m%d%y%H%M%S"))