I'm trying to come up with a neat/fast way to read files delimited by newline (\n
) characters into more than one column.
Essentially in a given input file, multiple rows in the input file should become a single row in the output, however most file reading functions sensibly interpret the newline character as signifying a new row, and so they end up as a data frame with a single column. Here's an example:
The input files look like this:
Header Info
2021-01-01
text
...
@
2021-01-02
text
...
@
...
Where the ...
represents potentially multiple rows in the input file, and the @
signifies what should really be the end of a row in the output data frame. So upon reading this file, it should become a data frame like this (ignoring the header):
X1 | X2 | ... | Xn |
---|---|---|---|
2021-01-01 | text | ... | ... |
2021-01-02 | text | ... | ... |
... | ... | ... | ... |
I've tried base
, data.table
, readr
and vroom
, and they all have one of two outputs, either a data frame with a single column, or a vector. I want to avoid a for loop, and so my current solution is using base::readLines()
, to read it as a character vector, then manually adding some "proper" column separators (e.g. ;
), and then joining and splitting again.
# Save the example data to use as input
writeLines(c("Header Info", "2021-01-01", "text", "@", "2021-01-02", "text", "@"), "input.txt")
input <- readLines("input.txt")
input <- paste(input[2:length(input)], collapse = ";") # Skip the header
input <- gsub(";@;*", replacement = "\n", x = input)
input <- strsplit(unlist(strsplit(input, "\n")), ";")
input <- do.call(rbind.data.frame, input)
# Clean up the example input
unlink("input.txt")
My code above works and gives the desired result, but surely there's a better way?? Edit: This is internal in a function, so part (perhaps the larger part) of the intention of any simplification is to improve the speed.
Thanks in advance!
1) Read in the data, locate the @ signs giving logical variable at and then create a grouping variable g which has distinct values for each desired line. Finally use tapply with paste to rework it into lines that can be read using read.table and read it. (If there are commas in the data then use some other separating character.)
L <- readLines("input.txt")[-1]
at <- grepl("@", L)
g <- cumsum(at)
read.table(text = tapply(L[!at], g[!at], paste, collapse = ","),
sep = ",", col.names = cnames)
giving this data frame:
V1 V2
1 2021-01-01 text
2 2021-01-02 text
2) Another approach is to rework the data into dcf form by removing the @ sign and prefacing other lines with their column name and a colon. Then use read.dcf. cnames is a character vector of column names that you want to use.
cnames <- c("Date", "Text")
L <- readLines("input.txt")[-1]
LL <- sub("@", "", paste0(c(paste0(cnames, ": "), ""), L))
DF <- as.data.frame(read.dcf(textConnection(LL)))
DF[] <- lapply(DF, type.convert, as.is = TRUE)
DF
giving this data frame:
Date Text
1 2021-01-01 text
2 2021-01-02 text
3) This approach simply reshapes the data into a matrix and then converts it to a data frame. Note that (1) converts numeric columns to numeric whereas this one just leaves them as character.
L <- readLines("input.txt")[-1]
k <- grep("@", L)[1]
as.data.frame(matrix(L, ncol = k, byrow = TRUE))[, -k]
## V1 V2
## 1 2021-01-01 text
## 2 2021-01-02 text
The question did not mention speed as a consideration but in a comment it was later mentioned. Based on the data in the benchmark below (1) runs over twice as fast as the code in the question and (3) runs nearly 25x faster.
library(microbenchmark)
writeLines(c("Header Info",
rep(c("2021-01-01", "text", "@", "2021-01-02", "text", "@"), 10000)),
"input.txt")
library(microbenchmark)
writeLines(c("Header Info", rep(c("2021-01-01", "text", "@", "2021-01-02", "text", "@"), 10000)), "input.txt")
microbenchmark(times = 10,
ques = {
input <- readLines("input.txt")
input <- paste(input[2:length(input)], collapse = ";") # Skip the header
input <- gsub(";@;*", replacement = "\n", x = input)
input <- strsplit(unlist(strsplit(input, "\n")), ";")
input <- do.call(rbind.data.frame, input)
},
ans1 = {
L <- readLines("input.txt")[-1]
at <- grepl("@", L)
g <- cumsum(at)
read.table(text = tapply(L[!at], g[!at], paste, collapse = ","), sep = ",")
},
ans3 = {
L <- readLines("input.txt")[-1]
k <- grep("@", L)[1]
as.data.frame(matrix(L, ncol = k, byrow = TRUE))[, -k]
})
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## ques 1146.62 1179.65 1188.74 1194.78 1200.11 1219.01 10 c
## ans1 518.95 522.75 548.33 532.59 561.55 647.14 10 b
## ans3 50.47 51.19 51.68 51.69 52.25 52.52 10 a