The following data is read in with readLines(). It need to be converted into a data frame. The problem is the 3rd column is sometimes blank. Is there a way to split the column by fixed length like in Excel, or other way to handle the 3rd column? Thanks for any help.
CSTE DFHTACP 255 N 0 0 0 0 0 0
CSTP DFHZCSTP 255 N 0 0 0 0 0 0
CSXM DFHXMAB 0 N 0 0 0 0 0 0
CSZI DFHSZRMP 1 N 0 0 0 0 0 0
CTIN DFHZCT1 DFHCOMCL 254 N 0 0 0 0 0 0
CTSD DFHTSDQ DFHTSDEL 254 N 0 0 0 0 0 0
CVMI DFHMIRS 1 N 0 0 0 0 0 0
CWBA DFHWBA 15 N 6124306 0 0 0 0 0
1) read.fwf This can be done using read.fwf
as mentioned in the comments. This is straight-forward although it requires tedious determination of the field widths.
w <- c(7, 11, 10, 5, 22, 10, 9, 9, 9, 9, 12)
read.fwf("myfile.dat", w)
The remaining methods assume there is at least one space between each field. That is at least the case in the sample input.
2) read.pattern() This requires the development version of gsubfn package. This does require a long regular expression although its a reasonably straight forward one matching spaces (" +"
), non-spaces ("\\S+"
) and optional non-spaces ("\\S*"
).
library(gsubfn)
library(devtools) # source_url supports https whereas source only supports http
source_url("https://gsubfn.googlecode.com/svn/trunk/R/read.pattern.R") # from dev repo
# the third capture group is "\\S*" whereas the others are "\\S+"
pat_ <- rep(c("(\\S+) +", "(\\S*) +", "(\\S+) +", "(\\S+)"), c(2, 1, 7, 1))
pat <- paste0( c( "^ *", pat_, " *$"), collapse = "")
read.pattern("myfile.dat", pattern = pat, as.is = TRUE)
3) gsubfn() With slightly more work we can use gsubfn()
in the same package in which case we can use the CRAN version of the package. Use pat
defined above:
Lines <- readLines("myfile.dat")
tmp <- gsubfn(pat, ... ~ paste(..., sep = ","), Lines)
read.table(text = tmp, sep = ",", as.is = TRUE)
4) count.fields We can explicitly count fields in each line to get another solution. This requires no addon packages. It avoids manual field width counting but does require more code.
Lines <- readLines("myfile.dat")
k <- count.fields("myfile.dat")
Lines2 <- gsub("^ *| *$", "", Lines) # trim whitespace from beginning and end
Lines3 <- ifelse(k == 10, sub("^(\\S+ *\\S+)", "\\1,", Lines2), Lines2) # insert extra ,
Lines4 <- gsub(" +", ",", Lines3) # replace each string of spaces with a ,
read.table(text = Lines4, sep = ",", as.is = TRUE)
5) sub Slightly shorter and still without addon packages is this one. We use sub
to insert an extra comma into the short lines and then replace each string of spaces with a comma and read:
Lines <- readLines("myfile.dat")
pat2 <- "^( *(\\S+ +){2})((\\S+ +){7}\\S+) *$"
g <- gsub(" +", ",", sub(pat2, "\\1,\\3", Lines))
read.table(text = g, sep = ",", as.is = TRUE)
REVISED Additional solutions.