Search code examples
rsplitfixed

Column split by fixed length in R


The following data is read in with readLines(). It need to be converted into a data frame. The problem is the 3rd column is sometimes blank. Is there a way to split the column by fixed length like in Excel, or other way to handle the 3rd column? Thanks for any help.

   CSTE   DFHTACP             255                     N         0        0        0        0        0           0
   CSTP   DFHZCSTP            255                     N         0        0        0        0        0           0
   CSXM   DFHXMAB               0                     N         0        0        0        0        0           0
   CSZI   DFHSZRMP              1                     N         0        0        0        0        0           0
   CTIN   DFHZCT1   DFHCOMCL  254                     N         0        0        0        0        0           0
   CTSD   DFHTSDQ   DFHTSDEL  254                     N         0        0        0        0        0           0
   CVMI   DFHMIRS               1                     N         0        0        0        0        0           0
   CWBA   DFHWBA               15                     N   6124306        0        0        0        0           0

Solution

  • 1) read.fwf This can be done using read.fwf as mentioned in the comments. This is straight-forward although it requires tedious determination of the field widths.

    w <- c(7, 11, 10, 5, 22, 10, 9, 9, 9, 9, 12)
    read.fwf("myfile.dat", w)
    

    The remaining methods assume there is at least one space between each field. That is at least the case in the sample input.

    2) read.pattern() This requires the development version of gsubfn package. This does require a long regular expression although its a reasonably straight forward one matching spaces (" +"), non-spaces ("\\S+") and optional non-spaces ("\\S*").

    library(gsubfn)
    library(devtools) # source_url supports https whereas source only supports http
    source_url("https://gsubfn.googlecode.com/svn/trunk/R/read.pattern.R") # from dev repo
    
    # the third capture group is "\\S*" whereas the others are "\\S+"
    pat_ <- rep(c("(\\S+) +", "(\\S*) +", "(\\S+) +", "(\\S+)"), c(2, 1, 7, 1))
    pat <- paste0( c( "^ *", pat_, " *$"), collapse = "")
    read.pattern("myfile.dat", pattern = pat, as.is = TRUE)
    

    3) gsubfn() With slightly more work we can use gsubfn() in the same package in which case we can use the CRAN version of the package. Use pat defined above:

    Lines <- readLines("myfile.dat")
    tmp <- gsubfn(pat, ... ~ paste(..., sep = ","), Lines)
    read.table(text = tmp, sep = ",", as.is = TRUE)
    

    4) count.fields We can explicitly count fields in each line to get another solution. This requires no addon packages. It avoids manual field width counting but does require more code.

    Lines <- readLines("myfile.dat")
    
    k <- count.fields("myfile.dat")
    Lines2 <- gsub("^ *| *$", "", Lines) # trim whitespace from beginning and end
    Lines3 <- ifelse(k == 10, sub("^(\\S+ *\\S+)", "\\1,", Lines2), Lines2) # insert extra ,
    
    Lines4 <- gsub(" +", ",", Lines3) # replace each string of spaces with a ,
    read.table(text = Lines4, sep = ",", as.is = TRUE)
    

    5) sub Slightly shorter and still without addon packages is this one. We use sub to insert an extra comma into the short lines and then replace each string of spaces with a comma and read:

    Lines <- readLines("myfile.dat")
    pat2 <- "^( *(\\S+ +){2})((\\S+ +){7}\\S+) *$"
    g <- gsub(" +", ",", sub(pat2, "\\1,\\3", Lines))
    
    read.table(text = g, sep = ",", as.is = TRUE)
    

    REVISED Additional solutions.