Tags: sql, r, data.table, ff

Reading very large fixed(ish) width format txt files from SQL Server Export into R data.tables or likewise


I'm trying to read in (and eventually merge/link/manipulate) a series of large (~300M) and very large (~4G) fixed width files for eventual regressions, visualizations, etc., and am hitting some snags.

First, the format of the files themselves is odd - I'm guessing something SQL-y. The file format is referenced here: https://msdn.microsoft.com/en-us/library/ms191479.aspx . It's fixed width, but the last column sometimes ends early with an \r\n before reaching its full fixed width. For reading it in I've tried LaF::laf_open_fwf and data.table::fread, but they both seem to get confused. A sample file and the associated non-XML format descriptor are here. I can't even get the thing to read in properly with that goofy last column. Here's a sample of the file:

1           1           7           7           ER
2           2           9           8           OI
3           54016       1988006     1953409     OI        
4           54017       1988014     1953415     ER        
5           54017       1988014     1953415     OB        

(but note that the CR/LF characters are invisible here, and it's their odd placement that is the issue. See the above link to the .txt file, or the .png (which I can't link, low rep) of a Notepad++ view of the data, to see the problem with the field.)
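To make the ragged last column concrete, here's a quick way to see where the \r\n actually falls. This is just a diagnostic sketch; sample.txt is a stand-in built from the sample rows above (real files would use the actual export):

```shell
# Build a stand-in file: the last field sometimes ends early, before its full width
printf '1           1           7           7           ER\r\n' > sample.txt
printf '3           54016       1988006     1953409     OI        \r\n' >> sample.txt

# Print each line's length; strip the CR first, since awk's default record
# separator is \n and the \r would otherwise count as part of the line
awk '{ sub(/\r$/, ""); print NR, length($0) }' sample.txt
```

Line 1 comes out 50 characters and line 2 comes out 58: four 12-wide columns plus a last column that is sometimes 2 characters and sometimes padded to its full 10.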

Second, file size is an issue. I know I have a lot of table manipulation to do, so I'm tempted to look at data.table... but I also believe data.table stores the entire object in RAM, and that's going to be problematic. LaF, ffdf, or SQLite seem like options, though I'm new to them and would need to cope with this file format issue first.

Some questions that get at this general idea (suggesting LaF, ffbase, or data.table) are below...

Reading big data with fixed width

Quickly reading very large tables as dataframes in R

Speed up import of fixed width format table in R

... but none seems to (1) deal with this odd fixed-width-ish format or (2) move the data eventually into data.tables, which is what I'd like to try first. I thought about opening and rewriting them as well-formatted CSVs so data.table could handle them (my goofy hack through data.frames and back to CSV, below, feels ridiculous and unscalable). And the CSV export demonstrates how confused the file gets, since the LaF reader goes strictly by field length instead of adjusting based on where the \r\n is...
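The rewrite-to-CSV idea could also be done outside R entirely, cutting each record by position with awk's substr() so the sometimes-short last field is harmless. A sketch only, using the widths from the format above (12, 12, 12, 12, 10) and the question's file name:

```shell
# Strip the CR, cut each record by byte position, and emit CSV.
# substr() past the end of a short line just returns what is there,
# so a last column cut short by \r\n causes no error.
awk '{
  sub(/\r$/, "")
  for (i = 0; i < 4; i++) {
    f = substr($0, i * 12 + 1, 12)   # four fixed 12-wide columns
    gsub(/ +$/, "", f)               # drop trailing padding
    printf "%s,", f
  }
  f = substr($0, 49)                 # last column: whatever remains
  gsub(/ +$/, "", f)
  print f
}' SEARCHBASIS.txt > SEARCHBASIS.csv
```

The resulting CSV can then be read with data.table::fread as usual.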

Currently I'm trying something like the below for starters. Help, if possible?

library(data.table)  # require() takes one package at a time; load each separately
library(LaF)
library(ffbase)
searchbasis.laf = laf_open_fwf("SEARCHBASIS.txt",
                               column_widths = c(12, 12, 12, 12, 10),
                               column_names = c("SearchBasisID", "SearchID", "PersonID", "StopID", "Basis"),
                               column_types = rep("string", 5),
                               trim = TRUE)
# ^ The laf_open_fwf quietly "fails" because the last column doesn't always
# have 10 chars, but sometimes ends short with \r\n after the element.
searchbasis.dt = as.data.table(as.data.frame(laf_to_ffdf(searchbasis.laf)))
write.csv(searchbasis.dt, file = "SEARCHBASIS.csv")
# ^ To take a look at the file. Confirms that the read from LaF and transfer
# to data.table is failing because of the last-column issue.

Solution

  • For this particular file:

    form <- read.table("SEARCHBASIS_format.txt", as.is = TRUE, skip = 2)
    x <- read.table("SEARCHBASIS.txt", col.names = form$V7, as.is = TRUE)
    

    If you sometimes have strings including spaces you'll almost certainly need to process the file externally first.

    If you're planning to read really large files I'd suggest (presuming you have awk on your path):

    x <- setNames(data.table::fread("awk '{$1=$1}1' SEARCHBASIS.txt"), form$V7)
    

    If you want to use fixed widths you could use:

    x <- setNames(fread("gawk 'BEGIN {OFS = \"\t\"; FIELDWIDTHS = \"12 12 12 12 12\"} {for (i = 1; i<= NF; i++) {gsub(/ +$/, \"\", $i);}}1' SEARCHBASIS.txt"), form$V7)
    

    You can also pull the widths from the format file:

    x <- setNames(fread(paste0("gawk 'BEGIN {OFS = \"\t\"; FIELDWIDTHS = \"", paste(form$V4, collapse = " "), "\"} {for (i = 1; i<= NF; i++) {gsub(/ +$/, \"\", $i);}}1' SEARCHBASIS.txt")), form$V7)
    

    Note $1=$1 forces awk to reevaluate the fields and the 1 at the end is effectively shorthand for print. I've also assumed you want to strip trailing spaces from each field.
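A quick way to see what that idiom does on any whitespace-delimited input:

```shell
# $1=$1 forces awk to rebuild the record with the (default single-space) OFS;
# the bare 1 is a true pattern with no action, so the default action, print, runs
printf 'a    b   c\n' | awk '{$1=$1}1'
# → a b c
```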

    On Windows you'll need to use single quotes in R and replace the single quotes within the command with " and the nested double quotes with "". So the last one above becomes:

    x <- setNames(fread(paste0('gawk \"BEGIN {OFS = ""\t""; FIELDWIDTHS = ""', paste(form$V4, collapse = " "), '""} {for (i = 1; i<= NF; i++) {gsub(/ +$/, """", $i);}}1" SEARCHBASIS.txt')), form$V7)
    

    For a cross-platform solution, you'll need to put your awk script in an external file:

    stripSpace.awk

    BEGIN {OFS="\t"} {for (i = 1; i<= NF; i++) {gsub(/ +$/, "", $i);}}1
    

    R code

    x <- setNames(fread(paste0('gawk -v FIELDWIDTHS="', paste(form$V4, collapse = " "), '" -f stripSpace.awk SEARCHBASIS.txt')), form$V7)
    

    Tested on Scientific Linux 6 and Windows 8.1
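As a small sanity check of the external-script route, the stripping behavior of stripSpace.awk can be exercised on its own. FIELDWIDTHS itself needs gawk, so this sketch feeds tab-split input to plain awk as a stand-in; the trailing-space trimming per field is the same:

```shell
cat > stripSpace.awk <<'EOF'
BEGIN {OFS="\t"} {for (i = 1; i<= NF; i++) {gsub(/ +$/, "", $i);}}1
EOF

# Two tab-separated fields with trailing padding; each comes out trimmed,
# rejoined with the tab OFS set in the BEGIN block
printf 'ER        \tOI   \n' | awk -F'\t' -f stripSpace.awk
```

The result is the two fields with their padding removed, still tab-separated, which is exactly the shape fread expects from the gawk pipelines above.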