
Reading and parsing a large .dat file


I am trying to parse a huge .dat file (4 GB). I have tried with R, but it just takes too long. Is there a way to parse a .dat file in segments, for example every 30000 lines? Any other solutions would also be welcome. This is what it looks like:

These are the first two lines with the header (the <br/> tags are literally present in the raw file):

    ST|ZIPCODE|GEO_ID|GEO_TTL|FOOTID_GEO|NAICS2012|NAICS2012_TTL|FOOTID_NAICS|YEAR|EMPSZES|EMPSZES_TTL|ESTAB|ESTAB_F <br/>
    01|35004|8610000US35004|35004(MOODY,AL)||00|Total for all sectors||2012|001|All establishments|167| <br/>
    01|35004|8610000US35004|35004(MOODY,AL)||00|Total for all sectors||2012|212|Establishments with 1 to 4 employees|91|


Solution

  • One option is to read the data much faster in R by using the fread function from the data.table package.

    EDIT

    I removed all the <br/> tags that marked line breaks. This is the edited dataset:

    ST|ZIPCODE|GEO_ID|GEO_TTL|FOOTID_GEO|NAICS2012|NAICS2012_TTL|FOOTID_NAICS|YEAR|EMPSZES|EMPSZES_TTL|ESTAB|ESTAB_F
    01|35004|8610000US35004|35004(MOODY,AL)||00|Total for all sectors||2012|001|All establishments|167| 
    01|35004|8610000US35004|35004(MOODY,AL)||00|Total for all sectors||2012|212|Establishments with 1 to 4 employees|91| 
    

    Then I matched each variable with its class. Use nrows of about 100 so the classes are inferred reliably:

    colclasses <- sapply(read.table(edited_data, nrows=100, sep="|", header=TRUE), class)
    

    Then I read the edited data.

    your_data <- fread(edited_data, sep="|", header=TRUE, na.strings="NA",
                       stringsAsFactors=FALSE, colClasses=colclasses)
    

    Everything worked like a charm. If you have trouble removing the tags, use this simple Python script (it will take a while on a 4 GB file):

    original_file = file_path_to_original_file  # e.g. "/Users/User/file.dat"
    edited_file = file_path_to_new_file         # e.g. "/Users/User/file_edited.dat"

    with open(original_file) as inp, open(edited_file, "w") as op:
        for line in inp:
            op.write(line.replace("<br/>", ""))
    
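    Since the question also asks about handling the file in segments of 30000 lines, the same streaming loop can split the cleaned output into numbered chunk files as it goes. This is a hypothetical sketch, not part of the original answer; the clean_and_split name and the chunk-file naming scheme are my own:

    ```python
    # Hypothetical sketch: strip the "<br/>" tags and split the cleaned
    # output into numbered files of at most `chunk_size` lines each.
    def clean_and_split(original_file, chunk_prefix, chunk_size=30000):
        chunk_idx, line_count, out = 0, 0, None
        with open(original_file) as inp:
            for line in inp:
                if out is None:
                    # Start a new chunk file, e.g. "prefix_0.dat", "prefix_1.dat", ...
                    out = open(f"{chunk_prefix}_{chunk_idx}.dat", "w")
                out.write(line.replace("<br/>", ""))
                line_count += 1
                if line_count == chunk_size:
                    out.close()
                    out, line_count = None, 0
                    chunk_idx += 1
        if out is not None:
            out.close()
    ```

    Each chunk file is then small enough to read individually with fread or read.table.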

    P.S.

    You can use read.table with similar optimizations, but it won't give you nearly as much speed.
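    If you would rather parse the cleaned file in 30000-line segments without loading it all at once, a standard-library Python sketch (hypothetical, not from the original answer; the iter_chunks name is my own) could look like this:

    ```python
    import csv
    from itertools import islice

    # Parse a pipe-delimited file in segments of `chunk_size` rows,
    # yielding (header, rows) pairs so memory use stays bounded.
    def iter_chunks(path, chunk_size=30000):
        with open(path, newline="") as f:
            reader = csv.reader(f, delimiter="|")
            header = next(reader)  # first line holds the column names
            while True:
                chunk = list(islice(reader, chunk_size))
                if not chunk:
                    break
                yield header, chunk
    ```

    Each yielded chunk can then be processed and discarded before the next one is read.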