Search code examples
rreadlines

Read and parse an irregular and mixed ASCII file into R


I'm still very new to R and I apologize if I'm not using the proper terminology. I'm interested in pulling a large amount of Unemployment Insurance Trust Fund data from the Treasury Direct online report query system (http://www.treasurydirect.gov/govt/reports/tfmp/tfmp_utf.htm) and I've successfully pulled the information using readLines.

ESAA_OCT15 <- readLines('http://www.treasurydirect.gov/govt/reports/tfmp/utf/es/dfiw01015tses.txt')

which gives me the chart as a string vector.

Is there a way to then parse the lines and turn it into a data frame so I can at least put it excel and easily get the important information out of it? I'm sure there is another way to do this as well but the reports will always vary in what accounting code sections are included and how many individual transactions are included so I'm not even sure where to begin with that.

The items I need are the date, the share/par (dollar transaction amount), the transaction code, and the transaction description. The totals would be useful but are by no means necessary.

When you look at it using Excel it looks like enter image description here


Solution

  • This will help you parse the information:

    ESAA_OCT15 <- readLines('http://www.treasurydirect.gov/govt/reports/tfmp/utf/es/dfiw01015tses.txt')
    # Select lines with /
    z = grepl(pattern = "/",x = ESAA_OCT15)
    d = trimws(ESAA_OCT15[z])
    
    dates = substr(d,0,10)
    sharesPar = substr(d,11,41)
    

    What this does is first select all lines that contain a / character. This will even return the column titles. These are stored in d.

    If you examine d:

    [1] "Effective Date                 Shares/Par  Description Code           Memo Number    Code      Account Number"
     [2] "10/01/2015                 2,313,000.0000  12-10 FUTA RECEIPTS         3305617                 ESAA"          
     [3] "10/01/2015                 3,663,000.0000  12-10 FUTA RECEIPTS         3305618                 ESAA"          
     [4] "10/02/2015                 4,314,000.0000  12-10 FUTA RECEIPTS         3305640                 ESAA"          
     [5] "10/05/2015                 3,512,000.0000  12-10 FUTA RECEIPTS         3305662                 ESAA"
    

    The information is aligned neatly. This means that the data of each column ends at a precise position. To parse this you can use substr with start and stop as shown in my script.

    Of course, I did not complete all parses, I'll let you finish the rest. Once each column is parsed, create a data.frame(dates, sharesPar, ...)