Search code examples
rdata.tabledelimiterspaceread.table

Reading text file with varying column width but fixed delimiter in R


I have multiple .txt files which look like this:

header
header
header
header
header
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\AAA AAAAAAAA\AAAAA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther Downg   Gray Fox                                                                           
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\AAA AAAAAAAA\AAAAA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther Downg   Direct Register Walk, Gait, Gray Fox, Stop                                         
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\AAA AAAAAAAA\AAAAA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther Downg   Gray Fox   

The width of the last 2 columns varies, but there is always 3 spaces between all the columns (3rd column is empty in this case).

I'm using this code to read in the example .txt:

read.fwf(filename.txt,skip=5,widths=c(12,16,19,76,83),fill=T,fileEncoding = "UTF-16")

But this code won't work properly on this .txt:

header
header
header
header
header
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\AAA AAAAAAAA\AAAAA AA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther DowngBBB   Gray Fox                                                                           
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\AAA AAAAAAAA\AAAAA AA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther DowngBBB   Direct Register Walk, Gait, Gray Fox, Stop                                         
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\AAA AAAAAAAA\AAAAA AA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther DowngBBB   Gray Fox   

Is there a way to read in a .txt file with a fixed deliminator (3 spaces) instead of having to define the width of each column, since the column width varies between files.

The files also have some issues with encoding, so here is the example file I'm using


Solution

  • I don't know if there are good tools that look for multi-char delimiters, and you aren't the first to ask about it. Most (incl read.table, read.delim, and readr::read_delim) require a single-byte separator.

    One method, though certainly not efficient for large files, is to load them in line-wise and do the splitting yourself.

    (Consumable data that the bottom.)

    x <- readLines(textConnection(file1))
    x <- x[x != 'header'] # or x <- x[-(1:5)]
    

    (I'm guessing it isn't always the literal header, so I'm assuming it's either a fixed count or you can easily "know" which is which.)

    spl <- strsplit(x, '   ')
    str(spl)
    # List of 3
    #  $ : chr [1:31] "01130009.JPG" "JPEG" "" "" ...
    #  $ : chr [1:20] "01130009.JPG" "JPEG" "" "" ...
    #  $ : chr [1:7] "01130009.JPG" "JPEG" "" "" ...
    

    This seems ok, except that in your examples, there are lots of blanks on the right ...

    spl[[1]]
    #  [1] "01130009.JPG"                                                                
    #  [2] "JPEG"                                                                        
    #  [3] ""                                                                            
    #  [4] ""                                                                            
    #  [5] "2/5/2018 3:53:44 PM"                                                         
    #  [6] "G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg"
    #  [7] "Gray Fox"                                                                    
    #  [8] ""                                                                            
    #  [9] ""                                                                            
    # [10] ""                                                                            
    # [11] ""                                                                            
    # [12] ""                                                                            
    # [13] ""                                                                            
    # [14] ""                                                                            
    # [15] ""                                                                            
    # [16] ""                                                                            
    # [17] ""                                                                            
    # [18] ""                                                                            
    # [19] ""                                                                            
    # [20] ""                                                                            
    # [21] ""                                                                            
    # [22] ""                                                                            
    # [23] ""                                                                            
    # [24] ""                                                                            
    # [25] ""                                                                            
    # [26] ""                                                                            
    # [27] ""                                                                            
    # [28] ""                                                                            
    # [29] ""                                                                            
    # [30] ""                                                                            
    # [31] ""                                                                            
    

    So if you know how many columns there are, then you can easily remove extras:

    spl <- lapply(spl, `[`, 1:7)
    

    and then check the output:

    as.data.frame(do.call(rbind, spl), stringsAsFactors = FALSE)
    #             V1   V2 V3 V4                  V5
    # 1 01130009.JPG JPEG       2/5/2018 3:53:44 PM
    # 2 01130009.JPG JPEG       2/5/2018 3:53:44 PM
    # 3 01130009.JPG JPEG       2/5/2018 3:53:44 PM
    #                                                                             V6
    # 1 G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg
    # 2 G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg
    # 3 G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg
    #                                           V7
    # 1                                   Gray Fox
    # 2 Direct Register Walk, Gait, Gray Fox, Stop
    # 3                                   Gray Fox
    

    This works equally well with your second example:

    x <- readLines(textConnection(file2))
    x <- x[x != 'header'] # or x <- x[-(1:5)]
    spl <- lapply(strsplit(x, '   '), `[`, 1:7)
    as.data.frame(do.call(rbind, spl), stringsAsFactors = FALSE)
    #             V1   V2 V3 V4                  V5
    # 1 01130009.JPG JPEG       2/5/2018 3:53:44 PM
    # 2 01130009.JPG JPEG       2/5/2018 3:53:44 PM
    # 3 01130009.JPG JPEG       2/5/2018 3:53:44 PM
    #                                                                                   V6
    # 1 G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB
    # 2 G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB
    # 3 G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB
    #                                           V7
    # 1                                   Gray Fox
    # 2 Direct Register Walk, Gait, Gray Fox, Stop
    # 3                                   Gray Fox
    

    Consumable data:

    # note: replaced single '\' with double '\\' for R string-handling only
    file1 <- 'header
    header
    header
    header
    header
    01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg   Gray Fox                                                                           
    01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg   Direct Register Walk, Gait, Gray Fox, Stop                                         
    01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg   Gray Fox   '
    file2 <- 'header
    header
    header
    header
    header
    01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB   Gray Fox                                                                           
    01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB   Direct Register Walk, Gait, Gray Fox, Stop                                         
    01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB   Gray Fox   '