
Read binary structure in R


I have a simple binary structure with a few repeating data types that I need to read efficiently in R: for example, an integer icount, followed by a structure {a integer, b real} that repeats icount times. Consider this simple file written by Python:

# Python -- this is not my question, it just makes data for my question
from struct import pack
with open('foo.bin', 'wb') as fp:
    icount = 123456
    fp.write(pack('i', icount))
    for i in range(icount):
        fp.write(pack('if', i, i * 100.0))

(You can download this <1 MB file if you don't want to generate it.)

To read this file into R, I can use readBin in a for-loop, but it is painfully slow (as expected):

# R
fp <- file("foo.bin", "rb")
icount <- readBin(fp, "integer", size=4)
df <- data.frame(a=integer(icount), b=numeric(icount))
for (i in seq_len(icount)) {
    df$a[i] <- readBin(fp, "integer", size=4)
    df$b[i] <- readBin(fp, "numeric", size=4)
}
close(fp)

I would like to know of a more efficient method to read a non-uniform binary structure into a data.frame (or similar). I know that for-loops should be avoided where possible.


Solution

  • I found a fast-running workaround: read the whole block of structure data as "raw", then slice out the bytes belonging to each field and interpret them separately. Let me demonstrate:

    fp <- file("foo.bin", "rb")
    icount <- readBin(fp, "integer", size=4)
    rec_size <- 4 + 4  # int is 4 bytes + float is 4 bytes
    raw <- readBin(fp, "raw", n=icount * rec_size)
    close(fp)
    
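    # Interpret raw bytes using specifically tailored slices for the structure.
    # Record offsets run from 0 to icount - 1; each record contributes
    # 4 consecutive bytes per field (1:4 is recycled across the records).
    raw_sel_a <- rep(0:(icount - 1), each=4) * rec_size + 1:4
    raw_sel_b <- rep(0:(icount - 1), each=4) * rec_size + 1:4 + 4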
    df <- data.frame(
        a = readBin(raw[raw_sel_a], "integer", size=4, n=icount),
        b = readBin(raw[raw_sel_b], "numeric", size=4, n=icount))
    

    The tricky part is constructing the raw_sel_a and raw_sel_b index vectors that slice the relevant bytes out of the raw data. This is simple for this example, since each data member is 4 bytes. However, I could imagine this being more difficult with complex data structures.
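
For records whose fields differ in size, one way to avoid hand-building the index vectors is to reshape the raw bytes into a matrix with one column per record, so that each field becomes a contiguous block of rows. The sketch below is only an illustration under stated assumptions: the layout (a 4-byte integer, an 8-byte double, a 2-byte integer), the names field_sizes and rows, and the sample data built in memory with writeBin are all hypothetical and not part of the original file. It also assumes the records are packed with no padding and use the platform's byte order, which is the default for both writeBin and readBin.

```r
# Hypothetical record layout: a = int32, b = float64, c = int16 (no padding)
field_sizes <- c(a = 4L, b = 8L, c = 2L)
rec_size <- sum(field_sizes)
n <- 5L

# Build sample data in memory: serialise each field, then interleave per record.
# rbind() stacks the per-field byte blocks; as.vector() flattens column by column,
# which yields record 1's bytes, then record 2's bytes, and so on.
bytes <- as.vector(rbind(
  matrix(writeBin(seq_len(n), raw(), size = 4), nrow = 4),
  matrix(writeBin(seq_len(n) * 100, raw(), size = 8), nrow = 8),
  matrix(writeBin(-seq_len(n), raw(), size = 2), nrow = 2)
))

# Decode: one column per record, so each field occupies a fixed range of rows.
m <- matrix(bytes, nrow = rec_size)
offsets <- unname(cumsum(c(0L, field_sizes)))  # byte offset where each field starts
rows <- function(k) (offsets[k] + 1):offsets[k + 1]

df <- data.frame(
  a = readBin(as.vector(m[rows(1), ]), "integer", size = 4, n = n),
  b = readBin(as.vector(m[rows(2), ]), "numeric", size = 8, n = n),
  c = readBin(as.vector(m[rows(3), ]), "integer", size = 2, n = n)
)
```

With this arrangement, adding a field only means adding an entry to field_sizes and one more readBin call. If the writer pads fields for alignment (Python's struct module can do this in native mode for mixed sizes), the padding bytes would presumably need to be counted as an extra, ignored entry in field_sizes.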