I have a simple binary structure with a few data types that repeats, which I need to read efficiently in R. For example, an integer icount, followed by a structure {a integer, b real} that repeats icount times. Consider this simple file written by Python:
# Python -- this is not my question, it just makes data for my question
from struct import pack

with open('foo.bin', 'wb') as fp:
    icount = 123456
    fp.write(pack('i', icount))
    for i in range(icount):
        fp.write(pack('if', i, i * 100.0))
(You can download this <1 MB file if you don't want to generate it.)
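As a sanity check on the layout, here is a minimal sketch that writes a small file in the same format and reads it back. The helper names write_records and read_records and the file name demo.bin are made up for illustration. It assumes the native int and float are 4 bytes each, which is what my R code relies on.

```python
# Python -- layout check only, not part of my question
import os
import tempfile
from struct import calcsize, pack, unpack_from

HEADER_FMT = 'i'   # icount: one native int
RECORD_FMT = 'if'  # one record: int a, float b


def write_records(path, icount):
    """Write the count header followed by icount (a, b) records."""
    with open(path, 'wb') as fp:
        fp.write(pack(HEADER_FMT, icount))
        for i in range(icount):
            fp.write(pack(RECORD_FMT, i, i * 100.0))


def read_records(path):
    """Read the file back and return (icount, list of (a, b) tuples)."""
    with open(path, 'rb') as fp:
        data = fp.read()
    (icount,) = unpack_from(HEADER_FMT, data, 0)
    header_size = calcsize(HEADER_FMT)
    rec_size = calcsize(RECORD_FMT)
    records = [
        unpack_from(RECORD_FMT, data, header_size + k * rec_size)
        for k in range(icount)
    ]
    return icount, records


# Hypothetical small demo file in a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
write_records(demo_path, 5)
icount, records = read_records(demo_path)
print(calcsize(HEADER_FMT), calcsize(RECORD_FMT))  # header and record size in bytes
print(icount, records[:3])
```

On a typical platform this should report a 4-byte header and 8-byte records, so the full foo.bin would be 4 + 8 * 123456 = 987652 bytes.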
To read this file into R, I can use readBin in a for-loop, but it is painfully slow (as expected):
# R
fp <- file("foo.bin", "rb")
icount <- readBin(fp, "integer", size=4)
df <- data.frame(a=integer(icount), b=numeric(icount))
# Read one record at a time: a 4-byte integer, then a 4-byte float
for (i in seq_len(icount)) {
  df$a[i] <- readBin(fp, "integer", size=4)
  df$b[i] <- readBin(fp, "numeric", size=4)
}
close(fp)
I would like to know of a more efficient method to read a non-uniform binary structure into a data.frame (or similar structure). I know that for-loops in R should be avoided where possible.
I found a fast-running workaround, which is to read the whole block of structure data as "raw", then slice the parts out to interpret the structure. Let me demonstrate:
fp <- file("foo.bin", "rb")
icount <- readBin(fp, "integer", size=4)
rec_size = 4 + 4 # int is 4 bytes + float is 4 bytes
raw <- readBin(fp, "raw", n=icount * rec_size)
close(fp)
# Interpret raw bytes using specifically tailored slices for the structure
# Record k (0-based) starts at byte k * rec_size; a is bytes 1:4, b is bytes 5:8
raw_sel_a <- rep(0:(icount - 1), each=4) * rec_size + 1:4
raw_sel_b <- rep(0:(icount - 1), each=4) * rec_size + 1:4 + 4
df <- data.frame(
  a = readBin(raw[raw_sel_a], "integer", size=4, n=icount),
  b = readBin(raw[raw_sel_b], "numeric", size=4, n=icount))
The tricky part is constructing the raw_sel index vectors that slice the relevant bytes out of the raw structure. This is simple for this example, since each data member is 4 bytes. However, I can imagine this being more difficult with more complex data structures.
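To illustrate what those slices need for a less uniform record, here is a rough sketch on the Python side that asks struct for each member's offset and size and turns them into the 1-based byte positions R would use. The helpers field_layout and r_byte_index are hypothetical names of mine. The sketch assumes the file is written in struct's native mode with one single-character format code per member.

```python
# Python -- hypothetical helpers, not part of my question
from struct import calcsize


def field_layout(fmt):
    """Return (code, offset, size) for each member of a native-mode format.

    Assumes one single-character format code per member (no repeat counts).
    """
    layout = []
    for k, code in enumerate(fmt):
        size = calcsize(code)
        # The size of the prefix includes any alignment padding before this member
        offset = calcsize(fmt[:k + 1]) - size
        layout.append((code, offset, size))
    return layout


def r_byte_index(offset, size, rec_size, n):
    """1-based byte positions of one member across n records, as R indexes them."""
    return [k * rec_size + offset + j
            for k in range(n)
            for j in range(1, size + 1)]


layout = field_layout('if')
rec_size = calcsize('if')
# Positions of a and b for the first two records only, to keep the output short
sel_a = r_byte_index(layout[0][1], layout[0][2], rec_size, 2)
sel_b = r_byte_index(layout[1][1], layout[1][2], rec_size, 2)
print(layout)
print(sel_a)
print(sel_b)
```

For the 'if' record these positions should match what the hand-written raw_sel_a and raw_sel_b arithmetic produces. For records with mixed member sizes the offsets would also account for alignment padding, which is the part I find hard to get right by hand.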