Tags: python, r, binary-files

R readBin vs. Python struct


I am attempting to read a binary file using Python. Someone else has read in the data with R using the following code:

x <- readBin(webpage, numeric(), n = 6e8, size = 4, endian = "little")
myPoints <- data.frame("tmax" = x[1:(length(x)/4)],
                       "nmax" = x[(length(x)/4 + 1):(2*(length(x)/4))],
                       "tmin" = x[(2*length(x)/4 + 1):(3*(length(x)/4))],
                       "nmin" = x[(3*length(x)/4 + 1):(length(x))])

With Python, I am trying the following code:

import struct

with open('file', 'rb') as f:
    val = f.read(16)
    while val != b'':  # f.read() returns bytes in Python 3, so compare with b''
        print(struct.unpack('4f', val))
        val = f.read(16)

I am getting slightly different results. For example, the first row in R returns four columns: -999.9, 0, -999.0, 0. Python returns -999.0 for all four columns (images below).

Python output: (screenshot omitted)

R output: (screenshot omitted)

I can see that the R code slices the vector into pieces based on the length of the file with the [] indexing, but I do not know how to do this in Python, nor do I quite understand why they do it. Basically, I want to recreate what R is doing in Python.

I can provide more of either code base if needed. I did not want to overwhelm with code that was not necessary.
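To illustrate the likely cause of the mismatch (my own sketch, assuming the file stores each variable's values contiguously, as the R slicing suggests): reading 16 bytes at a time yields four consecutive values of the same variable, not one value from each of the four variables.

```python
import struct

# Hypothetical column layout: all tmax values first, then nmax, tmin, nmin
# (two points per variable; whole numbers so float32 round-trips exactly).
raw = struct.pack("<8f", 10.0, 11.0, 20.0, 21.0, 30.0, 31.0, 40.0, 41.0)

# The question's loop unpacks the first 16 bytes as one "row"...
chunk = struct.unpack('<4f', raw[:16])   # (10.0, 11.0, 20.0, 21.0)

# ...but the first row of the R data frame takes one value from each quarter.
x = struct.unpack('<8f', raw)
row = (x[0], x[2], x[4], x[6])           # (10.0, 20.0, 30.0, 40.0)
```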


Solution

  • Here's a less memory-hungry way to do the same. It is possibly a bit faster too (though that is difficult for me to check).

    My computer did not have sufficient memory to run the first program with those huge files. This one does, but I still needed to build a list of only the tmax's first (the first quarter of the file), print it, and then delete the list in order to have enough memory for the nmax's, tmin's and nmin's.

    But this version too says the nmin's inside the 2018 file are all -999.0. If that doesn't make sense, could you check what the R code makes of that file? I suspect it is simply what's in the file. The other possibility is, of course, that I got it all wrong (which I doubt). However, I tried the 2017 file too, and it does not have this problem: tmax, nmax, tmin and nmin each contain around 37% -999.0's.

    Anyway, here's the second code:

    import os
    import struct
    
    # load_data()
    #   data_store : object to append() data items (floats) to
    #   num        : number of floats to read and store
    #   datafile   : opened binary file object to read float data from
    #
    def load_data(data_store, num, datafile):
        for i in range(num):
            data = datafile.read(4)  # process one float (=4 bytes) at a time
            item = struct.unpack("<f", data)[0]  # '<' means little endian
            data_store.append(item) 
    
    # save_list() saves a list of float's as strings to a file
    #
    def save_list(filename, datalist):
        with open(filename, "wt") as output:
            for item in datalist:
                output.write(str(item) + '\n')
    
    #### MAIN ####
    
    datafile = open('data.bin','rb')
    
    # Get file size so we can calculate number of points without reading
    # the (large) file entirely into memory.
    #
    file_info = os.stat(datafile.fileno())
    
    # Calculate number of points, i.e. number of each tmax's, nmax's,
    # tmin's, nmin's. A point is 4 floats of 4 bytes each, hence number
    # of points = file-size / (4*4)
    #
    num = file_info.st_size // 16  # floor division keeps it an int
    
    tmax_list = list()
    load_data(tmax_list, num, datafile)
    save_list("tmax.txt", tmax_list)
    del tmax_list   # huge list, save memory
    
    nmax_list = list()
    load_data(nmax_list, num, datafile)
    save_list("nmax.txt", nmax_list)
    del nmax_list   # huge list, save memory
    
    tmin_list = list()
    load_data(tmin_list, num, datafile)
    save_list("tmin.txt", tmin_list)
    del tmin_list   # huge list, save memory
    
    nmin_list = list()
    load_data(nmin_list, num, datafile)
    save_list("nmin.txt", nmin_list)
    del nmin_list   # huge list, save memory
    
    datafile.close()
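As an aside (not part of the original answer), if NumPy is available, the same column layout can be read far more compactly with `numpy.fromfile`. A hedged sketch, which writes a tiny stand-in for `data.bin` first so it is self-contained:

```python
import struct
import numpy as np

# Write a tiny stand-in for data.bin in the same layout: all tmax values,
# then nmax, tmin, nmin (two points per variable).
with open("data.bin", "wb") as f:
    f.write(struct.pack("<8f", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0))

# Read the whole file as little-endian 32-bit floats and reshape into
# four rows, one per variable.
x = np.fromfile("data.bin", dtype="<f4")
tmax, nmax, tmin, nmin = x.reshape(4, -1)
```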