r binaryfiles

Is there a way to swap bytes to read binary DEC format?


I have old binary files written in what was called the 'DEC' format. To get the correct value for a 4-byte floating-point number from this format, I can do the following:

  1. read the bytes
  2. swap the first two bytes with the last two bytes (swap word 1 and word 2)
  3. use readBin() to convert bytes to numeric
  4. divide this value by 4

I thought the endian argument of readBin() [c('little', 'big', 'swap')] would take care of this, but that does not seem to be the case. Here is an example and some code that shows the current workaround.

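# Start with actual value from sample file:
# 4 bytes representing target value of 1.290
# In practice dec_bytes is read in by readBin(con, raw(), n=4).
# NOTE: writeBin() emits IEEE bytes, so the line below is only a stand-in;
# the values printed further down appear to come from the real file bytes.
dec_bytes <- writeBin(1.290, raw(), size=4)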
# Now rearrange bytes swapping words
pc_bytes <- c(dec_bytes[3], dec_bytes[4], dec_bytes[1], dec_bytes[2])
# Now use readBin to give numeric value of bytes
pc_float <- readBin(pc_bytes, numeric(), n=1, size=4)
pc_float 
# [1] 0.5161456
# Now divide by 4 to get the correct answer
pc_float <- pc_float / 4
pc_float 
#[1] 0.1290364
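
A small experiment suggests why the endian argument does not cover this case: it selects between two complete byte orders, whereas the DEC layout swaps the two 16-bit words and keeps the byte order inside each word. The sketch below uses made-up bytes 01 02 03 04 (not data from a real file) and reads them as an integer so the resulting byte order is easy to see in hex.

```r
# Four illustrative bytes (hypothetical, not taken from a DEC file)
b <- as.raw(c(0x01, 0x02, 0x03, 0x04))

# The endian argument chooses between two full byte orders
little <- readBin(b, integer(), n = 1L, size = 4L, endian = "little")
big    <- readBin(b, integer(), n = 1L, size = 4L, endian = "big")

# A word swap exchanges the two 2-byte halves but keeps each half intact
word_swapped <- readBin(b[c(3L, 4L, 1L, 2L)], integer(), n = 1L, size = 4L,
                        endian = "little")

sprintf("%08x", c(little, big, word_swapped))
# [1] "04030201" "01020304" "02010403"
```

The third result matches neither of the first two, so no endian setting alone reproduces the word swap.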

I can obviously wrap the steps above in a function, but the actual question is: is there an easier and more efficient way to do this? In some C code I either wrote or found about 30 years ago, I used the following function, which I can only assume actually worked:

float ConvertDecToFloat(char bytes[4])
{
    char p[4];
    p[0] = bytes[2];
    p[1] = bytes[3];
    p[2] = bytes[0];
    p[3] = bytes[1];
    if (p[0] || p[1] || p[2] || p[3])
        --p[3];          // adjust exponent

    return *(float*)p;
}

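So --p[3] subtracts 1 from the last byte after rearranging, which gives the correct answer without having to divide by 4: that byte holds the sign bit and the top seven exponent bits, so decrementing it lowers the exponent by 2. I am not sure whether this can be done in R without converting to integer and back to byte.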
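
One way the byte adjustment might be written in R is sketched below. It does go through as.integer() and as.raw(), but only on the single exponent byte of each value. The function name dec_to_float is hypothetical, and the test bytes are synthetic: they are built on the assumption, taken from the description above, that DEC bytes read as IEEE come out four times too large.

```r
# Hypothetical helper: word-swap DEC-ordered bytes and decrement the
# exponent byte, mirroring the C function above.
dec_to_float <- function(dec_bytes) {
  stopifnot(is.raw(dec_bytes), length(dec_bytes) %% 4L == 0L)
  # One column per 4-byte value; reorder rows to swap the two words
  m <- matrix(dec_bytes, nrow = 4L)[c(3L, 4L, 1L, 2L), , drop = FALSE]
  # Leave all-zero values untouched, as the C code does
  nonzero <- colSums(m != as.raw(0L)) > 0L
  # Row 4 holds the sign bit and the top 7 exponent bits
  top <- as.integer(m[4L, nonzero])
  m[4L, nonzero] <- as.raw((top - 1L) %% 256L)
  readBin(as.vector(m), numeric(), n = ncol(m), size = 4L, endian = "little")
}

# Synthetic DEC-style bytes (assumption: IEEE bytes of 4 * value, word-swapped)
ieee <- writeBin(c(1, 12, 0) * 4, raw(), size = 4L, endian = "little")
dec  <- as.vector(matrix(ieee, nrow = 4L)[c(3L, 4L, 1L, 2L), ])
dec_to_float(dec)  # gives 1, 12 and 0
```

Passing endian = "little" explicitly keeps the result independent of the host platform.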


Solution

  • Answered by a colleague (thanks to Michael Schwartz). A simple vectorized solution is to create a vector of indices used to reorder the byte vector. I have two working solutions:

    # Test on 6 values written as 4-byte floats (a 24-byte raw vector)
    values <- c(1, 12, 123, 1234, 12345, 123456)
    pc_bytes0 <- writeBin(values, raw(), size = 4)
    
    # Need to shuffle the byte order to reproduce DEC order
    # using same procedure we will use to unshuffle
    
    # Swapping needed to convert from PC to DEC byte order
    # DEC byte 1 -> 3, 2 -> 4, 3 -> 1, 4 -> 2
    byte_adjust <- rep(c(2, 2, -2, -2), 6) 
    # Original index order
    pc_byte_index <- seq_len(24) # original byte order
    # New index order for DEC data storage, add adjustment vector
    dec_byte_index <- pc_byte_index + byte_adjust
    # Now reshuffle the original data using the index to get the DEC order
    dec_bytes <- pc_bytes0[dec_byte_index]
    # This is what readBin(con, raw()) will return from a DEC file,
    # so the actual process starts here.
    # Note: To get the true DEC byte array we would also have to add 1
    # to the 2nd byte in each 4-byte sequence (the reverse of --p[3])
    
    # Approach 1, make a long vector of original byte order and another of offsets
    # and add together
    # Data is in DEC sequence, so make vector of original order
    dec_byte_index <- seq_len(24) # original byte index order
    # These are the index offsets needed
    byte_adjust <- rep(c(2, 2, -2, -2), 6)
    # Offset original order by adding 
    pc_byte_index <- dec_byte_index + byte_adjust
    # Apply PC byte order to data
    pc_bytes <- dec_bytes[pc_byte_index]
    # Now the data can be read in the correct order; real DEC data
    # would still need the divide-by-4 correction
    pc_float <- readBin(pc_bytes, double(), n=6, size=4)
    pc_float 
    #> pc_float 
    #[1]      1     12    123   1234  12345 123456
    
    # Approach 2, use single index, reshape to matrix and apply 
    # index representing desired order of 4 original bytes
    byte_index <- c(3, 4, 1, 2)
    # Convert data to matrix 
    dec_byte_matrix <- matrix(dec_bytes, nrow=4, ncol=6)
    # Use indices to swap
    pc_bytes <- dec_byte_matrix[byte_index, ]
    # Now compute floats
    pc_float <- readBin(pc_bytes, double(), n=6, size=4)
    #> pc_float 
    #[1]      1     12    123   1234  12345 123456
    

    I tested with microbenchmark and there is no discernible difference in processing time between these two approaches. Note that with original DEC data, pc_float needs to be divided by 4 to get the correct answer unless the exponent byte adjustment is done instead.
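
Putting the pieces together, an end-to-end reader might look like the sketch below. It uses the matrix reordering from Approach 2 followed by the divide-by-4 correction. The name read_dec_floats is hypothetical, and the temporary file holds synthetic bytes built under the same assumption as the question (DEC bytes read as IEEE are four times too large), not data from a real DEC file.

```r
# Hypothetical helper: read up to n DEC-format 4-byte floats from a
# binary connection opened for reading.
read_dec_floats <- function(con, n) {
  dec_bytes <- readBin(con, raw(), n = 4L * n)
  # Drop any incomplete trailing value
  n_read <- length(dec_bytes) %/% 4L
  dec_bytes <- dec_bytes[seq_len(4L * n_read)]
  # Swap the two words of every value in one step
  m <- matrix(dec_bytes, nrow = 4L)[c(3L, 4L, 1L, 2L), , drop = FALSE]
  # Divide by 4 to correct the exponent
  readBin(as.vector(m), numeric(), n = n_read, size = 4L, endian = "little") / 4
}

# Synthetic test file (assumption: IEEE bytes of 4 * value, word-swapped)
values <- c(1, 12, 123, 1234, 12345, 123456)
ieee <- writeBin(values * 4, raw(), size = 4L, endian = "little")
dec  <- as.vector(matrix(ieee, nrow = 4L)[c(3L, 4L, 1L, 2L), ])
path <- tempfile(fileext = ".bin")
writeBin(dec, path)

con <- file(path, "rb")
result <- read_dec_floats(con, length(values))
close(con)
unlink(path)
result  # equals values
```

Reading all bytes once and reordering them as a matrix avoids a per-value loop, which is the main point of both approaches above.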