Search code examples
rflagsdbf

Reading .DBF files with deleted records in R


I want to read a .dbf file in R that contains "deleted" rows. However, I don´t know how to avoid reading all records. I am using the package "foreign" and the function read.dbf.

According to the .dbf file description, each record begins with a 1-byte "deletion" flag. The byte's value is a space (0x20), if the record is active, or an asterisk (0x2A), if the record is deleted.

How can I extract this information using R? E.g. for a small sample of the iris data set saved as a .dbf file:

library(foreign)
dbf.file <- 'iris.dbf'
write.dbf(iris[1:5, ], file=dbf.file)

Solution

  • We can use the readBin() function to read the .dbf file as binary data.

    # read binary data
    bin <- readBin(dbf.file, what='raw', n=file.size(dbf.file))
    

    Then, based on the .dbf format description, we can read information necessary to navigate ourselves to the first byte of each record. I use a custom function to convert the appropriate bytes from the .dbf header into an unsigned integer.

    bin_to_int <- function(bin) {
      if (.Platform$endian == 'big') {
        bin <- rev(bin)
      }
      sum(2^(which(as.logical(rawToBits(bin))) - 1))
    }
    
    # number of records
    n <- bin_to_int(bin[5:8])
    # numer of bytes in the header
    header.len <- bin_to_int(bin[9:10])
    # numer of bytes in the record
    record.len <- bin_to_int(bin[11:12])
    

    With these, it is possible to compute what are the records' first bytes and see if they mark the record as deleted or not.

    # first bytes of records
    record.starts <- header.len + 1 + (seq_len(n)-1) * record.len
    is.deleted <- bin[record.starts] == 0x2A
    is.deleted
    # [1] FALSE FALSE FALSE FALSE FALSE
    

    Indeed, none of the records were marked as deleted so we can at least check if the bytes hold the expected value of 0x20:

    bin[record.starts]
    # [1] 20 20 20 20 20
    

    On a side note, from documentation it is not clear how read.dbf() treats the deleted records so chances are it ignores them and you won't have to deal with this issue at all. It would be interesting to know this so please let us know in the comments.