Tags: python-2.7, dbf, dbase

BadDataError when editing a .dbf file using dbf package


I have recently produced several thousand shapefile outputs and accompanying .dbf files from an atmospheric model (HYSPLIT) on a Unix system. The txt2dbf converter was used to convert the shapefile attribute tables (plain-text files) to .dbf format.

Unfortunately, something has gone wrong (probably a separator/field length error) because there are 2 problems with the output .dbf files, as follows:

  1. Some fields of the dbf contain data that should not be there. This data has "spilled over" from neighbouring fields.
  2. An additional field has been added that should not be there (it actually comes from a section of the first record of the text file, "1000 201").

This is an example of the first record in the output dbf (retrieved using dbview unix package):

Trajnum : 1001 2
Yyyymmdd : 0111231 2
Time : 300
Level : 0.
1000 201:

Here's what I expected:

Trajnum : 1000
Yyyymmdd : 20111231
Time : 2300
Level : 0.

Separately, I'm looking at how to prevent this from happening again, but ideally I'd like to be able to repair the existing .dbf files. Unfortunately the text files are removed for each model run, so "fixing" the .dbf files is the only option.

My approaches to the above problems are:

  1. Extract the information from the fields that do exist to a new variable using dbf.add_fields and dbf.write (python package dbf), then delete the old incorrect fields using dbf.delete_fields.
  2. Delete the unwanted additional field.

This is what I've tried:

        with dbf.Table(db) as db:
            db.add_fields("TRAJNUMc C(4)")  # create new fields
            db.add_fields("YYYYMMDDc C(8)")
            db.add_fields("TIMEc C(4)")
            for record in db:  # extract data from fields
                dbf.write(TRAJNUMc=int(str(record.Trajnum)[:4]))
                dbf.write(YYYYMMDDc=int(str(record.Trajnum)[-1:] + str(record.Yyyymmdd)[:7]))
                dbf.write(TIMEc=record.Yyyymmdd[-1:] + record.Time[:])
            db.delete_fields('Trajnum')  # delete the incorrect fields
            db.delete_fields('Yyyymmdd')
            db.delete_fields('Time')
            db.delete_fields('1000 201')  # delete the unwanted field
            db.pack()

But this produces the following error:

dbf.ver_2.BadDataError: record data is not the correct length (should be 31, not 30)

Given the apparent problem with the txt2dbf conversion, I'm not surprised to find an error in the record data length. However, does this mean that the file is completely corrupted and that I can't extract the information that I need (frustrating, because I can see that it exists)?
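Before deciding a file is beyond recovery, it can help to compare what the header claims against the bytes actually on disk. This stdlib-only sketch (the function name is mine, not part of the `dbf` package) reads the three size fields from the standard dBASE III header:

```python
import struct

def inspect_dbf_header(path):
    """Return (num_records, header_len, record_len) as declared
    in the fixed 32-byte dBASE III header."""
    with open(path, 'rb') as f:
        header = f.read(32)
    # Standard dBASE III layout:
    #   bytes 4-7  : number of records (uint32, little-endian)
    #   bytes 8-9  : header length in bytes (uint16, little-endian)
    #   bytes 10-11: length of each record, incl. delete flag (uint16)
    return struct.unpack('<IHH', header[4:12])
```

If `header_len + num_records * record_len` does not match the file size (allowing for an optional trailing 0x1A end-of-file byte), the declared record length is wrong, which is consistent with the 31-vs-30 error above.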


EDIT:

Rather than attempting to edit the 'bad' .dbf files, it seems a better approach to 1. extract the required data from the bad files to a text file, and then 2. write it to a new dbf. (See Ethan Furman's comments/answer below.)


EDIT:

An example of a faulty .dbf file that I need to fix/recover data from can be found here:

https://www.dropbox.com/s/9y92f7m88a8g5y4/p0001120110.dbf?dl=0

An example .txt file from which the faulty dbf files were created can be found here:

https://www.dropbox.com/s/d0f2c0zehsyy8ab/attTEST.txt?dl=0


Solution

  • To fix the data and recreate the original text file, this snippet should help:

    import dbf
    
    table = dbf.Table('/path/to/scramble/table.dbf')
    with table:
        fixed_data = []
        for record in table:
            # convert to str/bytes while skipping the delete-flag byte
            data = record._data[1:].tostring()
            trajnum = data[:4]
            ymd = data[4:12]
            time = data[12:16]
            level = data[16:].strip()
            # keep each record's fields together as one row
            fixed_data.append([trajnum, ymd, time, level])
    
    with open('repaired_data.txt', 'w') as new_file:
        for row in fixed_data:
            new_file.write(','.join(row) + '\n')
    

    Assuming all your data files look like your sample (the big IF being the data has no embedded commas), then this rough code should help translate your text files into dbfs:

    raw_data = open('some_text_file.txt').read().split('\n')
    final_table = dbf.Table(
            'dest_table.dbf',
            'trajnum C(4); yyyymmdd C(8); time C(4); level C(9)',
            )
    with final_table:
        for line in raw_data:
            if not line.strip():
                continue  # skip the trailing empty line left by split('\n')
            fields = line.split(',')
            final_table.append(tuple(fields))
    
    # table has been populated and closed
    

    Of course, you could get fancier and use actual date and number fields if you want to:

    # field spec string becomes (semicolon-separated; N fields take a size)
    'trajnum N(4,0); yyyymmdd D; time C(4); level N(9,2)'
    
    # appending-data loop becomes
        for line in raw_data:
            trajnum, ymd, time, level = line.split(',')
            trajnum = int(trajnum)
            ymd = dbf.Date(int(ymd[:4]), int(ymd[4:6]), int(ymd[6:]))
            level = float(level)  # sample values like '0.' are not valid ints
            final_table.append((trajnum, ymd, time, level))
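The same typed conversion can be sanity-checked with the stdlib alone, since `dbf.Date` behaves like `datetime.date` (this helper is an illustration of mine, not the package's API; `float` is used for level because sample values like `'0.'` are not valid ints):

```python
import datetime

def convert_row(line):
    """Parse one comma-separated text row into typed values,
    mirroring the dbf.Date / numeric conversions above."""
    trajnum, ymd, time, level = (f.strip() for f in line.split(','))
    return (int(trajnum),
            datetime.date(int(ymd[:4]), int(ymd[4:6]), int(ymd[6:])),
            time,
            float(level))
```

For the sample record, `convert_row('1000,20111231,2300,0.')` yields `(1000, datetime.date(2011, 12, 31), '2300', 0.0)`.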