
How could I improve this Python script to replace records in a dbf file?


I have a dbf file with about 9 million records and a size of 2.5 GB. Much of the space is taken up by an 80-character text field used to store one of about 10 different strings. To save on file size, I want to replace the character field with an integer field and use a relational database at a later stage to look up the full string if needed.

Currently I have the following Python script, which uses the dbf library (http://pythonhosted.org/dbf/). The script seems to work (tested on a smaller dbf file), but it runs for several hours when I try it on the full dbf file.

import dbf

tabel = dbf.Db3Table('dataset.dbf')
tabel.open()

with tabel:
    tabel.add_fields('newfield N(2, 0)')
    for record in tabel:
        if record.oldfield == 'string_a                                                                        ':
            dbf.write(record, newfield=1)
        elif record.oldfield == 'string_b                                                                        ':
            dbf.write(record, newfield=2)
        elif record.oldfield == 'string_c                                                                        ':
            dbf.write(record, newfield=3)
        elif record.oldfield == 'string_d                                                                        ':
            dbf.write(record, newfield=4)
        elif record.oldfield == 'string_e                                                                        ':
            dbf.write(record, newfield=5)
        elif record.oldfield == 'string_f                                                                        ':
            dbf.write(record, newfield=6)
        elif record.oldfield == 'string_g                                                                        ':
            dbf.write(record, newfield=7)
        elif record.oldfield == 'string_h                                                                        ':
            dbf.write(record, newfield=8)
        elif record.oldfield == 'string_i                                                                        ':
            dbf.write(record, newfield=9)
        elif record.oldfield == 'string_j                                                                        ':
            dbf.write(record, newfield=10)
        else:
            dbf.write(record, newfield=0)

dbf.delete_fields('dataset.dbf', 'oldfield')

As you may be able to see from the code, I am new to both Python and the dbf library. Can this script be made to run more efficiently?


Solution

  • Adding fields and deleting fields will each first make a backup copy of your 2.5 GB file.

    Your best bet is to make a new dbf with the same structure as the original, except without the old character field and with the new integer field; then make the change as you copy each record across. Something like:

    # lightly untested

    old_table = dbf.Table('old_table.dbf')

    # build the new structure: every field except the old character
    # field, plus the new integer field
    structure = old_table.structure()
    old_field_index = old_table.field_names.index('oldfield')
    structure = structure[:old_field_index] + structure[old_field_index+1:]
    structure.append('newfield N(2,0)')
    new_table = dbf.Table('new_name_here.dbf', structure)

    # copy every record across, swapping the old string for its code
    with dbf.Tables(old_table, new_table):
        for rec in old_table:
            rec = list(rec)
            old_value = rec.pop(old_field_index)
            rec.append(transform(old_value))   # see the sketch below
            new_table.append(tuple(rec))
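
    The transform step is left open in the answer above. A minimal sketch of one way to fill it in, using the mapping from the question (string_a -> 1 through string_j -> 10, anything else -> 0); the names CODES and transform are mine, not part of the dbf library:

    # map the ten known strings to their integer codes
    CODES = {'string_%s' % letter: code
             for code, letter in enumerate('abcdefghij', start=1)}

    def transform(old_value):
        # Char values come back space-padded (see the comparisons in the
        # question), so strip before the lookup; unknown strings get 0,
        # matching the else branch of the original script.
        return CODES.get(str(old_value).strip(), 0)

    A single dict lookup per record also replaces the ten chained string comparisons in the original script, which adds up over 9 million records.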