Tags: python, python-3.x, dbf, python-3.9

UnicodeDecodeError when reading data from DBF database


I need to write a script that connects an ERP program to a production program. The production side is straightforward: I send it data via HTTP requests. The ERP side is harder, because its data must be read from a DBF file.

I use the dbf library because, as far as I know, it is the only one that makes filtering data fairly simple and fast. I open the database this way:

import dbf

table = dbf.Table(path).open()
dbf_index = dbf.pql(table, "select * where ident == 'M'")

I then loop over the records the query returned. I need to package the selected data from the DBF database as JSON and send it to the production program's API.

data = {
    "warehouse_id" : parseDbfData(record['SYMBOL']),
    "code" : parseDbfData(record['SYMBOL']),
    "name" : parseDbfData(record['NAZWA']),
    "main_warehouse" : False,
    "blocked" : False
}

The parseDbfData function looks like this. It is not the cause of the problem, because the script failed the same way without it; I added it while trying to fix the problem.

def parseDbfData(data):
    return str(data.strip())
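
Put together, the packaging step can be sketched roughly as below. This is an illustration only: `build_payload`, `parse_dbf_data` (a renamed copy of parseDbfData) and the sample record are hypothetical, a plain dict stands in for a DBF record, and serializing with `json.dumps` is an assumption about how the request body is produced.

```python
import json


def parse_dbf_data(value):
    # Same idea as parseDbfData: trim the padding that fixed-width DBF fields carry.
    return str(value.strip())


def build_payload(record):
    """Build the JSON body for one warehouse record (hypothetical helper)."""
    data = {
        "warehouse_id": parse_dbf_data(record["SYMBOL"]),
        "code": parse_dbf_data(record["SYMBOL"]),
        "name": parse_dbf_data(record["NAZWA"]),
        "main_warehouse": False,
        "blocked": False,
    }
    # ensure_ascii=False keeps characters such as "ł" as-is instead of \uXXXX escapes.
    return json.dumps(data, ensure_ascii=False)


# Stand-in for a DBF record: a plain dict with padded values (made-up sample data).
sample_record = {"SYMBOL": "MAG1   ", "NAZWA": "Magazyn materiałów Podgórna   "}
payload = build_payload(sample_record)
print(payload)
```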

When the script encounters any non-ASCII character from the DBF database (e.g. Polish characters such as ą, ę, ś, ć), it terminates with an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 15: ordinal not in range(128)

The error points to this line (in the code that builds the JSON):

"name" : parseDbfData(record['NAZWA']),

The value the script is trying to read at this point is probably "Magazyn materiałów Podgórna". As you can see, this value contains the characters "ł" and "ó". I think this is what breaks the script, but I don't know how to fix it.
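
The byte value and position in the error message are consistent with this guess. Here is a minimal check that needs no DBF file, assuming the text is stored in cp852 (a DOS code page that covers Polish; that choice is an assumption at this point, not something the error message states):

```python
# Suspected field value, taken from the question.
suspected = "Magazyn materiałów Podgórna"

# Assumption: the file stores text in cp852.
raw = suspected.encode("cp852")

error = None
try:
    # Decoding as ASCII is what the failing script effectively does.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Show the offending byte and where it sits in the value.
print(hex(raw[15]), suspected[15])
```

Under that assumption, "ł" is the single byte 0x88 and sits at index 15 of the suspected value, which matches the reported error.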

I'll mention that I'm using Python 3.9. I know there were character encoding issues in Python 2.x, but I thought Python 3.x had remedied that problem. It turns out it hadn't :(


Solution

  • I came to the conclusion that I have to specify the encoding directly when reading the DBF database. However, I could not work out from the documentation exactly how to do this.

    After a thorough look at the dbf module itself, I concluded that I need to pass the codepage parameter when opening the database. After some experimenting, I determined that, of all the encodings available in the module, cp852 suits my data best.

    After the correction, the code to open a DBF database looks like this:

    table = dbf.Table(path, codepage='cp852').open()
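
The trial-and-error step of picking a code page can also be done outside the dbf module with Python's built-in codecs. The sketch below is one possible approach, not the method used above: `matching_codepages` is a hypothetical helper, the candidate list is arbitrary, and the sample bytes are an assumed raw form of the field (only the 0x88 at position 15 is confirmed by the error message).

```python
def matching_codepages(raw, expected, candidates):
    """Return the candidate encodings that decode `raw` to exactly `expected`."""
    matches = []
    for name in candidates:
        try:
            decoded = raw.decode(name)
        except UnicodeDecodeError:
            # This encoding has no character for one of the bytes.
            continue
        if decoded == expected:
            matches.append(name)
    return matches


# Assumed raw bytes of the NAZWA field (hypothetical sample).
raw = b"Magazyn materia\x88\xa2w Podg\xa2rna"
expected = "Magazyn materiałów Podgórna"
candidates = ["cp437", "cp852", "cp1250", "iso8859_2"]

result = matching_codepages(raw, expected, candidates)
print(result)  # prints ['cp852']
```

Comparing against a value whose correct spelling is known avoids judging the decoded output by eye.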