How to count characters instead of bytes?


I have some UUIDs stored in a database as base32-encoded strings without the padding. They are 26 characters long. I am trying to extract them in Python 2.7.5 and convert them into binary data for a different data store. The problem is that my Python DB utility interprets these base32 strings as unicode with 2 bytes per character. Here is the code:

import base64
import math
import uuid

str = row.uuid
print type(str)
print "Padding {0} with length {1}, mod 8 is {2}".format(str, len(str), len(str) % 8)
# pad with '=' up to the next multiple of 8 characters
str = str.ljust(int(math.ceil(len(str) / 8.0) * 8), '=')
print str
uuidbytes = base64.b32decode(str)
row.couponUuid = uuid.UUID(bytes=uuidbytes)

The output is this:

<type 'unicode'>
Padding ANEMTUTPUZFZFH6ANXNW5IOI4U with length 52, mod 8 is 4
ANEMTUTPUZFZFH6ANXNW5IOI4U====
File "path/to/my/script.py", line 143
    uuidbytes = base64.b32decode(str)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/base64.py", line 222, in b32decode
    raise TypeError('Non-base32 digit found')
TypeError: Non-base32 digit found

The docs say this TypeError can be caused by incorrect padding. As you can see, the string in question has 26 characters, but len reports 52, so it only gets 4 ='s for padding instead of the 6 it requires.
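
The padding arithmetic can be checked in isolation. This is a minimal sketch; `pad_base32` is a hypothetical helper name, not part of the script above. It pads to the next multiple of 8 characters, so a 26-character string gets 6 ='s and a 52-character one gets 4.

```python
def pad_base32(text):
    """Right-pad text with '=' up to the next multiple of 8 characters."""
    missing = -len(text) % 8  # 0 when the length is already a multiple of 8
    return text + '=' * missing

# 26 characters -> 6 padding characters
print(pad_base32('ANEMTUTPUZFZFH6ANXNW5IOI4U'))
# A 52-character input would only get 4, which matches the failing output above.
print(len(pad_base32('A' * 52)) - 52)
```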

If I try this in the console by pasting in the same string, it works, even if I prefix the string literal with a u. What transformation or method can I call to make len return the correct character count? I tried to normalize and encode it with the following code, but it still reported the same length and returned the same padding.

unicodedata.normalize('NFKD', row.couponUuid).encode('ascii', 'ignore')

The simpler encode trick suggested by @Ignacio doesn't cut it either:

str = row.couponUuid.encode('latin-1', 'replace')
print "Padding {0} with length {1}, mod 8 is {2}".format(str, len(str), len(str) % 8)
str = str.ljust(int(math.ceil(len(str) / 8.0) * 8), '=')

With either 'replace' or 'ignore', it still prints: Padding ANEMTUTPUZFZFH6ANXNW5IOI4U with length 52, mod 8 is 4

Additional Information as requested by @dano:

print repr(row.uuid) shows the unicode encoding of the string:

u'A\x00N\x00E\x00M\x00T\x00U\x00T\x00P\x00U\x00Z\x00F\x00Z\x00F\x00H\x006\x00A\x00N\x00X\x00N\x00W\x005\x00I\x00O\x00I\x004\x00U\x00'
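
A repr like this is what you get when UTF-16LE bytes are read as one character per byte. The sketch below reproduces the pattern under that assumption; whether the ODBC driver actually does this is a guess, and the variable names are illustrative only.

```python
# Assumption: the value arrives as UTF-16LE bytes that were read
# one byte per character (equivalent to decoding them as Latin-1).
clean = u'ANEMTUTPUZFZFH6ANXNW5IOI4U'
mangled = clean.encode('utf-16le').decode('latin-1')

print(repr(mangled[:8]))  # every real character is followed by a NUL
print("{0} characters became {1}".format(len(clean), len(mangled)))
```

Because every NUL is itself a valid character, normalizing or encoding with 'replace' or 'ignore' leaves the length at 52.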

The database this is being pulled from is Vertica (I think in the 7.x family). I'm not sure what its character set is, but the column type is VARCHAR(26). The data is pulled out of the database through a PyODBC connection. I am not explicitly encoding or decoding the data anywhere in my code. The Vertica database is populated by a different code base; I just have to pull it out with Python.

Here is everything Vertica can tell me about the table column:

TABLE_CAT         reporting
TABLE_SCHEM       reporting_master
TABLE_NAME        rmn_coupon
COLUMN_NAME       uuid
DATA_TYPE         12
TYPE_NAME         Varchar
COLUMN_SIZE       26
BUFFER_LENGTH     26
DECIMAL_DIGITS    (null)
NUM_PREC_RADIX    (null)
NULLABLE          1
REMARKS           (null)
COLUMN_DEF  
SQL_DATA_TYPE     12
SQL_DATETIME_SUB  (null)
CHAR_OCTET_LENGTH 26
ORDINAL_POSITION  2
IS_NULLABLE       YES
SCOPE_CATALOG     (null)
SCOPE_SCHEMA      (null)
SCOPE_TABLE       (null)
SOURCE_DATA_TYPE  (null)

Solution

  • So taking the obvious approach of replacing the spare null bytes seems to do the trick. (sigh)

    print repr(str)
    str = str.replace('\x00', '')
    print repr(str)
    str = str.ljust(int(math.ceil(len(str) / 8.0) * 8), '=')
    print repr(str)
    

    Shows this output:

    u'A\x00N\x00E\x00M\x00T\x00U\x00T\x00P\x00U\x00Z\x00F\x00Z\x00F\x00H\x006\x00A\x00N\x00X\x00N\x00W\x005\x00I\x00O\x00I\x004\x00U\x00'
    u'ANEMTUTPUZFZFH6ANXNW5IOI4U'
    u'ANEMTUTPUZFZFH6ANXNW5IOI4U======'
    

    Where the last line is a correctly padded base32 string.

    This question popped up in a Google search for '\x00 python' and gave me the hint.

    As Ignacio points out in the comments above, this can also be solved by using the right encoding and decoding. I'm not sure how you can tell what the right encoding and decoding are, but Ignacio's UTF-16LE does the trick.

    str = str.encode('latin-1').decode('utf-16le')
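
Putting the pieces together, here is a hedged sketch of the whole conversion from the mangled value to a uuid.UUID. The helper names `fix_utf16le` and `base32_to_uuid` are hypothetical, and the mangled input is built by hand here rather than read from Vertica.

```python
import base64
import uuid

def fix_utf16le(mangled):
    # Undo "UTF-16LE bytes read as Latin-1": back to raw bytes, then decode properly.
    return mangled.encode('latin-1').decode('utf-16le')

def base32_to_uuid(text):
    # Pad to a multiple of 8 characters, then decode to the 16 raw UUID bytes.
    padded = text + u'=' * (-len(text) % 8)
    raw = base64.b32decode(padded.encode('ascii'))
    return uuid.UUID(bytes=raw)

# Stand-in for row.uuid as returned by the driver (assumption: same NUL pattern).
mangled = u'ANEMTUTPUZFZFH6ANXNW5IOI4U'.encode('utf-16le').decode('latin-1')

fixed = fix_utf16le(mangled)
print(repr(fixed))
print(base32_to_uuid(fixed))
```

For pure-ASCII data like this, the re-encoding approach and the replace('\x00', '') approach give the same string. The re-encoding one would also keep working if a value ever contained non-ASCII characters.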