I have some UUIDs stored in a database as base32-encoded strings without the padding. They are 26 characters in length. I am trying to extract them in Python 2.7.5 and convert them into binary data for a different data store. The problem is that my Python DB utility interprets these base32 strings as unicode with 2 bytes per character. Here is the code:
str = row.uuid
print type(str)
print "Padding {0} with length {1}, mod 8 is {2}".format(str, len(str), len(str) % 8)
str = str.ljust(int(math.ceil(len(str) / 8.0) * 8), '=')
print str
uuidbytes = base64.b32decode(str)
row.couponUuid = uuid.UUID(bytes=uuidbytes)
The output is this:
<type 'unicode'>
Padding ANEMTUTPUZFZFH6ANXNW5IOI4U with length 52, mod 8 is 4
ANEMTUTPUZFZFH6ANXNW5IOI4U====
File "path/to/my/script.py", line 143
uuidbytes = base64.b32decode(str)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/base64.py", line 222, in b32decode
raise TypeError('Non-base32 digit found')
TypeError: Non-base32 digit found
And the docs say the TypeError can be caused by incorrect padding. As you can see, the string in question has 26 characters, not 52, and as such only gets 4 ='s for padding instead of the 6 it requires.
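To make the padding arithmetic concrete, here is a minimal sketch of the calculation. The helper name `pad_base32` and the way the 52-character string is built are illustrative assumptions, not part of the original script; the snippet is written to run on both Python 2.7 and Python 3.

```python
import base64

def pad_base32(s):
    """Right-pad s with '=' up to the next multiple of 8 characters."""
    return s + "=" * (-len(s) % 8)

good = "ANEMTUTPUZFZFH6ANXNW5IOI4U"           # the 26 visible characters
# Assumed shape of the value coming back from the driver: a NUL after each character.
doubled = "".join(c + "\x00" for c in good)   # 52 characters

padded_good = pad_base32(good)         # 26 -> 32 characters, 6 '=' appended
padded_doubled = pad_base32(doubled)   # 52 -> 56 characters, only 4 '=' appended

# The correctly padded string decodes to the 16 bytes of a UUID.
raw = base64.b32decode(padded_good)
print(len(padded_good), len(padded_doubled), len(raw))
```

The 52-character version is never passed to `b32decode` here, since the embedded NUL characters are not base32 digits and the call would fail as in the traceback above.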
If I try this in the console by pasting in the same string, it works, even if I prefix the string literal with a u. What transformation or method can I call to make len return the correct character count? I tried to normalize and encode it with the following code, but it still reported the same length and returned the same padding.
unicodedata.normalize('NFKD', row.couponUuid).encode('ascii', 'ignore')
Trying the simpler encode trick provided by @Ignacio doesn't cut it either:
str = row.couponUuid.encode('latin-1', 'replace')
print "Padding {0} with length {1}, mod 8 is {2}".format(str, len(str), len(str) % 8)
str = str.ljust(int(math.ceil(len(str) / 8.0) * 8), '=')
With either 'replace' or 'ignore', it still prints: Padding ANEMTUTPUZFZFH6ANXNW5IOI4U with length 52, mod 8 is 4
Additional Information as requested by @dano:
print repr(row.uuid)
shows the unicode encoding of the string:
u'A\x00N\x00E\x00M\x00T\x00U\x00T\x00P\x00U\x00Z\x00F\x00Z\x00F\x00H\x006\x00A\x00N\x00X\x00N\x00W\x005\x00I\x00O\x00I\x004\x00U\x00'
The database this is being pulled from is Vertica (I think in the 7.x family). I'm not sure what its character set is, but the column type is VARCHAR(26). It's being pulled out of the database by a PyODBC connection. I am not specifically encoding or decoding the data anywhere in my code. The Vertica database is populated by a different code base; I just have to pull it out with Python.
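One way to read the repr above, offered as an assumption rather than something confirmed about Vertica or PyODBC, is that the value arrives as UTF-16LE bytes which then get turned into a unicode string one byte per character. The sketch below reproduces that shape from the plain 26-character string; the variable names are illustrative only.

```python
# Start from the 26-character value as it is presumably stored in the column.
text = u"ANEMTUTPUZFZFH6ANXNW5IOI4U"

# Encode as UTF-16LE (2 bytes per ASCII character, low byte first), then
# misread each byte as its own character, which is what latin-1 decoding does.
misread = text.encode("utf-16le").decode("latin-1")

print(repr(misread))   # every other character is a NUL, as in the repr above
print(len(misread))    # 52

# Reversing the two steps recovers the original 26 characters.
recovered = misread.encode("latin-1").decode("utf-16le")
```

If this reading is right, it also explains why encoding to latin-1 or ASCII alone does not change the length: the NUL characters are valid in both encodings, so nothing is replaced or ignored.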
Here is everything Vertica can tell me about the table column:
TABLE_CAT reporting
TABLE_SCHEM reporting_master
TABLE_NAME rmn_coupon
COLUMN_NAME uuid
DATA_TYPE 12
TYPE_NAME Varchar
COLUMN_SIZE 26
BUFFER_LENGTH 26
DECIMAL_DIGITS (null)
NUM_PREC_RADIX (null)
NULLABLE 1
REMARKS (null)
COLUMN_DEF
SQL_DATA_TYPE 12
SQL_DATETIME_SUB (null)
CHAR_OCTET_LENGTH 26
ORDINAL_POSITION 2
IS_NULLABLE YES
SCOPE_CATALOG (null)
SCOPE_SCHEMA (null)
SCOPE_TABLE (null)
SOURCE_DATA_TYPE (null)
So taking the obvious approach of replacing the spare null bytes seems to do the trick. (sigh)
print repr(str)
str = str.replace('\x00', '')
print repr(str)
str = str.ljust(int(math.ceil(len(str) / 8.0) * 8), '=')
print repr(str)
Shows this output:
u'A\x00N\x00E\x00M\x00T\x00U\x00T\x00P\x00U\x00Z\x00F\x00Z\x00F\x00H\x006\x00A\x00N\x00X\x00N\x00W\x005\x00I\x00O\x00I\x004\x00U\x00'
u'ANEMTUTPUZFZFH6ANXNW5IOI4U'
u'ANEMTUTPUZFZFH6ANXNW5IOI4U======'
Where the last line is a correctly padded base32 string.
This question popped up on a Google search for '\x00 python' and gave me the hint.
As Ignacio points out in the comments above, this can also be solved by using the right encoding and decoding. I'm not sure how you can tell what the right encoding and decoding are, but Ignacio's UTF-16LE suggestion does the trick.
str = str.encode('latin-1').decode('utf-16le')
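Putting the pieces together, a minimal end-to-end sketch might look like the following. The function name `uuid_from_misdecoded_base32` and the sample input are assumptions for illustration; the snippet avoids shadowing the built-in `str` and is written to run on both Python 2.7 and Python 3.

```python
import base64
import uuid

def uuid_from_misdecoded_base32(value):
    """Convert a NUL-interleaved, unpadded base32 unicode string to a UUID."""
    # Undo the byte-per-character misreading: back to bytes, then proper UTF-16LE.
    fixed = value.encode("latin-1").decode("utf-16le")
    # Pad with '=' up to the next multiple of 8 characters.
    padded = fixed + u"=" * (-len(fixed) % 8)
    # b32decode takes a byte string; the base32 alphabet is pure ASCII.
    return uuid.UUID(bytes=base64.b32decode(padded.encode("ascii")))

# Hypothetical input shaped like the repr shown earlier.
sample = u"".join(c + u"\x00" for c in u"ANEMTUTPUZFZFH6ANXNW5IOI4U")
result = uuid_from_misdecoded_base32(sample)
print(result)

# Stripping the NULs gives the same answer for this ASCII-only data.
stripped = sample.replace(u"\x00", u"")
```

Stripping the NUL characters is enough here because base32 text is pure ASCII; the encode/decode round trip is the more general fix, since it would also recover characters outside the ASCII range.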