Tags: python, csv, python-requests, byte-order-mark

Python requests, CSV, Sha256 and BOM


I am gathering a set of CSVs for athletes using Requests and Python 2.7.

These files are generated by MSFT Report Server, and Requests reports their encoding as ISO-8859-1.

Because I'm dealing with thousands every night, I want to sha256 the files and compare to the previous hash for the athlete. If the hash matches, I won't bother saving the file to disk. These files are small -- biggest is about 6K -- so no chunking / streaming issues.
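
The nightly check described above can be sketched roughly as follows. The names `digest`, `is_unchanged`, and the `previous_hashes` dictionary are hypothetical, and the sample bytes stand in for a real response body. The point is only that `hashlib.sha256` works on raw bytes (what Requests exposes as `r.content`), so no text decoding is needed before hashing.

```python
import hashlib

def digest(body):
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(body).hexdigest()

def is_unchanged(athlete_id, body, previous_hashes):
    """True if body hashes to the value already stored for this athlete."""
    return previous_hashes.get(athlete_id) == digest(body)

# Stand-in for r.content; a real run would use the bytes returned by Requests.
sample = b"\xff\xfem\x00a\x00"
# Hypothetical store of last night's hashes, keyed by athlete id.
previous_hashes = {17424: digest(sample)}

print(is_unchanged(17424, sample, previous_hashes))              # True
print(is_unchanged(17424, sample + b"x\x00", previous_hashes))   # False
```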

sha256 is failing, however, because of a pesky BOM with these files. I've looked at 10 different "solutions" here and cannot find one that will pull the BOM out through decode.encode so that I can do my sha256.

One workaround, which I may have to resort to, is to write the file to disk and sha256 it there. But that seems like really bad form.

If I can strip out the BOM at the start, I'll have a process that works with sha256 and saves me from dealing with superfluous files.

I think the problem could be that I'm ostensibly trying string operations on what is a file object. But since the object is still a u"/... hex stream, I thought these ops would work...

Here are the details:

>>> r = requests.get('http://66.73.188.164/ReportServer?%2fCPTC%2fWomens1stHalfDetail&Team=&player=17424&rs:Format=CSV')
>>> r.status_code
200
>>> r.raw
<requests.packages.urllib3.response.HTTPResponse object at 0x18afb70>
>>> r.encoding
'ISO-8859-1'
>>> print r.headers['content-type']
text/plain
>>> r.text[0]
u'\xff'

First attempt to convert fails to decode using the indicated encoding type!

>>> z = r.text
>>> z.decode('iso-8859-1').encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

And in fact, the "type" of z is now different than expected, perhaps because of sys (mac; utf8)?

>>> type(z)
<type 'unicode'>
>>> z[0]
u'\xff'
>>> z[0:5]
u'\xff\xfem\x00a'
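
The leading `\xff\xfe` followed by `m\x00a` matches the UTF-16 little-endian byte order mark and UTF-16 encoded text, which suggests the body is UTF-16 rather than ISO-8859-1. A minimal sketch of that check, using a hand-built byte string in place of `r.content` (the sample bytes and the `looks_like_utf16_le` helper are assumptions for illustration):

```python
import codecs

def looks_like_utf16_le(body):
    """True if the byte string starts with the UTF-16 LE byte order mark."""
    return body.startswith(codecs.BOM_UTF16_LE)

# Stand-in for r.content: "ma" encoded as UTF-16 LE with a leading BOM.
sample = codecs.BOM_UTF16_LE + u"ma".encode("utf-16-le")

print(looks_like_utf16_le(sample))        # True
# The "utf-16" codec reads the BOM to pick the byte order and drops it.
print(sample.decode("utf-16") == u"ma")   # True
```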

Various attempts to decode and encode have failed me; here is one of many such attempts.

>>> z.decode('utf-8-sig').encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/encodings/utf_8_sig.py", line 22, in decode
    (output, consumed) = codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
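
Both tracebacks report an encode error even though `decode` was called. In Python 2, calling `.decode()` on a `unicode` object first converts it to a byte string with the default ASCII codec, and that implicit step is what fails on `\xff\xfe`. The sketch below makes the hidden step explicit so that it runs on Python 2.7 or 3. It also shows that encoding the text back with ISO-8859-1 (the codec Requests reported) recovers the original bytes. The sample value is an assumption standing in for `r.text`.

```python
# Stand-in for r.text: UTF-16 LE bytes that were decoded as ISO-8859-1.
original = b"\xff\xfem\x00a\x00"
z = original.decode("iso-8859-1")

try:
    # This is the implicit step Python 2 performs inside unicode.decode().
    z.encode("ascii")
except UnicodeEncodeError as exc:
    print("implicit ASCII encode failed: %s" % exc)

# ISO-8859-1 maps code points 0-255 straight to bytes, so this round-trips.
recovered = z.encode("iso-8859-1")
print(recovered == original)                  # True
print(recovered.decode("utf-16") == u"ma")    # True
```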

I'm sure the answer is a one-liner; I'm just not seeing it. Any guidance most appreciated.


Solution

  • Maybe you could try omitting the BOM and encoding only the rest of the file before computing the sha256? As in:

    z = r.text[2:]
    z.decode ...
    

    The same logic would have to apply to the hashes of files already stored on the disk, but that shouldn't be a problem.
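
A minimal sketch of that idea, working on the raw bytes (what `r.content` would return) rather than on `r.text`, so no decode/encode round trip is involved. The helper names `strip_bom` and `sha256_without_bom` are hypothetical, and the sample bytes stand in for a real response.

```python
import codecs
import hashlib

def strip_bom(body):
    """Drop a leading UTF-16 LE byte order mark if one is present."""
    if body.startswith(codecs.BOM_UTF16_LE):
        return body[len(codecs.BOM_UTF16_LE):]
    return body

def sha256_without_bom(body):
    """Hex SHA-256 of the byte string, ignoring a leading BOM."""
    return hashlib.sha256(strip_bom(body)).hexdigest()

# Stand-ins for a downloaded body with a BOM and a stored copy without one.
with_bom = codecs.BOM_UTF16_LE + u"ma".encode("utf-16-le")
without_bom = u"ma".encode("utf-16-le")

print(sha256_without_bom(with_bom) == sha256_without_bom(without_bom))  # True
```

The same function could be applied to the bytes of a file read from disk in binary mode, so that stored and freshly downloaded copies hash alike.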