I am gathering a set of CSVs for athletes using Requests and Python 2.7.
These files are being generated by MSFT Report Server and come through as iso-8859-1, says Requests.
Because I'm dealing with thousands every night, I want to sha256 the files and compare to the previous hash for the athlete. If the hash matches, I won't bother saving the file to disk. These files are small -- biggest is about 6K -- so no chunking / streaming issues.
sha256 is failing, however, because of a pesky BOM in these files. I've looked at 10 different "solutions" here and cannot find one that will pull the BOM out through a decode/encode round-trip so that I can do my sha256.
One workaround, which I may have to resort to, is writing the file to disk and then sha256ing it there. But that seems really bad form.
If I can strip out the BOM at the start, I'll have a process that works with sha256 and saves me from dealing with superfluous files.
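For context, the compare-before-save step I have in mind looks roughly like this. It hashes the raw response body (r.content, which is bytes) rather than the decoded r.text, since sha256 can't hash unicode; the player_hashes store and the sample bytes are hypothetical:

```python
import hashlib

player_hashes = {}  # hypothetical store: player id -> last run's digest


def is_new(player_id, raw_bytes):
    # raw_bytes would be r.content -- bytes, since sha256 can't hash unicode
    digest = hashlib.sha256(raw_bytes).hexdigest()
    if player_hashes.get(player_id) == digest:
        return False  # unchanged since last night; skip saving
    player_hashes[player_id] = digest
    return True  # changed (or first seen); save the file to disk
```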
I think the problem could be that I'm ostensibly trying string operations on what is a file object. But since the object is still a u"\xff..." string of hex escapes, I thought these ops would work...
Here are the details:
>>> r = requests.get('http://66.73.188.164/ReportServer?%2fCPTC%2fWomens1stHalfDetail&Team=&player=17424&rs:Format=CSV')
>>> r.status_code
200
>>> r.raw
<requests.packages.urllib3.response.HTTPResponse object at 0x18afb70>
>>> r.encoding
'ISO-8859-1'
>>> print r.headers['content-type']
text/plain
>>> r.text[0]
u'\xff'
First attempt to convert fails to decode using the indicated encoding type!
>>> z = r.text
>>> z.decode('iso-8859-1').encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
And in fact, the "type" of z is now different from what I expected, perhaps because of the system default encoding (Mac; utf8)?
>>> type(z)
<type 'unicode'>
>>> z[0]
u'\xff'
>>> z[0:5]
u'\xff\xfem\x00a'
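For what it's worth, those first characters look like the UTF-16 LE BOM (\xff\xfe) followed by two-byte UTF-16 code units (the \x00 after each letter), which would mean the server is really sending UTF-16, not iso-8859-1. A quick check on a padded sample (one \x00 added so the byte count is even; my real data continues past five characters):

```python
import codecs

# mojibake sample: UTF-16 LE bytes that were decoded as latin-1
sample = u'\xff\xfem\x00a\x00'

raw = sample.encode('latin-1')  # recover the original bytes, one char per byte
assert raw.startswith(codecs.BOM_UTF16_LE)
text = raw.decode('utf-16')  # BOM consumed; text == u'ma'
```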
Various attempts to decode and encode have failed me; here is one of many such attempts.
>>> z.decode('utf-8-sig').encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/encodings/utf_8_sig.py", line 22, in decode
(output, consumed) = codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
I'm sure the answer is a one-liner; I'm just not seeing it. Any guidance most appreciated.
Maybe you could try omitting the BOM by hashing only the rest of the file? As in:
z = r.text[2:]
hashlib.sha256(z.encode('utf-8')).hexdigest()
Note that r.text is already a unicode object, so calling .decode on it makes Python 2 first encode it with the default ascii codec -- that is where your UnicodeEncodeError comes from; you want .encode here. The same logic would have to apply to the hashes of files already stored on disk, but that shouldn't be a problem.
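Alternatively, since sha256 needs bytes anyway, you could skip the decode/encode round-trip entirely and strip the BOM from the raw body. A minimal sketch, assuming r.content starts with the UTF-16 LE BOM (b'\xff\xfe') that the u'\xff\xfe' prefix in r.text suggests:

```python
import codecs
import hashlib


def bom_free_sha256(raw):
    # raw: the undecoded response body, i.e. r.content (bytes), not r.text
    if raw.startswith(codecs.BOM_UTF16_LE):
        raw = raw[len(codecs.BOM_UTF16_LE):]  # drop the 2-byte BOM
    return hashlib.sha256(raw).hexdigest()
```

This also keeps the stored hashes stable even if the server someday stops emitting the BOM, since the digest only ever covers the payload.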