python, urllib2, chunked-encoding, httplib

Python urllib2 decode chunked encoding


I have the following code to open and read URLs:

html_data = urllib2.urlopen(req).read()

and I believe this is the standard way to read data over HTTP. However, when the response uses chunked transfer encoding, it starts with the following characters:

1eb0\r\n2625\r\n
<?xml version="1.0" encoding="UTF-8"?>
...

This is caused by the chunked encoding mentioned above, and it corrupts my XML data.

How can I get rid of all the metadata related to the chunked encoding?
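For reference, chunked transfer encoding frames each piece of the body as a hexadecimal size line, CRLF, the data, CRLF, and ends the body with a zero-size chunk. A minimal sketch of that framing, in case it helps to see where the extra characters come from (the helper name `encode_chunked` and the sample XML are made up for illustration, not taken from my real code):

```python
def encode_chunked(pieces):
    """Frame an iterable of byte strings as an HTTP/1.1 chunked body."""
    out = []
    for piece in pieces:
        if not piece:
            continue  # a zero-length chunk would terminate the body early
        size_line = ('%x' % len(piece)).encode('ascii')  # chunk size in hex
        out.append(size_line + b'\r\n' + piece + b'\r\n')
    out.append(b'0\r\n\r\n')  # last chunk, no trailers
    return b''.join(out)


# Hypothetical two-chunk body for illustration.
body = encode_chunked([b'<?xml version="1.0"?>', b'<mytag/>'])
```

The size lines (`15` and `8` for the sample above) are the kind of metadata that ends up mixed into the XML when the framing is not removed.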


Solution

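  • I ended up stripping everything around the XML by hand, like this:

        end_tag = '</mytag>'
        xml_start = html_data.find('<?xml')
        if xml_start > 0:
            # Chunk metadata precedes the XML declaration: log it, then drop it.
            log_user_action(req.get_host(), 'chunked data', html_data, {})
            html_data = html_data[xml_start:]
        # Search for the closing tag only after trimming, so the index is valid.
        xml_end = html_data.rfind(end_tag)
        if xml_end != -1:
            # Keep the whole closing tag and drop anything after it.
            html_data = html_data[:xml_end + len(end_tag)]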
    

    I couldn't find a simpler solution.
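    Note that trimming around the XML only removes chunk metadata at the two ends; size lines between chunks would stay in the data. A more general approach is to decode the chunked framing itself. Below is a rough sketch, assuming the raw body is a well-formed chunked stream held in a byte string. The function name `decode_chunked` is hypothetical, and the data in the question shows two size lines in a row, so this may need adapting before it works on that response.

```python
def decode_chunked(raw):
    """Decode an HTTP/1.1 chunked body given as a byte string."""
    decoded = []
    pos = 0
    while True:
        line_end = raw.find(b'\r\n', pos)
        if line_end == -1:
            raise ValueError('missing chunk-size line')
        # Ignore optional chunk extensions after ';'.
        size_field = raw[pos:line_end].split(b';', 1)[0].strip()
        size = int(size_field.decode('ascii'), 16)
        pos = line_end + 2
        if size == 0:
            break  # last chunk; any trailers are ignored in this sketch
        chunk = raw[pos:pos + size]
        if len(chunk) != size:
            raise ValueError('truncated chunk')
        decoded.append(chunk)
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return b''.join(decoded)
```

    Usage would be something like `html_data = decode_chunked(html_data)`. It raises `ValueError` on malformed input, so the call probably belongs in a `try` block with the manual stripping as a fallback.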