Problem decoding content from httplib2 request in Python


Good Morning,

I have the following code, which uses httplib2 to get content from a URL:

from __future__ import unicode_literals
import httplib2
import requests
import subprocess
from bs4 import BeautifulSoup

def initialize():
    global url
    url = "http://nottherealurl.com"
    global header
    header = set_header()


def set_header():
    return {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Content-Type": "text/html; charset=utf-8",
        "Accept-Language": "en-US,en;q=0.5",
        "Connection": "keep-alive",
        "DNT": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "cross-site",
        "Sec-Fetch-User": "?1",
        "Sec-GPC": "1",
        "Upgrade-Insecure-Requests": "1",
        "TE": "trailers",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; rv:122.0) Gecko/20100101 Firefox/122.0",
    }


def get_url():
    initialize()
    h = httplib2.Http()
    # Send the GET request with the headers above and print the raw response body
    (resp, content) = h.request(url, "GET", headers=header)
    print(content)

I call get_url() to get the content of the URL; however, it prints what looks like binary data, such as b'\x90\x03\x02\x80\xfc-\xd5\xfe\xec\\N.

I do not have the same problem when I request the same URL with curl in Cygwin.


Solution

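  • The problem is in this header:

    "Accept-Encoding":"gzip, deflate, br"

    As explained in the answer to "Python requests response encoded in utf-8 but cannot be decoded":

    br requests Brotli compression, a new-ish compression standard (see RFC 7932) that Google is pushing to replace gzip on the web. Chrome is asking for Brotli because recent versions of Chrome understand it natively. You're asking for Brotli because you copied the headers from Chrome. But requests doesn't understand Brotli natively.

    The same reasoning applies to httplib2: it decompresses gzip and deflate responses itself, but as far as I know it does not handle Brotli, so a br-compressed body comes back as raw compressed bytes. This is also the likely reason curl works: by default curl sends no Accept-Encoding header, so the server replies uncompressed.

    After changing the header to "Accept-Encoding":"gzip, deflate"

    the compression is handled by httplib2 and the content can be displayed normally.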
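
If the headers are copied from a browser and you would rather not edit them by hand, the fix above can be sketched as a small helper that drops br from Accept-Encoding before the request is sent. This is only a minimal sketch: the names without_brotli and fetch_text are made up for illustration, and the utf-8 default charset is an assumption about the page being fetched.

```python
def without_brotli(headers):
    """Return a copy of `headers` whose Accept-Encoding no longer offers 'br'.

    Hypothetical helper for illustration; it is not part of httplib2.
    """
    cleaned = dict(headers)
    for name in list(cleaned):
        if name.lower() != "accept-encoding":
            continue
        codings = [c.strip() for c in cleaned[name].split(",")]
        # Compare only the coding name, ignoring any ";q=..." weight.
        kept = [c for c in codings
                if c and c.split(";")[0].strip().lower() != "br"]
        if kept:
            cleaned[name] = ", ".join(kept)
        else:
            # Nothing left to offer: drop the header entirely.
            del cleaned[name]
    return cleaned


def fetch_text(url, headers, charset="utf-8"):
    """Fetch `url` with httplib2 and return the body as text.

    The charset default is an assumption; adjust it to the real page.
    """
    # Third-party import kept local so without_brotli() works without httplib2.
    import httplib2

    http = httplib2.Http()
    resp, content = http.request(url, "GET", headers=without_brotli(headers))
    return content.decode(charset, errors="replace")


if __name__ == "__main__":
    browser_headers = {"Accept-Encoding": "gzip, deflate, br"}
    print(without_brotli(browser_headers))
```

With the code from the question, this would be called as fetch_text(url, header). The original dictionary is left untouched, so the same browser headers can still be reused elsewhere.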