python pdf python-requests urllib

PDF corrupted after downloading with Python Requests


I need to download a file from either of these links with Python, but the PDF won't open after it is downloaded:

https://fnet.bmfbovespa.com.br/fnet/publico/exibirDocumento?id=693676
https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=693676

import requests

i = ["https://fnet.bmfbovespa.com.br/fnet/publico/exibirDocumento?id=693676", "https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=693676"]

l =0
for k in i:
    l += 1
    user_agent = "scrapping_script/1.0"
    headers = {'User-Agent': user_agent}
    download = requests.get(k, headers=headers)
    with open(f"/Users/renato/Documents/{l}.pdf", 'wb') as f:
        f.write(download.content)

I already tried using urllib and changing the headers, but had the same issue. Any suggestions? Thanks!


Solution

  • If I display the downloaded file (in a plain text editor or in the console), I see a string with the characteristic == at the end, which suggests that the file is encoded with Base64, so it needs to be decoded before saving:

    import base64
    
    content = base64.b64decode(download.content)
    

    That's all


    Full working code that I used for testing.

    I had to add verify=False because of a problem with SSL verification.

    import base64
    import requests
    
    items = [
        "https://fnet.bmfbovespa.com.br/fnet/publico/exibirDocumento?id=693676", 
        "https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=693676"
    ]
    
    headers = {
        'User-Agent': "scrapping_script/1.0",
        #'User-Agent': "Mozilla/5.0 (X11; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0",
    }
    
    for index, url in enumerate(items, 1):
        print('url:', url)
        response = requests.get(url, headers=headers, verify=False)
        
        #for key, val in response.headers.items():
        #    print(f'{key}: {val}')
        if 'Content-Disposition' in response.headers:
            print('Content-Disposition:', response.headers['Content-Disposition'])
            filename = response.headers['Content-Disposition'].split('filename=')[-1].strip('"')
            print('filename:', filename)
        
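        # The response body is Base64 text, so decode it to get the raw PDF bytes.
        content = base64.b64decode(response.content)
        #print(content)
        
        # Save under a simple numbered name (this replaces the filename taken from the header above).
        #filename = f"/Users/renato/Documents/{index:02}.pdf"
        filename = f"{index:02}.pdf"
        print('filename:', filename)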
        
        with open(filename, 'wb') as f:
            f.write(content)
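
If you'd rather not rely on spotting the == by eye, the check can be sketched in code. A raw PDF starts with the bytes %PDF-, so Base64 text of a PDF starts with JVBERi. The helper below is only a sketch under that assumption: the name to_pdf_bytes and the constant PDF_MAGIC are mine (hypothetical), not part of requests or of this site's API, and I have only reasoned about it against small in-memory samples, not against every possible server response.

```python
import base64
import binascii

# Raw PDF files begin with this signature.
PDF_MAGIC = b"%PDF-"


def to_pdf_bytes(payload: bytes) -> bytes:
    """Return raw PDF bytes from a payload that is raw or Base64-encoded.

    Hypothetical helper for illustration. Raises ValueError when neither
    form yields data that starts with the PDF signature.
    """
    if payload.startswith(PDF_MAGIC):
        # Already a raw PDF, nothing to decode.
        return payload
    try:
        # By default b64decode ignores characters outside the Base64 alphabet.
        decoded = base64.b64decode(payload)
    except binascii.Error as exc:
        raise ValueError("payload is neither a raw PDF nor valid Base64") from exc
    if not decoded.startswith(PDF_MAGIC):
        raise ValueError("decoded payload does not start with the PDF signature")
    return decoded
```

In the loop above it would replace the plain base64.b64decode(response.content) call, e.g. content = to_pdf_bytes(response.content). If the server ever returns an error page instead of a document, you get a ValueError instead of silently writing a broken .pdf file.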