python-3.x, beautifulsoup, web-crawler, http-status-code-500, urlretrieve

Downloading xls/csv files using urlretrieve from Python stops


I'm trying to download a bunch of xls files from this ASPX site and its folders using urlretrieve from the urllib.request module in Python 3.7. First, I build a txt file with the URLs from the site. Then, I loop over the list and ask the server to retrieve each xls file, following this solution here.

The algorithm starts downloading the xls files into the working directory, but after 3 or 4 iterations it breaks. The downloaded files (3 or 4) have an incorrect size (all of them 7351 KB, instead of 99 KB or 83 KB, for example). Surprisingly, 7351 KB is the size of the file at the last URL in the txt file.

Sometimes, the log shows a 500 error message.

For the last issue, my hypotheses/questions are:

  1. The error is raised by a firewall that blocks repeated calls to the server.

  2. Maybe the calls break some synchronous/asynchronous rule unknown to me. I used time.sleep to prevent the error, but it failed.

The first issue is very strange, and it is chained to the second one.
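
To see which request actually triggers the 500, the retrieval can be wrapped in a try/except for urllib.error.HTTPError. This is only a debugging sketch, not part of my original script; it assumes direcciones is the URL list built in the code below:

from urllib.error import HTTPError
from urllib.request import urlretrieve

for href in direcciones:
    try:
        urlretrieve(href, href.rsplit('/', 1)[-1])
    except HTTPError as e:
        # log the failing URL and status code instead of stopping the loop
        print(e.code, 'while downloading', href)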

Here is my code:

import os
import time    
from random import randint
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve, quote    



url="http://informacioninteligente10.xm.com.co/transacciones/Paginas/HistoricoTransacciones.aspx"
        u = urlopen(url)
        try:
            html = u.read().decode('utf-8')
        finally:
            u.close()
direcciones = [] #to be populated with urls

soup = BeautifulSoup(html, 'html.parser')  # explicit parser avoids the "no parser specified" warning
for link in soup.select('div[webpartid] a'):
    href = link.get('href')
    if href.startswith('javascript:'):
        continue
    filename = href.rsplit('/', 1)[-1]

    href = urljoin(url, quote(href))
    #try:
    #    urlretrieve(href, filename)
    #except:
    #    print('Downloading Error')
    
    if any(href.endswith(x) for x in ['.xls', '.xlsx', '.csv']):
        direcciones.append(href)

# "\n"  adds a new line
direcciones = '\n'.join(direcciones)


#Save every element in a txt file
with open("file.txt", "w") as output:
     output.write(direcciones) 


DOWNLOADS_DIR = os.getcwd()

# For every line in the file
for url in open("file.txt"):
    time.sleep(randint(0,5))

    # Split on the rightmost / and take everything on the right side of that
    name = url.rsplit('/', 1)[-1]

    # Combine the name and the downloads directory to get the local filename
    filename = os.path.join(DOWNLOADS_DIR, name)
    filename = filename[:-1] # Strip the trailing newline from the line read from the file

    # Download the file if it does not exist
    if not os.path.isfile(filename):
        urlretrieve(href, filename)

Am I not using the correct url parser?

Any ideas? Thanks!


Solution

  • the site has anti-bot protection; you need to set a browser User-Agent instead of the default Python User-Agent

    ......
    import urllib.request
    
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0')]
    urllib.request.install_opener(opener)
    
    url=....
    

    and you have to replace href with url in

    if not os.path.isfile(filename):
        urlretrieve(href, filename) # must be: urlretrieve(url, filename)
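
    For reference, a minimal sketch of how the whole download loop could look once the opener is installed and url is used instead of href (the User-Agent string, file names, and paths are just illustrative, not the asker's exact code):

    import os
    import urllib.request
    from urllib.request import urlretrieve

    # install an opener with a browser User-Agent so every urlretrieve call uses it
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0')]
    urllib.request.install_opener(opener)

    DOWNLOADS_DIR = os.getcwd()

    with open("file.txt") as f:
        for url in f:
            url = url.strip()  # drop the trailing newline
            filename = os.path.join(DOWNLOADS_DIR, url.rsplit('/', 1)[-1])
            if not os.path.isfile(filename):
                urlretrieve(url, filename)  # use url, not href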