I'm trying to download a bunch of .xls files from this ASPX site and its folders using urlretrieve from the urllib.request module in Python 3.7. First, I build a txt file with the URLs from the site. Then, I loop over the list and ask the server to retrieve each .xls file, following this solution here.
The algorithm starts downloading the .xls files into the working directory, but after 3 or 4 iterations it breaks. The downloaded files (3 or 4 of them) all have the same wrong size (7351 KB each, instead of e.g. 99 KB or 83 KB). Surprisingly, that is the size of the file behind the last URL in the txt file.
Sometimes the log shows a 500 error.
For the latter issue, my hypotheses/questions are:
The error is raised by a firewall that blocks repeated calls to the server.
Maybe the calls break some synchronous/asynchronous rule unknown to me. I added a time.sleep to prevent the error, but it failed.
The first issue is very strange, and it seems chained to the second one.
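In case it helps the diagnosis, wrapping the call like this (a minimal sketch; the fetch wrapper and the logging are my own) would show which URLs actually trigger the 500:

from urllib.error import HTTPError
from urllib.request import urlretrieve

def fetch(url, filename):
    # Report the HTTP status of a failing request instead of aborting
    try:
        urlretrieve(url, filename)
    except HTTPError as e:
        print('{} failed with HTTP {}'.format(url, e.code))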
Here is my code:
import os
import time
from random import randint
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve, quote

url = "http://informacioninteligente10.xm.com.co/transacciones/Paginas/HistoricoTransacciones.aspx"
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

direcciones = []  # to be populated with urls
soup = BeautifulSoup(html)
for link in soup.select('div[webpartid] a'):
    href = link.get('href')
    if href.startswith('javascript:'):
        continue
    filename = href.rsplit('/', 1)[-1]
    href = urljoin(url, quote(href))
    #try:
    #    urlretrieve(href, filename)
    #except:
    #    print('Downloading Error')
    if any(href.endswith(x) for x in ['.xls', '.xlsx', '.csv']):
        direcciones.append(href)

# "\n" adds a new line
direcciones = '\n'.join(direcciones)

# Save every element in a txt file
with open("file.txt", "w") as output:
    output.write(direcciones)

DOWNLOADS_DIR = os.getcwd()

# For every line in the file
for url in open("file.txt"):
    time.sleep(randint(0, 5))
    # Split on the rightmost / and take everything on the right side of that
    name = url.rsplit('/', 1)[-1]
    # Combine the name and the downloads directory to get the local filename
    filename = os.path.join(DOWNLOADS_DIR, name)
    filename = filename[:-1]  # strip the trailing newline
    # Download the file if it does not exist
    if not os.path.isfile(filename):
        urlretrieve(href, filename)
Am I not using the correct URL parser?
Any ideas? Thanks!
The site has anti-bot protection; you need to set a browser User-Agent instead of the default Python user agent:
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0')]
urllib.request.install_opener(opener)
url=....
install_opener makes every subsequent urlopen and urlretrieve call send these headers.

And you have to replace href with url in:

if not os.path.isfile(filename):
    urlretrieve(url, filename)  # was urlretrieve(href, filename)

Inside the download loop, href still holds the last value assigned in the scraping loop, so every iteration downloaded the same file; that is why all your files had the size of the last URL in the txt file.
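Putting the two fixes together, the download loop would look roughly like this (a sketch only; file.txt, the random sleep, and the download directory all come from the question):

import os
import time
import urllib.request
from random import randint
from urllib.request import urlretrieve

# Pretend to be a browser so the anti-bot check passes
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0')]
urllib.request.install_opener(opener)

DOWNLOADS_DIR = os.getcwd()

for url in open("file.txt"):
    url = url.strip()  # drop the trailing newline
    time.sleep(randint(0, 5))
    name = url.rsplit('/', 1)[-1]
    filename = os.path.join(DOWNLOADS_DIR, name)
    if not os.path.isfile(filename):
        urlretrieve(url, filename)  # url, not the stale href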