I'm trying to extract an HTML table, but the script raises an error and I have no idea why.
I really need some help here.
Code:
from bs4 import BeautifulSoup
from io import BytesIO
import requests
import datetime
import re
import rows
# date = datetime.datetime.strptime("2013-1-25", '%Y-%m-%d').strftime('%m/%d/%y')
url = 'http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_MEGA.HTM'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'lxml')
tabela = soup.find("table")
for tag in tabela.find_all('table'):
    _ = tag.replaceWith('')
soup_tr = tabela.findAll("tr")
lista_tr = list(soup_tr)
lista_tr[0] = lista_tr[1]
s = "".join([str(l) for l in lista_tr])
s = "<table>" + s + "</table>"
s = re.sub("(<!--.*?-->)", "", s, flags=re.DOTALL)
table = rows.import_from_html(BytesIO(bytes(s, encoding='utf-8')))
Output error below:
File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\megasena.py", line 6, in <module>
import rows
File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\venv\lib\site-packages\rows\__init__.py", line 22, in <module>
import rows.plugins as plugins
File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\venv\lib\site-packages\rows\plugins\__init__.py", line 24, in <module>
from . import plugin_html as html
File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\venv\lib\site-packages\rows\plugins\plugin_html.py", line 43, in <module>
unescape = HTMLParser().unescape
AttributeError: 'HTMLParser' object has no attribute 'unescape'
This doesn't really solve your error, but there are other, easier ways of parsing tables from websites than the one you've embarked on.
Here's one of them:
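import io
import pandas as pd
import requests

page = requests.get("http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_MEGA.HTM")
# read_html returns a list of DataFrames, one per <table> it finds
df = pd.read_html(io.StringIO(page.text), flavor="bs4")
print(df)
# to_csv returns None when given a path, so the result is not assigned back to df
pd.concat(df).to_csv("your_magnificent_table.csv", index=False)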
Output:
[ Concurso Data Sorteio 1ª Dezena ... Rateio_Quadra Acumulado Valor_Acumulado
0 1 11/03/1996 4 ... 33021 SIM 1.714.65023
1 2 18/03/1996 9 ... 20891 NÃO 750.04891
2 3 25/03/1996 10 ... 15301 NÃO 000
3 4 01/04/1996 1 ... 18048 SIM 717.08075
4 5 08/04/1996 1 ... 9653 SIM 1.342.48885
.. ... ... ... ... ... ... ...
397 398 21/09/2002 28 ... 14129 NÃO 000
398 399 25/09/2002 59 ... 22501 SIM 5.676.17141
399 400 28/09/2002 29 ... 20314 SIM 6.869.04791
400 401 02/10/2002 50 ... 28818 SIM 7.859.38989
401 402 05/10/2002 27 ... 14808 SIM 9.248.37354
[402 rows x 16 columns]]
Or, if you prefer, here's a .csv file (actually, a part of it):
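As for the AttributeError itself: the last frame of the traceback shows rows calling HTMLParser().unescape, and that method was removed from the standard library in Python 3.9. So the likely cause is that the installed rows release predates that removal, and the options would be running it on Python 3.8 or older, or installing a rows version that carries the fix if one is available to you (that part is an assumption, I haven't checked the releases). The replacement in the standard library is html.unescape. A minimal sketch of the difference, where the sample string is made up for illustration:

```python
import html
from html.parser import HTMLParser

# True on Python 3.8 and older, False on 3.9+ where the method was removed
has_old_api = hasattr(HTMLParser(), "unescape")
print("HTMLParser().unescape available:", has_old_api)

# html.unescape is the standard-library replacement (available since Python 3.4)
sample = "&lt;td&gt;R&#36; 1&amp;2&lt;/td&gt;"  # made-up sample, not from the page
decoded = html.unescape(sample)
print(decoded)
```

If the first line prints False, the interpreter is too new for the rows version in your venv, which would match the traceback above.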
By the way, parsing HTML by means of regular expressions is rather frowned upon and considered a poor choice. Here's more on the topic.
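For instance, the re.sub call in the question only strips HTML comments, and BeautifulSoup can do that on the parsed tree instead. This is just a sketch under the assumption that removing comments is all that step is for; the sample markup is invented, and "html.parser" is used here only so the example has no extra dependencies:

```python
from bs4 import BeautifulSoup, Comment

# Invented sample markup standing in for the downloaded page
markup = "<table><!-- note --><tr><td>1</td></tr></table>"
soup = BeautifulSoup(markup, "html.parser")

# Comments are parsed as Comment nodes, so they can be found and removed
# from the tree without any regular expression
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()

cleaned = str(soup)
print(cleaned)
```

The same loop should work on the tabela object from the question before it is turned back into a string.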