python beautifulsoup rows

AttributeError: 'HTMLParser' object has no attribute 'unescape'


I'm trying to extract an HTML table, but the script raises the error below and I have no idea why.

I would really appreciate some help.

Code:

from bs4 import BeautifulSoup
from io import BytesIO
import requests
import datetime
import re
import rows


# date = datetime.datetime.strptime("2013-1-25", '%Y-%m-%d').strftime('%m/%d/%y')
url = 'http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_MEGA.HTM'

response = requests.get(url)
html = response.content


soup = BeautifulSoup(html, 'lxml')
tabela = soup.find("table")

for tag in tabela.find_all('table'):
    _ = tag.replaceWith('')


soup_tr = tabela.findAll("tr")
lista_tr = list(soup_tr)
lista_tr[0] = lista_tr[1]


s = "".join([str(l) for l in lista_tr])
s = "<table>" + s + "</table>"
s = re.sub("(<!--.*?-->)", "", s, flags=re.DOTALL)


table = rows.import_from_html(BytesIO(bytes(s, encoding='utf-8')))

The error output is below:

  File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\megasena.py", line 6, in <module>
    import rows
  File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\venv\lib\site-packages\rows\__init__.py", line 22, in <module>
    import rows.plugins as plugins
  File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\venv\lib\site-packages\rows\plugins\__init__.py", line 24, in <module>
    from . import plugin_html as html
  File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\venv\lib\site-packages\rows\plugins\plugin_html.py", line 43, in <module>
    unescape = HTMLParser().unescape
AttributeError: 'HTMLParser' object has no attribute 'unescape'

Solution
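
On the AttributeError itself: the traceback stops at `unescape = HTMLParser().unescape` inside `rows/plugins/plugin_html.py`, so the failure happens while `rows` is being imported, before any of the scraping code runs. `HTMLParser.unescape()` was deprecated in Python 3.4 and removed in Python 3.9, where `html.unescape()` is the replacement. Upgrading `rows` to a release that no longer calls the removed method is likely the cleaner fix, though whether such a release exists is an assumption to check against the project's changelog. Failing that, a minimal workaround sketch (a monkeypatch, not something `rows` or the standard library provides) is to restore the attribute before `rows` is imported:

```python
import html
from html.parser import HTMLParser

# Workaround sketch: put back the attribute that Python 3.9 removed,
# pointing it at html.unescape. staticmethod keeps HTMLParser().unescape
# from being bound to the parser instance.
if not hasattr(HTMLParser, "unescape"):
    HTMLParser.unescape = staticmethod(html.unescape)

# The same expression that fails in plugin_html.py now works.
unescape = HTMLParser().unescape
print(unescape("Rateio &amp; Acumulado"))
```

These lines have to run before `import rows`, because `rows` reads the attribute at import time.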

  • This doesn't fix the error itself, but there are easier ways to parse tables from websites than the one you've embarked on.

    Here's one of them:

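    from io import StringIO

    import pandas as pd
    import requests

    page = requests.get("http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_MEGA.HTM")
    # read_html returns a list of DataFrames, one per <table> found.
    # Wrapping the text in StringIO avoids the deprecation of literal HTML strings in newer pandas.
    df = pd.read_html(StringIO(page.text), flavor="bs4")
    print(df)
    # to_csv returns None when given a path, so its result is not assigned.
    pd.concat(df).to_csv("your_magnificent_table.csv", index=False)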
    
    

    Output:

    [     Concurso Data Sorteio  1ª Dezena  ...  Rateio_Quadra  Acumulado  Valor_Acumulado
    0           1   11/03/1996          4  ...          33021        SIM      1.714.65023
    1           2   18/03/1996          9  ...          20891        NÃO        750.04891
    2           3   25/03/1996         10  ...          15301        NÃO              000
    3           4   01/04/1996          1  ...          18048        SIM        717.08075
    4           5   08/04/1996          1  ...           9653        SIM      1.342.48885
    ..        ...          ...        ...  ...            ...        ...              ...
    397       398   21/09/2002         28  ...          14129        NÃO              000
    398       399   25/09/2002         59  ...          22501        SIM      5.676.17141
    399       400   28/09/2002         29  ...          20314        SIM      6.869.04791
    400       401   02/10/2002         50  ...          28818        SIM      7.859.38989
    401       402   05/10/2002         27  ...          14808        SIM      9.248.37354
    
    [402 rows x 16 columns]]
    

    Or, if you prefer, here's part of the resulting .csv file:

    [Image: part of the .csv file]


    By the way, parsing HTML with regular expressions is generally frowned upon and considered a poor choice, because a regex cannot reliably handle nested or malformed markup.
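
To illustrate the point about regular expressions, here is a minimal sketch of removing HTML comments with BeautifulSoup itself instead of `re.sub`. The sample markup is hypothetical and only stands in for the downloaded page; `Comment` is the class bs4 uses for comment nodes.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup standing in for the downloaded page.
html_text = """
<table>
  <!-- header row -->
  <tr><th>Concurso</th><th>Data Sorteio</th></tr>
  <!-- a comment that
       spans two lines -->
  <tr><td>1</td><td>11/03/1996</td></tr>
</table>
"""

soup = BeautifulSoup(html_text, "html.parser")

# Find every comment node in the tree and detach it.
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()

cleaned = str(soup)

# Collect the cell text row by row to show the table is still intact.
table_rows = [
    [cell.get_text() for cell in tr.find_all(["th", "td"])]
    for tr in soup.find_all("tr")
]
print(table_rows)
```

Because the parser already knows where each comment starts and ends, no pattern or `re.DOTALL` flag is needed for multi-line comments.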