Web scrape ISBN info from a Brazilian website


I'm trying to extract some tags with Beautiful Soup to generate a BibTeX entry from the data.

The Brazilian ISBN site shows the information for an ISBN when accessed from a browser, but when I tried urlopen and requests I got an HTTPError with code 500. The same error happened in the browser, and it only went away after I closed the tab and opened the same link in a new one.

The website asks for a captcha. I think the first search has to go through the captcha, and after that just changing the ISBN in the URL will work.

After that, opening 'link+isbn' shows the information about the book. I'm trying to use this 'link+isbn' URL to scrape the page with Beautiful Soup.

Link that works: http://www.isbn.bn.br/website/consulta/cadastro/isbn/9788521208037 (do a first search at 'www.isbn. ... /cadastro' first, because of the captcha)
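If the captcha really is tied to the browser session, one way to test the idea is to send the browser's session cookie with every request. Below is a minimal sketch using only the standard library. It assumes the site tracks the solved captcha through a JSESSIONID cookie, which has not been confirmed here; build_request, fetch_html and session_id are hypothetical names used for illustration.

```python
from urllib.request import Request, urlopen

BASE_URL = 'http://www.isbn.bn.br/website/consulta/cadastro/isbn/'


def build_request(isbn, session_id):
    """Build a request for one ISBN that reuses an existing browser session."""
    headers = {
        'User-Agent': 'Mozilla/5.0',
        # Assumption: the site remembers the solved captcha via this cookie.
        'Cookie': 'JSESSIONID=' + session_id,
    }
    return Request(BASE_URL + isbn, headers=headers)


def fetch_html(isbn, session_id):
    """Download the page for one ISBN (raises urllib.error.HTTPError on a 500)."""
    with urlopen(build_request(isbn, session_id), timeout=30) as response:
        return response.read().decode('utf-8', errors='replace')
```

The session_id value would have to be copied from the browser's cookies after a first search that passed the captcha, and it would presumably stop working once that session expires.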

I have tried several versions of the code, and for now I'm just trying to get the HTML of the page without the 500 error.

import sys
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

BRbase = 'http://www.isbn.bn.br/website/consulta/cadastro/isbn/'

Lista_ISBN = ['9788542209402',
              '9788542206937',
              '9788521208037']

failed = False
for isbn in Lista_ISBN:
    page = BRbase + isbn
    url = Request(page, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        # This is the call that raises HTTPError 500
        html = urlopen(url).read()
        # code to beautiful soup and generate bibtex
        print(page)
        print(html)
    except (HTTPError, URLError) as err:
        print('ISBN {} not found ({})'.format(isbn, err))
        failed = True

# Exit with a non-zero status only if at least one lookup failed
sys.exit(1 if failed else 0)

Solution

    import requests
    from bs4 import BeautifulSoup

    # Session cookie, presumably copied from a browser session that already
    # passed the captcha; replace it with the value from your own session.
    headers = {"Cookie": 'JSESSIONID=60F8CDFBD408299B40C7E7C2459DC624'}

    isbn = ['9788542209402', '9788542206937', '9788521208037']

    for item in isbn:
        print(f"{'*'*20}Extracting ISBN# {item}{'*'*20}")
        r = requests.get(
            f"http://www.isbn.bn.br/website/consulta/cadastro/isbn/{item}", headers=headers)
        soup = BeautifulSoup(r.text, 'html.parser')
        # Each <strong> holds a field label; its parent holds label plus value
        for tag in soup.find_all('strong')[2:10]:
            print(tag.parent.get_text(strip=True))
    

    Output:

    ********************Extracting ISBN# 9788542209402********************
    ISBN978-85-422-0940-2
    TítuloSPQR
    Edição1
    Ano Edição2017
    Tipo de SuportePapel
    Páginas448
    Editor(a)Planeta
    ParticipaçõesMary Beard ( Autor)Luiz Gil Reyes (Tradutor)
    ********************Extracting ISBN# 9788542206937********************
    ISBN978-85-422-0693-7
    TítuloEm nome de Roma
    Edição1
    Ano Edição2016
    Tipo de SuportePapel
    Páginas560
    Editor(a)Planeta
    ParticipaçõesAdrian Goldsworthy ( Autor)Claudio Blanc (Tradutor)
    ********************Extracting ISBN# 9788521208037********************
    ISBN978-85-212-0803-7
    TítuloCurso de física básica: ótica, relatividade e física quântica
    Edição2
    Ano Edição2014
    Tipo de SuportePapel
    Páginas0
    Editor(a)Blucher
    ParticipaçõesH. Moysés Nussenzveig ( Autor)
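
Since the goal was a BibTeX entry, the printed lines can be turned into one with a bit of string handling. This is only a rough sketch: it assumes every line starts with one of the labels seen in the output above, and the mapping to BibTeX fields and the citation key format are my own choices. parse_fields and to_bibtex are hypothetical helper names.

```python
import re

# Field labels exactly as they appear in the scraped output
LABELS = ['ISBN', 'Título', 'Edição', 'Ano Edição', 'Tipo de Suporte',
          'Páginas', 'Editor(a)', 'Participações']


def parse_fields(lines):
    """Split each 'LabelValue' line into a {label: value} dictionary."""
    fields = {}
    for line in lines:
        for label in LABELS:
            if line.startswith(label):
                fields[label] = line[len(label):].strip()
                break
    return fields


def to_bibtex(fields):
    """Format the parsed fields as a @book entry, skipping empty values."""
    # Keep only the names marked '( Autor)'; translators are ignored here
    authors = [name.strip() for name in re.findall(
        r'([^()]+?)\s*\(\s*Autor\s*\)', fields.get('Participações', ''))]
    isbn = fields.get('ISBN', '')
    key = 'isbn' + isbn.replace('-', '')  # citation key format is arbitrary
    entry = [
        ('author', ' and '.join(authors)),
        ('title', fields.get('Título', '')),
        ('edition', fields.get('Edição', '')),
        ('year', fields.get('Ano Edição', '')),
        ('publisher', fields.get('Editor(a)', '')),
        ('isbn', isbn),
    ]
    body = ',\n'.join('  {} = {{{}}}'.format(name, value)
                      for name, value in entry if value)
    return '@book{{{},\n{}\n}}'.format(key, body)


sample = ['ISBN978-85-422-0940-2', 'TítuloSPQR', 'Edição1', 'Ano Edição2017',
          'Tipo de SuportePapel', 'Páginas448', 'Editor(a)Planeta',
          'ParticipaçõesMary Beard ( Autor)Luiz Gil Reyes (Tradutor)']
print(to_bibtex(parse_fields(sample)))
```

In the scraping loop you would collect the strings into a list instead of printing them, then pass that list to parse_fields. The page count is left out on purpose, since the site reports 0 pages for one of the books.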