
Trying to make a POST request using Scrapy


I'm a beginner in web scraping. My goal is to scrape the site 'https://buscatextual.cnpq.br/buscatextual/busca.do'. It is a scientific site, so I need to check the box "Assunto(Título ou palavra chave da produção)" and type the word "grafos" into the main input of the page. How can I do this using Scrapy? I have been trying with the following code, but I got several errors, and I have never dealt with POST requests before.

import scrapy

class LattesSpider(scrapy.Spider):
    name = 'lattesspider'
    login_url = 'https://buscatextual.cnpq.br/buscatextual/busca.do'
    start_urls = [login_url]

    
    def parse(self, response):
        data = {'filtros.buscaAssunto': True,
                'textoBusca': 'grafos'}
        yield scrapy.FormRequest(url=self.login_url, formdata=data, callback=self.parse_profiles)
    
    def parse_profiles(self, response):
        yield {'url': response.url,
               'nome': response.xpath("//a/text()").get()
               }

Solution

  • If Scrapy feels difficult or unfamiliar and you find it hard to locate elements on the page, I suggest using Playwright instead. Playwright is a much newer library than Scrapy, and it drives a real browser, which makes it very easy to locate buttons, check checkboxes, and fill text boxes using either CSS selectors or XPath. I have put a link to the installation instructions and documentation at the bottom of my answer.

    Here's some example code I pulled together that should work:

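    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # headless=False opens a visible browser window, handy while testing
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto('https://buscatextual.cnpq.br/buscatextual/busca.do')
        # Tick the "Assunto" checkbox
        page.locator('input#buscaAssunto').check()
        # Type the search term into the main input
        page.locator('input#textoBusca').fill('grafos')
        # Pause for 5 seconds so the result is visible before the browser closes
        page.wait_for_timeout(5000)
        browser.close()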
    

    Here I used CSS selectors, but you could also use XPath; Playwright accepts both. Note that I launched Chromium here; each browser needs a different launch line:

    Chromium: browser = p.chromium.launch()
    Chrome: browser = p.chromium.launch(channel="chrome")
    Msedge: browser = p.chromium.launch(channel="msedge")
    Firefox: browser = p.firefox.launch()
    Webkit: browser = p.webkit.launch()

    Replace the launch line in the example with the one for the browser you want to use.
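If you switch browsers often, the launch lines listed above can be wrapped in a small helper. This is only a sketch: `BROWSERS` and `launch_browser` are hypothetical names, not part of Playwright, and the helper simply forwards to the `launch()` calls already shown.

```python
# Hypothetical lookup table: friendly name -> (engine attribute, launch kwargs).
# The five entries mirror the launch lines listed above.
BROWSERS = {
    "chromium": ("chromium", {}),
    "chrome": ("chromium", {"channel": "chrome"}),
    "msedge": ("chromium", {"channel": "msedge"}),
    "firefox": ("firefox", {}),
    "webkit": ("webkit", {}),
}


def launch_browser(p, name="chromium", **launch_kwargs):
    """Launch the browser called `name` on the Playwright object `p`.

    Extra keyword arguments (for example headless=False) are passed
    through to launch() unchanged.
    """
    try:
        engine, options = BROWSERS[name]
    except KeyError:
        raise ValueError(f"unknown browser: {name!r}") from None
    # Equivalent to e.g. p.chromium.launch(channel="chrome", headless=False)
    return getattr(p, engine).launch(**{**options, **launch_kwargs})


# Usage (requires Playwright to be installed):
#   from playwright.sync_api import sync_playwright
#   with sync_playwright() as p:
#       browser = launch_browser(p, "firefox", headless=False)
#       ...
#       browser.close()
```

Keeping the names in one table means the rest of the script does not change when you try a different browser.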

    Note that I also passed headless=False, which let me watch the browser open, check the box, and fill in the text (mainly for testing). Remove that argument to run in headless mode, which is the default. I also included page.wait_for_timeout(5000) to wait 5 seconds before closing the browser.
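The fixed 5-second pause is only there so you can watch the browser. For actual scraping you would probably want to submit the search and wait for the results page instead. The sketch below is one way that might look. It rests on two assumptions I have not verified against the site: that pressing Enter in the text box submits the form, and that submitting triggers a full page load. `run_search` is a hypothetical helper name.

```python
def run_search(page, term):
    """Fill in the search form on `page`, submit it, and return the result HTML.

    `page` is a Playwright Page that has already navigated to the search form.
    """
    # Same selectors as in the example above.
    page.locator("input#buscaAssunto").check()
    search_box = page.locator("input#textoBusca")
    search_box.fill(term)
    # Assumption: pressing Enter in the text box submits the form.
    search_box.press("Enter")
    # Assumption: submitting causes a full page load, so wait for the
    # "load" state rather than sleeping for a fixed time.
    page.wait_for_load_state("load")
    return page.content()


# Usage (requires Playwright to be installed):
#   from playwright.sync_api import sync_playwright
#   with sync_playwright() as p:
#       browser = p.chromium.launch()
#       page = browser.new_page()
#       page.goto("https://buscatextual.cnpq.br/buscatextual/busca.do")
#       html = run_search(page, "grafos")
#       browser.close()
```

If the form turns out to need a click on a search button, replace the `press("Enter")` line with a `click()` on that button's locator. The returned HTML can then be parsed with whatever tool you prefer.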

    Playwright installation and documentation: https://playwright.dev/python/docs/intro