
Navigate a JavaScript form via Python Requests


I am trying to scrape data from a webpage that shows only a limited number of records at a time and requires the user to click a button to load the next set. The page appears to do this by sending GET requests to itself.

I wrote Python code that sends a GET request to the page, hoping to get the next set of results, and planned to use a for loop to retrieve the subsequent pages. However, I always get the initial set back (apparently the website is ignoring my params).

This is the website I am targeting: https://portaltransparencia.procempa.com.br/portalTransparencia/despesaLancamentoPesquisa.do?viaMenu=true&entidade=PROCEMPA

This is my code:

url = "https://portaltransparencia.procempa.com.br/portalTransparencia/despesaLancamentoPesquisa.do?viaMenu=true&entidade=PROCEMPA"

r_params = {
    "perform": "view",
    "actionForward": "success",
    "validate": True,
    "pesquisar": True,
    "defaultSearch.pageSize":23,
    "defaultSearch.currentPage": 2
    }
page = requests.get(url, params=r_params)

I expected this to return the data from the 2nd page, but the response contains the data from the first page.


Solution

  • You can scrape this site with requests. Open the page in a browser with the developer tools active and look at the Network tab: clicking the document request issued when you move to another page shows that the second and subsequent pages are requested with the POST method, not GET. The form fields sent with each request are listed there too. The example fetches the first four pages using a Session object and headers copied from my browser.

    import requests
    
    url_1 = "https://portaltransparencia.procempa.com.br/portalTransparencia/despesaLancamentoPesquisa.do?viaMenu=true&entidade=PROCEMPA"
    url_2 = "https://portaltransparencia.procempa.com.br/portalTransparencia/despesaLancamentoPesquisa.do"
    
    data = {
        "perform": "view",
        "actionForward": "success",
        "strutsFormName": "despesaLancamentoPesquisaForm",
        "validate": True,
        "pesquisar": True,
        "defaultSearch.pageSize": 23
        }
    
    headers_1 = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        # No hard-coded "Cookie" header: the Session stores whatever cookies the
        # first GET sets (such as JSESSIONID) and sends them back with each POST.
        "Host": "portaltransparencia.procempa.com.br",
        "Pragma": "no-cache",
        "Sec-Ch-Ua": '"Chromium";v="116","Not)A;Brand";v="24","Google Chrome";v="116"',
        "Sec-Ch-Ua-Mobile": "?0",
        "Sec-Ch-Ua-Platform": '"Windows"',
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "cross-site",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
        }
    
    # Extra headers for the POST requests. Content-Length is left out on purpose:
    # requests computes it from the encoded form data.
    headers_2 = headers_1.copy()
    headers_2.update({"Content-Type": "application/x-www-form-urlencoded",
                      "Origin": "https://portaltransparencia.procempa.com.br",
                      "Sec-Fetch-Site": "same-origin"})
    
    
    session = requests.Session()
    page_1 = session.get(url_1, headers=headers_1, timeout=5)
    
    if page_1.status_code == 200:
        with open("ff_1.html", 'w', encoding="utf-8") as file:
            file.write(page_1.text)
    else:
        print("Failed to fetch the content. Status code:", page_1.status_code)
    
    for i in range(2, 5):  # pages 2-4; use range(2, 1017) for all 1016 pages
    
        data["defaultSearch.currentPage"] = i
    
        page = session.post(url_2, data=data, headers=headers_2, timeout=5)
    
        if page.status_code == 200:
            with open(f"ff_{i}.html", 'w', encoding="utf-8") as file:
                file.write(page.text)
        else:
            print("Failed to fetch the content. Status code:", page.status_code)
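
To see why the GET attempt in the question and the POST requests above behave differently, both can be built without being sent. This is a minimal sketch using requests' prepared requests; only two of the form fields are included for brevity, and the variable names are illustrative.

```python
import requests

URL = "https://portaltransparencia.procempa.com.br/portalTransparencia/despesaLancamentoPesquisa.do"
fields = {"defaultSearch.pageSize": 23, "defaultSearch.currentPage": 2}

# Build both requests without sending them, to inspect what would go on the wire.
as_get = requests.Request("GET", URL, params=fields).prepare()
as_post = requests.Request("POST", URL, data=fields).prepare()

print(as_get.url)    # the fields are appended to the query string
print(as_get.body)   # None: this GET carries no body
print(as_post.url)   # the URL is unchanged
print(as_post.body)  # the fields are form-encoded into the request body
print(as_post.headers["Content-Type"])    # set by requests for form data
print(as_post.headers["Content-Length"])  # computed by requests from the body
```

With `params=` the paging fields end up in the URL, while with `data=` they travel in the body as `application/x-www-form-urlencoded`, which is how the browser submits them according to the Network tab.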
    

    The first rows of the tables in the pages received:

    ff_1.html
    
    .....................
    <tr id="linha0" class="oddLineSearchList" onmouseover="this.className='selectedLineSearchList';" onmouseout="this.className='oddLineSearchList';">
            <td align="center" valign="top" title="Data Pagto">31/08/2023</td>
            <td align="" valign="top" title="CPF/CNPJ">89.398.473/0001-00</td>
            <td align="" valign="top" title="Nome Favorecido">COMPANHIA DE PROCESSAMENTO DE DADOS DO MUNICIPIO DE PORTO AL</td>
            <td align="" valign="top" title="Processo">S/N</td>
            <td align="" valign="top" title="Descrição Despesa">DESPESAS FINAN. C/EMPRESTIMOS E FINANCIAMENTOS</td>
            <td align="" valign="top" title="<div style='text-align:center'>Nº&nbsp;Nota&nbsp;fiscal ou&nbsp;Doc.</div>">1-3/48</td>
            <td align="right" valign="top" title="<div style='text-align:center'>Valor&nbsp;Bruto (R$)</div>">278.812,51</td>
            <td align="right" valign="top" title="<div style='text-align:center'>Valor&nbsp;Retido (R$)</div>">0,00</td>
            <td align="right" valign="top" title="<div style='text-align:center'>Valor&nbsp;Líquido (R$)</div>">278.812,51</td>
    ........................
    
    ff_2.html
    
    .......................
    <tr id="linha0" class="oddLineSearchList" onmouseover="this.className='selectedLineSearchList';" onmouseout="this.className='oddLineSearchList';">
            <td align="center" valign="top" title="Data Pagto">21/08/2023</td>
            <td align="" valign="top" title="CPF/CNPJ">89.398.473/0001-00</td>
            <td align="" valign="top" title="Nome Favorecido">COMPANHIA DE PROCESSAMENTO DE DADOS DO MUNICIPIO DE PORTO AL</td>
            <td align="" valign="top" title="Processo">S/N</td>
            <td align="" valign="top" title="Descrição Despesa">SALÁRIOS-DEMAIS</td>
            <td align="" valign="top" title="<div style='text-align:center'>Nº&nbsp;Nota&nbsp;fiscal ou&nbsp;Doc.</div>">082023</td>
            <td align="right" valign="top" title="<div style='text-align:center'>Valor&nbsp;Bruto (R$)</div>">10.825,00</td>
            <td align="right" valign="top" title="<div style='text-align:center'>Valor&nbsp;Retido (R$)</div>">0,00</td>
            <td align="right" valign="top" title="<div style='text-align:center'>Valor&nbsp;Líquido (R$)</div>">10.825,00</td>
    ..................
    
    ff_3.html
    
    ........................
    <tr id="linha0" class="oddLineSearchList" onmouseover="this.className='selectedLineSearchList';" onmouseout="this.className='oddLineSearchList';">
            <td align="center" valign="top" title="Data Pagto">18/08/2023</td>
            <td align="" valign="top" title="CPF/CNPJ">00.394.460/0058-87</td>
            <td align="" valign="top" title="Nome Favorecido">MINISTERIO DA FAZENDA</td>
            <td align="" valign="top" title="Processo">S/N</td>
            <td align="" valign="top" title="Descrição Despesa">ALUGUEL DE DEPOSITO</td>
            <td align="" valign="top" title="<div style='text-align:center'>Nº&nbsp;Nota&nbsp;fiscal ou&nbsp;Doc.</div>">202306-IRRF</td>
            <td align="right" valign="top" title="<div style='text-align:center'>Valor&nbsp;Bruto (R$)</div>">315,77</td>
            <td align="right" valign="top" title="<div style='text-align:center'>Valor&nbsp;Retido (R$)</div>">0,00</td>
            <td align="right" valign="top" title="<div style='text-align:center'>Valor&nbsp;Líquido (R$)</div>">315,77</td>
    .......................
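
Once the pages are saved, the rows can be turned into records. The following is a rough sketch using only the standard library. It assumes that every data row is a `<tr>` whose id starts with `linha` and that each `<td>` carries a `title` attribute, as in the excerpts above; `RowParser`, `clean_title`, `parse_brl` and `parse_date` are hypothetical helper names, and the sample is a shortened copy of the first row of ff_1.html.

```python
import re
from datetime import datetime
from decimal import Decimal
from html.parser import HTMLParser


def clean_title(title):
    """Strip the markup and non-breaking spaces that some title attributes contain."""
    return re.sub(r"<[^>]+>", "", title).replace("\xa0", " ").strip()


class RowParser(HTMLParser):
    """Collect one dict per <tr id="linhaN">, keyed by each cell's cleaned title."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row being filled, or None outside a data row
        self._key = None    # key of the <td> currently open, or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr" and (attrs.get("id") or "").startswith("linha"):
            self._row = {}
        elif tag == "td" and self._row is not None:
            # Fall back to a positional key if a cell has no usable title.
            self._key = clean_title(attrs.get("title") or "") or f"col{len(self._row)}"
            self._row[self._key] = ""

    def handle_data(self, data):
        if self._row is not None and self._key is not None:
            self._row[self._key] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._key = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_brl(text):
    """Convert a Brazilian-formatted amount such as '278.812,51' to a Decimal."""
    return Decimal(text.replace(".", "").replace(",", "."))


def parse_date(text):
    """Convert a dd/mm/yyyy string to a date."""
    return datetime.strptime(text, "%d/%m/%Y").date()


SAMPLE = """
<tr id="linha0" class="oddLineSearchList">
  <td align="center" valign="top" title="Data Pagto">31/08/2023</td>
  <td align="" valign="top" title="Processo">S/N</td>
  <td align="right" valign="top" title="<div style='text-align:center'>Valor&nbsp;Bruto (R$)</div>">278.812,51</td>
</tr>
"""

parser = RowParser()
parser.feed(SAMPLE)
parser.close()
row = parser.rows[0]
print(parse_date(row["Data Pagto"]), row["Processo"], parse_brl(row["Valor Bruto (R$)"]))
```

For the real files, read each `ff_N.html` and pass its contents to `feed()` instead of `SAMPLE`. A library such as BeautifulSoup would work equally well here; the standard-library parser is used only to keep the sketch dependency-free.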
    

    This will of course run faster than Selenium, but you will have to work out the request parameters yourself and catch exceptions such as timeouts and connection errors.
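
One possible way to handle those exceptions is a small retry wrapper around the POST call. This is only a sketch: `fetch_page` is a hypothetical helper, and the retry count, back-off and timeout values are arbitrary assumptions rather than anything the site requires.

```python
import time

import requests


def fetch_page(session, url, data, retries=3, backoff=2.0, timeout=10):
    """POST `data` to `url`, retrying on network errors and HTTP error statuses.

    Returns the response text, or None if every attempt failed.
    """
    for attempt in range(1, retries + 1):
        try:
            response = session.post(url, data=data, timeout=timeout)
            response.raise_for_status()  # turn 4xx/5xx responses into an exception
            return response.text
        except requests.RequestException as exc:
            # Covers timeouts, connection errors and the HTTPError raised above.
            print(f"Attempt {attempt}/{retries} failed: {exc}")
            if attempt < retries:
                time.sleep(backoff * attempt)  # wait a little longer each time
    return None
```

In the loop above, `page = session.post(...)` could then become `html = fetch_page(session, url_2, data)`, writing the file only when `html` is not None. A short pause between pages is also worth considering when fetching all of them.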