I am trying to scrape data from a webpage that shows a limited number of records at a time and requires the user to click a button to navigate to the next set of records. The webpage achieves that by sending GET requests to itself.
I wrote some Python code that sends a GET request to the page, hoping to get the next set of results, and planned to use a for loop to retrieve the subsequent pages, but I always get the initial set (apparently the website is ignoring my params).
This is the website I am targeting: https://portaltransparencia.procempa.com.br/portalTransparencia/despesaLancamentoPesquisa.do?viaMenu=true&entidade=PROCEMPA
This is my code:
import requests

url = "https://portaltransparencia.procempa.com.br/portalTransparencia/despesaLancamentoPesquisa.do?viaMenu=true&entidade=PROCEMPA"
r_params = {
    "perform": "view",
    "actionForward": "success",
    "validate": True,
    "pesquisar": True,
    "defaultSearch.pageSize": 23,
    "defaultSearch.currentPage": 2,
}
page = requests.get(url, params=r_params)
I expected this to return the data from the second page, but the response contains the data from the first page.
You can scrape this site using requests. If you open the site in a browser with the developer tools open, then on the Network tab, by clicking the first loaded document, you will see that the requests for the second and subsequent pages are sent with the POST method. You can also find all the form data for these requests there. Below is an example for the first four pages, using the headers from my browser and a Session object.
import requests
url_1 = "https://portaltransparencia.procempa.com.br/portalTransparencia/despesaLancamentoPesquisa.do?viaMenu=true&entidade=PROCEMPA"
url_2 = "https://portaltransparencia.procempa.com.br/portalTransparencia/despesaLancamentoPesquisa.do"
data = {
    "perform": "view",
    "actionForward": "success",
    "strutsFormName": "despesaLancamentoPesquisaForm",
    "validate": True,
    "pesquisar": True,
    "defaultSearch.pageSize": 23,
}
headers_1 = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7",
    "Cache-Control": "no-cache",
    "Connection": "keep-alive",
    "Cookie": "JSESSIONID=3F197220F9918E6664A9447744F980A2.lpmpa-app02; __utmc=192633643; __utmz=192633643.1695539321.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); style=default; __utma=192633643.722096071.1695539321.1695572080.1695577362.7; __utmt=1; __utmb=192633643.3.10.1695577362",
    "Host": "portaltransparencia.procempa.com.br",
    "Pragma": "no-cache",
    "Sec-Ch-Ua": '"Chromium";v="116", "Not)A;Brand";v="24", "Google Chrome";v="116"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Windows"',
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "cross-site",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36",
}
# Headers for the POST requests: same as headers_1 plus the form-specific ones.
headers_2 = headers_1.copy()
headers_2.update({
    "Content-Length": "583",
    "Content-Type": "application/x-www-form-urlencoded",
    "Origin": "https://portaltransparencia.procempa.com.br",
    "Sec-Fetch-Site": "same-origin",
})
session = requests.Session()
# The first page is a plain GET request.
page_1 = session.get(url_1, headers=headers_1, timeout=5)
if page_1.status_code == 200:
    with open("ff_1.html", "w", encoding="utf-8") as file:
        file.write(page_1.text)
else:
    print("Failed to fetch the content. Status code:", page_1.status_code)

# Subsequent pages are POST requests carrying the page number in the form data.
for i in range(2, 5):  # pages 2 to 4; the site had 1016 pages
    data["defaultSearch.currentPage"] = i
    page = session.post(url_2, data=data, headers=headers_2, timeout=5)
    if page.status_code == 200:
        with open(f"ff_{i}.html", "w", encoding="utf-8") as file:
            file.write(page.text)
    else:
        print("Failed to fetch the content. Status code:", page.status_code)
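A side note on the hard-coded "Content-Length": "583" in headers_2: that value was copied from one browser request, while the form body built from data is shorter and grows with the page number. requests normally computes this header itself from the encoded body, so it is probably safe to drop it. The sketch below only illustrates how the body length changes; build_form_body is a hypothetical helper, and it uses urllib.parse.urlencode from the standard library as a stand-in for the encoding requests performs.

```python
from urllib.parse import urlencode

# Same form fields as in the answer above.
FORM_FIELDS = {
    "perform": "view",
    "actionForward": "success",
    "strutsFormName": "despesaLancamentoPesquisaForm",
    "validate": True,
    "pesquisar": True,
    "defaultSearch.pageSize": 23,
}


def build_form_body(page_number):
    """Return the URL-encoded POST body for one page (hypothetical helper)."""
    fields = dict(FORM_FIELDS)  # copy, so the shared dict is not modified
    fields["defaultSearch.currentPage"] = page_number
    return urlencode(fields)


body_2 = build_form_body(2)
body_10 = build_form_body(10)
print(body_2)
# The body for page 10 is one character longer than the body for page 2,
# so a single fixed Content-Length cannot be right for every page.
print(len(body_2), len(body_10))
```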
Here are the first table rows of the saved pages:
ff_1.html
.....................
<tr id="linha0" class="oddLineSearchList" onmouseover="this.className='selectedLineSearchList';" onmouseout="this.className='oddLineSearchList';">
<td align="center" valign="top" title="Data Pagto">31/08/2023</td>
<td align="" valign="top" title="CPF/CNPJ">89.398.473/0001-00</td>
<td align="" valign="top" title="Nome Favorecido">COMPANHIA DE PROCESSAMENTO DE DADOS DO MUNICIPIO DE PORTO AL</td>
<td align="" valign="top" title="Processo">S/N</td>
<td align="" valign="top" title="Descrição Despesa">DESPESAS FINAN. C/EMPRESTIMOS E FINANCIAMENTOS</td>
<td align="" valign="top" title="<div style='text-align:center'>Nº Nota fiscal ou Doc.</div>">1-3/48</td>
<td align="right" valign="top" title="<div style='text-align:center'>Valor Bruto (R$)</div>">278.812,51</td>
<td align="right" valign="top" title="<div style='text-align:center'>Valor Retido (R$)</div>">0,00</td>
<td align="right" valign="top" title="<div style='text-align:center'>Valor Líquido (R$)</div>">278.812,51</td>
........................
ff_2.html
.......................
<tr id="linha0" class="oddLineSearchList" onmouseover="this.className='selectedLineSearchList';" onmouseout="this.className='oddLineSearchList';">
<td align="center" valign="top" title="Data Pagto">21/08/2023</td>
<td align="" valign="top" title="CPF/CNPJ">89.398.473/0001-00</td>
<td align="" valign="top" title="Nome Favorecido">COMPANHIA DE PROCESSAMENTO DE DADOS DO MUNICIPIO DE PORTO AL</td>
<td align="" valign="top" title="Processo">S/N</td>
<td align="" valign="top" title="Descrição Despesa">SALÁRIOS-DEMAIS</td>
<td align="" valign="top" title="<div style='text-align:center'>Nº Nota fiscal ou Doc.</div>">082023</td>
<td align="right" valign="top" title="<div style='text-align:center'>Valor Bruto (R$)</div>">10.825,00</td>
<td align="right" valign="top" title="<div style='text-align:center'>Valor Retido (R$)</div>">0,00</td>
<td align="right" valign="top" title="<div style='text-align:center'>Valor Líquido (R$)</div>">10.825,00</td>
..................
ff_3.html
........................
<tr id="linha0" class="oddLineSearchList" onmouseover="this.className='selectedLineSearchList';" onmouseout="this.className='oddLineSearchList';">
<td align="center" valign="top" title="Data Pagto">18/08/2023</td>
<td align="" valign="top" title="CPF/CNPJ">00.394.460/0058-87</td>
<td align="" valign="top" title="Nome Favorecido">MINISTERIO DA FAZENDA</td>
<td align="" valign="top" title="Processo">S/N</td>
<td align="" valign="top" title="Descrição Despesa">ALUGUEL DE DEPOSITO</td>
<td align="" valign="top" title="<div style='text-align:center'>Nº Nota fiscal ou Doc.</div>">202306-IRRF</td>
<td align="right" valign="top" title="<div style='text-align:center'>Valor Bruto (R$)</div>">315,77</td>
<td align="right" valign="top" title="<div style='text-align:center'>Valor Retido (R$)</div>">0,00</td>
<td align="right" valign="top" title="<div style='text-align:center'>Valor Líquido (R$)</div>">315,77</td>
.......................
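To turn the saved pages into data, the rows can be pulled out of the HTML. The following is a minimal sketch using only the standard library's html.parser; it assumes every data row is a tr element whose id starts with "linha" and that the rows are closed with a closing tr tag, as the excerpts above suggest. RowParser, parse_rows and parse_brl are names made up for this example, and SAMPLE_HTML is a shortened copy of the first row of ff_1.html.

```python
from decimal import Decimal
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Collect the cell texts of every <tr id="linhaN"> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr" and (attrs.get("id") or "").startswith("linha"):
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_rows(html):
    """Return the table rows found in one saved page."""
    parser = RowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def parse_brl(text):
    """Convert a Brazilian-formatted amount such as '278.812,51' to Decimal."""
    return Decimal(text.replace(".", "").replace(",", "."))


SAMPLE_HTML = """
<tr id="linha0" class="oddLineSearchList">
<td align="center" valign="top" title="Data Pagto">31/08/2023</td>
<td align="" valign="top" title="Processo">S/N</td>
<td align="right" valign="top" title="<div style='text-align:center'>Valor Bruto (R$)</div>">278.812,51</td>
</tr>
"""

rows = parse_rows(SAMPLE_HTML)
print(rows)
print(parse_brl(rows[0][2]))
```

For a real run, the same function could be applied to the text of each ff_N.html file. A library such as BeautifulSoup would likely make this shorter, at the cost of an extra dependency.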
Of course, this will run faster than Selenium, but you will have to configure the program and catch exceptions yourself.
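As one possible way to handle the exceptions mentioned above, here is a small retry sketch. fetch_with_retries is a hypothetical helper, not part of requests; it takes any zero-argument callable, so the retry logic can be tried without touching the network. The retry count and the linear back-off are arbitrary choices for illustration.

```python
import time


def fetch_with_retries(send, retry_on=(Exception,), retries=3, delay=1.0):
    """Call send() until it succeeds or the attempts run out.

    send     -- zero-argument callable that performs one request
    retry_on -- tuple of exception types that trigger another attempt
    retries  -- maximum number of attempts
    delay    -- base pause in seconds, multiplied by the attempt number
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return send()
        except retry_on as error:
            last_error = error
            if attempt < retries:
                time.sleep(delay * attempt)  # simple linear back-off
    raise last_error


# Offline demonstration: a callable that fails twice, then succeeds.
calls = []


def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = fetch_with_retries(flaky, retry_on=(ConnectionError,), delay=0)
print(result, len(calls))

# With the session from the answer, a call might look like this (not run here):
#   page = fetch_with_retries(
#       lambda: session.post(url_2, data=data, headers=headers_2, timeout=5),
#       retry_on=(requests.exceptions.RequestException,),
#   )
```

requests.exceptions.RequestException is the base class of the exceptions requests raises, so catching it covers timeouts and connection errors alike. A non-200 status code does not raise by default and still needs the status check shown in the loop.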