I am trying to reach the details page of this site.
To get there in a browser, one should: 1. click Consulta Titulo, 2. select ORO from the Minerals dropdown, 3. click Buscar, and 4. click the very first item in the list.
Dev tools and Fiddler show that I should make a POST request with the item id as the payload, and that this POST request is then redirected to the details page.
In my case I'm being redirected to the homepage instead. What am I missing?
Here is my Scrapy spider:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.shell import inspect_response


class CodeSpider(scrapy.Spider):
    name = "col"
    start_urls = ['http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/index.cmc']
    headers = {
        "Connection": "keep-alive",
        "Cache-Control": "max-age=0",
        "Origin": "http://www.cmc.gov.co:8080",
        "Upgrade-Insecure-Requests": "1",
        "DNT": "1",
        "Content-Type": "application/x-www-form-urlencoded",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1, AppleWebKit/537.36 (KHTML, like Gecko, Chrome/68.0.3440.106 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Referer": "http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/busqueda.cmc",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-US,en;q=0.9,ru;q=0.8,uk;q=0.7",
    }

    def parse(self, response):
        inspect_response(response, self)
        payload = {'expediente': '29', 'tipoSolicitud': ''}
        url = 'http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/busqueda.cmc'
        yield scrapy.FormRequest(url, formdata=payload, headers=self.headers, callback=self.parse, dont_filter=True)
Here is the log showing the redirect:
2018-08-23 13:58:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/index.cmc> from <POST http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/busqueda.cmc>
2018-08-23 13:58:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/index.cmc> (referer: http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/busqueda.cmc)
From what I can see, Scrapy also sets the correct Cookie header before sending the request:
In [2]: request.headers
Out[2]:
{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-US,en;q=0.9,ru;q=0.8,uk;q=0.7',
'Cache-Control': 'max-age=0',
'Connection': 'keep-alive',
'Cookie': 'PHPSESSID=1um6r67md5qpdcqs9g2n15g605',
'Dnt': '1',
'Origin': 'http://www.cmc.gov.co:8080',
'Referer': 'http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/busqueda.cmc',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1, AppleWebKit/537.36 (KHTML, like Gecko, Chrome/68.0.3440.106 Safari/537.36'}
What am I missing?
Moreover, if I use the Postman-generated code with a GET for the details page, it works fine and returns the page. The same request in Scrapy gets redirected:
In [1]: url = "http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/detalleExpedienteTitulo.cmc"
   ...:
   ...: headers = {
   ...:     'upgrade-insecure-requests': "1",
   ...:     'user-agent': "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
   ...:     'dnt': "1",
   ...:     'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
   ...:     'referer': "http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/busqueda.cmc",
   ...:     'accept-encoding': "gzip, deflate",
   ...:     'accept-language': "en-US,en;q=0.9,ru;q=0.8,uk;q=0.7",
   ...:     'cookie': "PHPSESSID=2ba8dsre6l42un95qu33k09ud6",
   ...:     'cache-control': "no-cache",
   ...: }
   ...:
In [2]: fetch(url, headers=headers)
2018-08-23 14:47:13 [scrapy.core.engine] INFO: Spider opened
2018-08-23 14:47:13 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/index.cmc> from <GET http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/detalleExpedienteTitulo.cmc>
2018-08-23 14:47:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/index.cmc> (referer: http://www.cmc.gov.co:8080/CmcFrontEnd/consulta/busqueda.cmc)
It turns out I was missing a POST request at the very beginning. That POST request sets up the correct session ID, which has to be generated anew for each search.