
Scrapy POST to a JavaScript-generated form using Splash


I have the following spider that's pretty much just supposed to POST to a form. I can't seem to get it to work though. The expected response never shows up when I do it through Scrapy. Could someone tell me where I'm going wrong with this?

Here's my spider code:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import scrapy
from scrapy.http import FormRequest
from scrapy.shell import inspect_response


class RajasthanSpider(scrapy.Spider):
    name = "rajasthan"
    allowed_domains = ["rajtax.gov.in"]
    start_urls = (
        'http://www.rajtax.gov.in/',
    )

    def parse(self, response):
        return FormRequest.from_response(
            response,
            formname='rightMenuForm',
            formdata={'dispatch': 'dealerSearch'},
            callback=self.dealer_search_page)

    def dealer_search_page(self, response):

        yield FormRequest.from_response(
            response,
            formname='dealerSearchForm',
            formdata={
                "zone": "select",
                "dealertype": "VAT",
                "dealerSearchBy": "dealername",
                "name": "ana"
            }, callback=self.process)

    def process(self, response):
        inspect_response(response, self)

What I get is a response like this: No result Found

What I should be getting is a result like this: Results Found

When I replace my dealer_search_page() with a Splash-enabled version like this:

def dealer_search_page(self, response):

    yield FormRequest.from_response(
        response,
        formname='dealerSearchForm',
        formdata={
            "zone": "select",
            "dealertype": "VAT",
            "dealerSearchBy": "dealername",
            "name": "ana"
        },
        callback=self.process,
        meta={
            'splash': {
                'endpoint': 'render.html',
                'args': {'wait': 0.5}
            }
        })

I get the following warning:

2016-03-14 15:01:29 [scrapy] WARNING: Currently only GET requests are supported by SplashMiddleware; <POST http://rajtax.gov.in:80/vatweb/dealerSearch.do> will be handled without Splash

And the program exits before it reaches the inspect_response() call in my process() function.

The warning says that Splash doesn't support POST requests yet. Will Splash work for this use case, or should I be using Selenium?


Solution

  • You can approach it with Selenium. Here is a complete working example where we submit the form with the same search parameters as in your Scrapy code and print the results to the console:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.select import Select
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    
    driver = webdriver.Firefox()
    driver.get("http://www.rajtax.gov.in/")
    
    # accept the alert
    driver.switch_to.alert.accept()
    
    # open "Search for Dealers"
    wait = WebDriverWait(driver, 10)
    search_for_dealers = wait.until(EC.visibility_of_element_located((By.PARTIAL_LINK_TEXT, "Search for Dealers")))
    search_for_dealers.click()
    
    # set search parameters
    dealer_type = Select(driver.find_element_by_name("dealertype"))
    dealer_type.select_by_visible_text("VAT")
    
    search_by = Select(driver.find_element_by_name("dealerSearchBy"))
    search_by.select_by_visible_text("Dealer Name")
    
    search_criteria = driver.find_element_by_name("name")
    search_criteria.send_keys("ana")
    
    # search
    driver.find_element_by_css_selector("table.vattabl input.submit").click()
    
    # wait for and print results
    table = wait.until(EC.visibility_of_element_located((By.XPATH, "//table[@class='pagebody']/following-sibling::table")))
    
    for row in table.find_elements_by_css_selector("tr")[1:]:  # skipping header row
        print(row.find_elements_by_tag_name("td")[1].text)
    

    This prints the TIN numbers from the search results table:

    08502557052
    08451314461
    ...
    08734200736
    

    Note that the browser you automate with Selenium can also be headless: PhantomJS, or a regular browser running on a virtual display.
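
    For illustration only (this snippet is not part of the original answer), here is a hedged sketch of both headless setups; it assumes a phantomjs binary on the PATH for the first option, and the pyvirtualdisplay package plus Xvfb for the second:

    from selenium import webdriver

    # Option 1: PhantomJS, a headless WebKit browser
    # (assumes the phantomjs binary is available on the PATH).
    driver = webdriver.PhantomJS()

    # Option 2: a regular Firefox running on a virtual display
    # (assumes pyvirtualdisplay and Xvfb are installed).
    from pyvirtualdisplay import Display

    display = Display(visible=0, size=(1280, 1024))
    display.start()
    driver = webdriver.Firefox()

    # ...run the same search steps as in the script above, then clean up:
    driver.quit()
    display.stop()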


    Answering the initial question (before the edit):

    What I see on the Dealer Search page is that the form and its fields are constructed by a bunch of JavaScript executed in the browser. Scrapy cannot execute JavaScript, so you need to help it with that part. I am pretty sure Scrapy + Splash would be enough in this case and you would not need to go into full browser automation. Here is a working example of using Scrapy with Splash:
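
    (The answer's original snippet is not reproduced in this excerpt. Below is a minimal sketch of what such a Scrapy + Splash spider might look like; it assumes a Splash instance is running and the scrapyjs/scrapy-splash downloader middleware is enabled in settings.py, e.g. SPLASH_URL = 'http://localhost:8050'. Splash is used only on GET requests here because, as the warning above shows, this version of the middleware does not render POSTs.)

    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals

    import scrapy
    from scrapy.http import FormRequest


    class RajasthanSplashSpider(scrapy.Spider):
        name = "rajasthan_splash"
        allowed_domains = ["rajtax.gov.in"]
        start_urls = (
            'http://www.rajtax.gov.in/',
        )

        def start_requests(self):
            # Render the landing page through Splash's render.html endpoint
            # so that the JavaScript-built menu and forms end up in the HTML.
            for url in self.start_urls:
                yield scrapy.Request(
                    url,
                    callback=self.parse,
                    meta={'splash': {'endpoint': 'render.html',
                                     'args': {'wait': 0.5}}})

        def parse(self, response):
            # Follow the "Search for Dealers" navigation to the search page,
            # again rendering the result through Splash if it is fetched with
            # GET (POSTs are passed through un-rendered, as the warning shows).
            return FormRequest.from_response(
                response,
                formname='rightMenuForm',
                formdata={'dispatch': 'dealerSearch'},
                callback=self.dealer_search_page,
                meta={'splash': {'endpoint': 'render.html',
                                 'args': {'wait': 0.5}}})

        def dealer_search_page(self, response):
            # Submit the search itself as a plain (non-Splash) POST; the
            # results page is ordinary HTML that Scrapy can parse directly.
            yield FormRequest.from_response(
                response,
                formname='dealerSearchForm',
                formdata={
                    "zone": "select",
                    "dealertype": "VAT",
                    "dealerSearchBy": "dealername",
                    "name": "ana",
                },
                callback=self.process)

        def process(self, response):
            # Pull the TIN numbers out of the results table, using the same
            # XPath as in the Selenium example above (second column, header
            # row skipped).
            rows = response.xpath(
                "//table[@class='pagebody']/following-sibling::table//tr")
            for row in rows[1:]:
                tin = row.xpath("./td[2]//text()").extract_first()
                if tin:
                    yield {'tin': tin.strip()}

    Whether rendering alone is enough depends on whether the final search submission itself needs JavaScript; if it does, the Selenium approach above is the safer route.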