
Scrapy POST to a JavaScript-generated form using Splash


I have the following spider that's pretty much just supposed to POST to a form. I can't seem to get it to work though. The expected response never shows up when I do it through Scrapy. Could someone tell me where I'm going wrong with this?

Here's my spider code:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import scrapy
from scrapy.http import FormRequest
from scrapy.shell import inspect_response


class RajasthanSpider(scrapy.Spider):
    name = "rajasthan"
    allowed_domains = ["rajtax.gov.in"]
    start_urls = (
        'http://www.rajtax.gov.in/',
    )

    def parse(self, response):
        return FormRequest.from_response(
            response,
            formname='rightMenuForm',
            formdata={'dispatch': 'dealerSearch'},
            callback=self.dealer_search_page)

    def dealer_search_page(self, response):

        yield FormRequest.from_response(
            response,
            formname='dealerSearchForm',
            formdata={
                "zone": "select",
                "dealertype": "VAT",
                "dealerSearchBy": "dealername",
                "name": "ana"
            }, callback=self.process)

    def process(self, response):
        inspect_response(response, self)

What I get is a response like this: No result Found

What I should be getting is a result like this: Results Found

When I replace my dealer_search_page() with a Splash-enabled version like this:

def dealer_search_page(self, response):

    yield FormRequest.from_response(
        response,
        formname='dealerSearchForm',
        formdata={
            "zone": "select",
            "dealertype": "VAT",
            "dealerSearchBy": "dealername",
            "name": "ana"
        },
        callback=self.process,
        meta={
            'splash': {
                'endpoint': 'render.html',
                'args': {'wait': 0.5}
            }
        })

I get the following warning:

2016-03-14 15:01:29 [scrapy] WARNING: Currently only GET requests are supported by SplashMiddleware; <POST http://rajtax.gov.in:80/vatweb/dealerSearch.do> will be handled without Splash

And the program exits before it reaches the inspect_response() call in my process() function.

The warning says that Splash doesn't support POST requests yet. Will Splash work for this use case, or should I be using Selenium?


Solution

  • You can approach it with Selenium. Here is a complete working example where we submit the form with the same search parameters as in your Scrapy code and print the results to the console:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.select import Select
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    
    driver = webdriver.Firefox()
    driver.get("http://www.rajtax.gov.in/")
    
    # accept the alert
    driver.switch_to.alert.accept()
    
    # open "Search for Dealers"
    wait = WebDriverWait(driver, 10)
    search_for_dealers = wait.until(EC.visibility_of_element_located((By.PARTIAL_LINK_TEXT, "Search for Dealers")))
    search_for_dealers.click()
    
    # set search parameters
    dealer_type = Select(driver.find_element_by_name("dealertype"))
    dealer_type.select_by_visible_text("VAT")
    
    search_by = Select(driver.find_element_by_name("dealerSearchBy"))
    search_by.select_by_visible_text("Dealer Name")
    
    search_criteria = driver.find_element_by_name("name")
    search_criteria.send_keys("ana")
    
    # search
    driver.find_element_by_css_selector("table.vattabl input.submit").click()
    
    # wait for and print results
    table = wait.until(EC.visibility_of_element_located((By.XPATH, "//table[@class='pagebody']/following-sibling::table")))
    
    for row in table.find_elements_by_css_selector("tr")[1:]:  # skipping header row
        print(row.find_elements_by_tag_name("td")[1].text)
    

    This prints the TIN numbers from the search results table:

    08502557052
    08451314461
    ...
    08734200736
    

    Note that the browser you automate with Selenium can also be headless: PhantomJS, or a regular browser running on a virtual display.
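
    For illustration only (this snippet is not part of the original answer), here is a hedged sketch of both headless setups; it assumes a phantomjs binary on the PATH for the first option, and the pyvirtualdisplay package plus Xvfb for the second:

    from selenium import webdriver

    # Option 1: PhantomJS, a headless WebKit browser
    # (assumes the phantomjs binary is available on the PATH).
    driver = webdriver.PhantomJS()

    # Option 2: a regular Firefox running on a virtual display
    # (assumes pyvirtualdisplay and Xvfb are installed).
    from pyvirtualdisplay import Display

    display = Display(visible=0, size=(1280, 1024))
    display.start()
    driver = webdriver.Firefox()

    # ...run the same search steps as in the script above, then clean up:
    driver.quit()
    display.stop()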


    Answering the initial question (before the edit):

    What I see on the Dealer Search page is that the form and its fields are constructed by a bunch of JavaScript executed in the browser. Scrapy cannot execute JavaScript, so you need to help it with that part. I am pretty sure Scrapy + Splash would be enough in this case and you would not need to go into full browser automation. Here is a working example of using Scrapy with Splash:
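
    (The answer's original snippet is not reproduced in this excerpt. Below is a minimal sketch of what such a Scrapy + Splash spider might look like; it assumes a Splash instance is running and the scrapyjs/scrapy-splash downloader middleware is enabled in settings.py, e.g. SPLASH_URL = 'http://localhost:8050'. Splash is used only on GET requests here because, as the warning above shows, this version of the middleware does not render POSTs.)

    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals

    import scrapy
    from scrapy.http import FormRequest


    class RajasthanSplashSpider(scrapy.Spider):
        name = "rajasthan_splash"
        allowed_domains = ["rajtax.gov.in"]
        start_urls = (
            'http://www.rajtax.gov.in/',
        )

        def start_requests(self):
            # Render the landing page through Splash's render.html endpoint
            # so that the JavaScript-built menu and forms end up in the HTML.
            for url in self.start_urls:
                yield scrapy.Request(
                    url,
                    callback=self.parse,
                    meta={'splash': {'endpoint': 'render.html',
                                     'args': {'wait': 0.5}}})

        def parse(self, response):
            # Follow the "Search for Dealers" navigation to the search page,
            # again rendering the result through Splash if it is fetched with
            # GET (POSTs are passed through un-rendered, as the warning shows).
            return FormRequest.from_response(
                response,
                formname='rightMenuForm',
                formdata={'dispatch': 'dealerSearch'},
                callback=self.dealer_search_page,
                meta={'splash': {'endpoint': 'render.html',
                                 'args': {'wait': 0.5}}})

        def dealer_search_page(self, response):
            # Submit the search itself as a plain (non-Splash) POST; the
            # results page is ordinary HTML that Scrapy can parse directly.
            yield FormRequest.from_response(
                response,
                formname='dealerSearchForm',
                formdata={
                    "zone": "select",
                    "dealertype": "VAT",
                    "dealerSearchBy": "dealername",
                    "name": "ana",
                },
                callback=self.process)

        def process(self, response):
            # Pull the TIN numbers out of the results table, using the same
            # XPath as in the Selenium example above (second column, header
            # row skipped).
            rows = response.xpath(
                "//table[@class='pagebody']/following-sibling::table//tr")
            for row in rows[1:]:
                tin = row.xpath("./td[2]//text()").extract_first()
                if tin:
                    yield {'tin': tin.strip()}

    Whether rendering alone is enough depends on whether the final search submission itself needs JavaScript; if it does, the Selenium approach above is the safer route.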