python-2.7 scrapy urllib2

Crawl a webpage where the input comes from a textbox


I want to crawl this site: https://egov.uscis.gov/casestatus/landing.do

My aim is to write a Python script that alerts me as soon as the status on this webpage changes after I enter the receipt number.

I have never done this before, but I did some reading here: some people recommend urllib2 and others Scrapy. I have only a very basic understanding of how this works.

But here is my problem:

When I enter a receipt number, the URL of the webpage does not change after submission. Looking at the page source, I can see where the receipt number needs to be entered:

<input id="receipt_number" name="appReceiptNum" class="form-control textbox  initial-focus" maxlength="13" type="text">

How do I pass this receipt number into urllib2, Scrapy, or any other method? An example of a receipt number is EAC1590674053.

Any pointers greatly appreciated.


Solution

  • The website uses a form, so you need to have Scrapy fill out the fields and submit the form. I've put together some code to show how that can be done with Scrapy:

    import scrapy
    
    class TestSpider(scrapy.Spider):
    
        name = 'casestatus'
        start_urls = ['https://egov.uscis.gov/casestatus/landing.do']
    
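        def parse(self, response):
            # Fill in the receipt number field of the landing page's form and submit it.
            request = scrapy.FormRequest.from_response(
                response,
                formname='caseStatusForm',
                formdata={'appReceiptNum': 'EAC1590674053'},
                callback=self.parse_caseStatus
            )
            # Show the URL-encoded form data that will be POSTed.
            print(request.body)
            yield request
    
        def parse_caseStatus(self, response):
            # Look for the element on the result page whose class contains "current-status".
            sel_current_status = response.xpath('//div[contains(@class,"current-status")]')
            if sel_current_status:
                txt_current_status = sel_current_status.xpath('./text()').extract()
                # Strip each text node and join them (works on Python 2 and 3).
                txt_current_status = " ".join(t.strip() for t in txt_current_status)
                print(txt_current_status)
            else:
                print('NO STATUS FOUND')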
    
    # YIELDS THE FOLLOWING OUTPUT FOR ME:
    # [casestatus] DEBUG: Crawled (200) <POST https://egov.uscis.gov/casestatus/mycasestatus.do;jsessionid=A19A03FC933A208A2DDF89D98BE9F32E> (referer: https://egov.uscis.gov/casestatus/landing.do)
    # Case Rejected Because I Sent An Incorrect Fee
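
The question also asks to be alerted when the status changes, which the spider above does not cover. One possible approach, sketched here as an assumption and not as part of the original answer, is to save the last status text to a file and compare it on each run. The helper name `status_changed` and the state file path are hypothetical; the function expects text (unicode) strings such as the ones Scrapy's `extract()` returns.

```python
import io
import os


def status_changed(new_status, state_path):
    """Return True if new_status differs from the status saved in state_path.

    Hypothetical helper: the saved status is overwritten whenever a change
    is detected. The very first call (no state file yet) also returns True,
    because there is nothing to compare against.
    """
    old_status = None
    if os.path.exists(state_path):
        with io.open(state_path, encoding='utf-8') as f:
            old_status = f.read()

    if new_status == old_status:
        return False

    # Remember the new status so the next run compares against it.
    with io.open(state_path, 'w', encoding='utf-8') as f:
        f.write(new_status)
    return True
```

In `parse_caseStatus` this could be called as `status_changed(txt_current_status, 'last_status.txt')`, sending whatever notification you prefer when it returns True. Running the spider on a schedule (for example with cron) would then take care of the polling.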