
Scrapy FormRequest for a complicated payload


On a website listing lawyers' work details, I'm trying to scrape information through this four-step algorithm, which requires two FormRequests:

  1. Access the page containing the search box that submits the lawyer-name query (image1) ("ali" is passed as the name inquiry)
  2. Make the search request with the payload through FormRequest, thereby reaching the page listing the lawyers found (image2)
  3. Click each magnifying-glass button in turn via a second FormRequest to reach each lawyer's details page (image3) (ERROR OCCURS HERE)
  4. Parse each lawyer's data points indicated in image3

PROBLEM: My first FormRequest works, so I can reach the list of lawyers. Then I hit two problems:

  1. Problem 1: my for loop only yields data for the first lawyer found.
  2. Problem 2: the second FormRequest simply doesn't work.

My insight: inspecting the payload needed for the second FormRequest for each lawyer, I see that the value numbers of all lawyers are added to the payload in bulk, along with the index number of the requested lawyer.

Am I really supposed to pass all the values with each request? How can I send the correct payload? In my code I tried sending only that particular lawyer's value and index as the payload, but it didn't work. What kind of code should I use to get the details of every lawyer in the list?

(Screenshots: image1 — search form, image2 — results list, image3 — lawyer details page)

import scrapy
import json
import logging
from scrapy.utils.response import open_in_browser


class BarolevSpider(scrapy.Spider):
    name = 'barolev'
    allowed_domains = ['www.barobirlik.org.tr']
    start_urls = ['https://www.barobirlik.org.tr/levhaiframe/342']

    headers = {
            'authority': 'www.barobirlik.org.tr',
            'path': '/Islemler/LevhaIFrame',
            'scheme': 'https',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'en-US,en;q=0.9',
            'cache-control': 'max-age=0',
            'content-type': 'application/x-www-form-urlencoded',
            'origin': 'https://www.barobirlik.org.tr',
            'referer': 'https://www.barobirlik.org.tr/levhaiframe/342',
            'sec-fetch-dest': 'iframe',
            'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
                }
    url = 'https://www.barobirlik.org.tr/Islemler/LevhaIFrame'

    def parse(self, response):

        #search inquiry body
        body = {
            'ad': 'ali',
            'soyad': '',
            'baroId': '342',
            'tip': 'Avukat',
                }
        logging.log(logging.INFO, "Search page opened")
        search = scrapy.FormRequest(self.url, callback = self.list_parser, method = "POST", headers=self.headers, formdata = body)

        yield search

    def list_parser(self, response):

        #open_in_browser(response)
        logging.log(logging.INFO, "Lawyers list's page opened")

        data = response.xpath("//div/table/tbody")
        lawyer_values = response.xpath("//input[@value]/text()")

        for counter, lawyer in enumerate(data):
            lic_number= response.xpath("//tbody/tr/td[3]/text()").get() #lawyer's licence number
            name = response.xpath(".//tbody/tr/td[4]/text()").get()
            surname = response.xpath("//tbody/tr/td[5]/text()").get()
            value = response.xpath(".//tbody/tr/td/input/@value").get()
            payload = {
                        counter : value,
                        'detay' : counter
                        }


            yield scrapy.FormRequest(self.url, callback= self.lawyer_parser, method = "POST", headers =self.headers, formdata = payload )




    def lawyer_parser(self, response):
        open_in_browser(response)
        pass

Scrapy logs:

2021-12-27 02:19:23 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: baro2)
2021-12-27 02:19:23 [scrapy.utils.log] INFO: Versions: lxml 4.6.4.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.9.9 (main, Nov 16 2021, 10:24:31) - [GCC 11.2.0], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.4.8, Platform Linux-5.14.0-kali4-amd64-x86_64-with-glibc2.32
2021-12-27 02:19:23 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2021-12-27 02:19:23 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'baro2',
 'NEWSPIDER_MODULE': 'baro2.spiders',
 'SPIDER_MODULES': ['baro2.spiders']}
2021-12-27 02:19:23 [scrapy.extensions.telnet] INFO: Telnet Password: 05273ea1c26c6378
2021-12-27 02:19:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2021-12-27 02:19:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-12-27 02:19:23 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-12-27 02:19:23 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-12-27 02:19:23 [scrapy.core.engine] INFO: Spider opened
2021-12-27 02:19:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-12-27 02:19:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-12-27 02:19:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.barobirlik.org.tr/levhaiframe/342> (referer: None)
2021-12-27 02:19:24 [root] INFO: Search page opened
2021-12-27 02:19:26 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.barobirlik.org.tr/levhaiframe/342> (referer: https://www.barobirlik.org.tr/levhaiframe/342)
2021-12-27 02:19:26 [root] INFO: Lawyers list's page opened
2021-12-27 02:19:26 [scrapy.core.scraper] ERROR: Spider error processing <POST https://www.barobirlik.org.tr/levhaiframe/342> (referer: https://www.barobirlik.org.tr/levhaiframe/342)
Traceback (most recent call last):
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/utils/defer.py", line 120, in iter_errback
    yield next(it)
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/core/spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/core/spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/spidermiddlewares/referer.py", line 342, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/core/spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/spidermiddlewares/urllength.py", line 40, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/core/spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/core/spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "/home/draco/scraping/scrapyyy/baro2/baro2/spiders/barolev.py", line 61, in list_parser
    yield scrapy.FormRequest(self.url, callback= self.lawyer_parser, method = "POST", headers =self.headers, formdata = payload )
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/http/request/form.py", line 31, in __init__
    querystr = _urlencode(items, self.encoding)
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/http/request/form.py", line 71, in _urlencode
    values = [(to_bytes(k, enc), to_bytes(v, enc))
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/http/request/form.py", line 71, in <listcomp>
    values = [(to_bytes(k, enc), to_bytes(v, enc))
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/utils/python.py", line 106, in to_bytes
    raise TypeError('to_bytes must receive a str or bytes '
TypeError: to_bytes must receive a str or bytes object, got int
2021-12-27 02:19:26 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-27 02:19:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 943,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 8193,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 2.788149,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 12, 26, 22, 19, 26, 603568),
 'httpcompression/response_bytes': 77405,
 'httpcompression/response_count': 2,
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 12,
 'memusage/max': 58245120,
 'memusage/startup': 58245120,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'spider_exceptions/TypeError': 1,
 'start_time': datetime.datetime(2021, 12, 26, 22, 19, 23, 815419)}
2021-12-27 02:19:26 [scrapy.core.engine] INFO: Spider closed (finished)

Thanks in advance.


Solution

  • The website uses some kind of bot protection: this code works sometimes, and once it's detected you'll have to wait a while until their anti-bot measures reset, or use proxies instead:
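
    For context on the original traceback: FormRequest urlencodes formdata via to_bytes, which accepts only str or bytes, so the question's payload with int keys and values ({counter: value, 'detay': counter}) raises the TypeError seen in the logs. A minimal sketch of that shape fixed (the sample value is made up for illustration):

    ```python
    counter = 1
    value = "22300955972"  # hypothetical profile id scraped from the page

    # int keys/values would crash FormRequest's urlencoding:
    #   TypeError: to_bytes must receive a str or bytes object, got int
    bad_payload = {counter: value, "detay": counter}

    # casting everything to str yields a valid formdata dict
    payload = {str(counter): value, "detay": str(counter)}
    ```

    That alone fixes the crash, but the site also expects every lawyer's id in the body, not just one, which the code below handles.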

    Import this:

    from re import sub
    

    Here is the main code:

    def list_parser(self, response):
        logging.log(logging.INFO, "Lawyers list's page opened")

        # collect one {button index: profile id} pair per row of the results table
        post_dict = []
        for x in response.xpath("//div[@class='table-responsive']/table//tbody/tr"):
            counter_val = x.xpath("./td[1]/button[@name='detay']/@value").get()
            profile_id = x.xpath(f"./td[1]/input[@name='{counter_val}']/@value").get()
            post_dict.append({counter_val: profile_id})
    
        # the site expects ALL ids in the body, so build the full querystring once
        post_str = "detay=1"
        for x in post_dict:
            _k = list(x.keys())[0]
            _v = list(x.values())[0]
            post_str += f"&{_k}={_v}"
    
        # issue one request per lawyer, swapping only the 'detay' index each time
        counter = 0
        for x in post_dict:
            counter += 1
            new_post_str = sub(r"detay=(\d+)", f"detay={counter}", post_str)
            yield scrapy.Request(self.url, callback=self.lawyer_parser, method="POST", headers=self.headers, body=new_post_str)
    
    def lawyer_parser(self, response):
        title = response.xpath("//title").get()
        print(title)
    
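
    The body-building loop can be sketched and tested in isolation; the two profile ids below are made up for illustration:

    ```python
    from re import sub

    def build_post_bodies(rows):
        """Build one urlencoded body per lawyer, mirroring the approach above.

        rows is a list of (index, profile_id) string pairs scraped from the
        results table; every id stays in the body, only 'detay' changes.
        """
        base = "detay=1" + "".join(f"&{k}={v}" for k, v in rows)
        # swap the detay index for each lawyer while keeping the full id list
        return [sub(r"detay=\d+", f"detay={i}", base) for i in range(1, len(rows) + 1)]

    rows = [("1", "22300955972"), ("2", "34294103512")]  # hypothetical ids
    print(build_post_bodies(rows)[1])
    # detay=2&1=22300955972&2=34294103512
    ```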

    Here is how the request looks in Postman:

    POST https://www.barobirlik.org.tr/Islemler/LevhaIFrame
    content-type: application/x-www-form-urlencoded
    user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36
    
    detay=1&1=22300955972&2=34294103512&3=47650810900&4=31457011912&5=21989082740&6=13031762182&7=15098226362&8=29465540500&9=19705078726&10=52327454912&11=38077312146&12=15053874294&13=43504737980&14=15757111312&15=49528751810&16=22454512274&17=20735008140&18=14759958840&19=25331632218&20=23717241030&21=26188728710&22=22144254712&23=40831499468&24=45721682982&25=17524950834&26=20443640946&27=43375067858&28=25958522862&29=55642518444&30=15958205476&31=67798055186&32=53125619114&33=19700638734&34=47962570370&35=61324031392&36=62770087018&37=42020121822&38=51916778712&39=63445269098&40=20968442094&41=37150116442&42=19825577734&43=34690100306&44=61117020312&45=47005726450&46=32674457828&47=28318970648&48=46675921516&49=43891755192&50=63547258662&51=14786321440&52=49882225538&53=15026163486&54=71761039790&55=19769136608&56=13592195252&57=26261321722&58=67759140784&59=15443072422&60=14358051782&61=35387076706&62=31600743554&63=29132448610&64=50134148826&65=13741738194&66=46951136320&67=48070816756&68=40531941366&69=25712046412&70=50041757234&71=34792727562&72=31373015384&73=19936276632&74=11726671214&75=50752736252&76=10825141642&77=41269955922&78=22703604262&79=16439033762&80=38872658142&81=22313706936&82=20122397994&83=22807774748&84=25522741996&85=23408214992&86=12125473522&87=34900898696&88=51511402166&89=57145071426&90=27010985158&91=43045344210&92=15517773668&93=27065110338
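
    The same body can also be produced from a plain str-to-str dict with urllib's urlencode, which is essentially what FormRequest does internally with formdata (the ids below are taken from the Postman body above, truncated to two entries):

    ```python
    from urllib.parse import urlencode

    # str-keyed, str-valued dict: exactly the shape FormRequest's formdata needs
    payload = {"detay": "1", "1": "22300955972", "2": "34294103512"}
    print(urlencode(payload))
    # detay=1&1=22300955972&2=34294103512
    ```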