Scrapy xpath returns empty list in scrapy shell


I am trying to scrape the startup articles on this page in the Scrapy shell with the following XPath expression:

n = response.xpath('//article[contains(@class, "post-block post-block--image")]/header/h2/a/text()').getall()

n
[]

The command returns an empty list, although the same XPath

//article[contains(@class, "post-block post-block--image")]/header/h2/a/text()

matches 18 articles in the Chrome inspector. How do I get the articles in the Scrapy shell?


Solution

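  • The XPath most likely comes back empty because the article list is filled in by JavaScript after the page loads, so it is not in the HTML that Scrapy downloads. You can fetch the same data as JSON from the site's API instead: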

    scrapy shell
    
    In [1]: url = 'https://techcrunch.com/wp-json/tc/v1/magazine?page=1&_embed=true&_envelope=true&categories=20429&cachePrevention=0'
    
    In [2]: headers = {
       ...: "Accept": "*/*",
       ...: "Accept-Encoding": "gzip, deflate, br",
       ...: "Accept-Language": "en-US,en;q=0.5",
       ...: "Cache-Control": "no-cache",
       ...: "Connection": "keep-alive",
       ...: "Content-Type": "application/json; charset=utf-8",
       ...: "DNT": "1",
       ...: "Host": "techcrunch.com",
       ...: "Pragma": "no-cache",
       ...: "Referer": "https://techcrunch.com/startups/",
       ...: "Sec-Fetch-Dest": "empty",
       ...: "Sec-Fetch-Mode": "cors",
       ...: "Sec-Fetch-Site": "same-origin",
       ...: "Sec-GPC": "1",
       ...: "TE": "trailers",
       ...: "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
       ...: "X-KL-Ajax-Request": "Ajax_Request",
       ...: "X-TC-EC-Auth-Token": "",
       ...: "X-TC-UUID": ""
       ...: }
    
    In [3]: req = scrapy.Request(url=url, headers=headers)
    
    In [4]: fetch(req)
    [scrapy.core.engine] INFO: Spider opened
    [scrapy.core.engine] DEBUG: Crawled (200) <GET https://techcrunch.com/wp-json/tc/v1/magazine?page=1&_embed=true&_envelope=true&categories=20429&cachePrevention=0> (referer: https://techcrunch.com/startups/)
    
    In [5]: view(response)
    Out[5]: True
    
    In [6]: body = response.json()['body']
    
    In [7]: for b in body:
       ...:     print(b['slug'])
       ...:
    how-to-claim-a-student-discount-for-techcrunch
    chimes-chris-britt-and-menlo-ventures-shawn-carolan-to-talk-fintech-on-techcrunch-live
    graphwear-closes-20-5m-series-b-for-a-needle-free-nanotech-powered-glucose-monitor
    investors-share-how-infrastructure-as-code-is-taking-over-devops
    informaticas-ipo-will-test-public-markets-appetite-for-slower-growing-tech-offerings
    index-sequoia-and-canvas-investors-weigh-in-on-how-to-raise-your-first-dollars
    lawpath-gets-7-5m-aud-to-become-the-asia-pacifics-legalzoom
    equity-monday-byjus-raises-more-money-somehow-as-tech-stocks-fall
    stories-as-a-service-storyteller-lets-anyone-add-stories-to-their-own-apps-or-website
    rich-and-worried-about-the-world-put-your-money-where-your-concern-is
    made-of-air-a-maker-of-carbon-negative-thermoplastics-locks-in-5-8m
    insurtech-stable-raises-46-5m-in-greycroft-led-round-to-help-businesses-manage-volatile-commodity-prices
    as-apple-messes-with-attribution-what-does-growth-marketing-look-like-in-2021
    yc-grads-wasp-land-1-5m-seed-to-help-developers-build-web-apps-faster
    ladder-raises-100m-on-a-900m-valuation-for-a-platform-selling-flexible-term-life-insurance
    devops-market-demand-drives-quick-series-c-turnaround-for-esper
    elevate-launches-its-approach-to-managing-pre-tax-benefits-with-12m-series-a
    to-the-market-takes-on-funding-to-create-ethical-sustainable-work-environments-for-women
    indian-edtech-giant-byjus-valued-at-18-billion-in-new-funding
    komunidad-a-philippines-based-environmental-intelligence-platform-lands-seed-round
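
If you want to reuse this outside a one-off shell session, the two steps above (build the endpoint URL, then read the slugs out of the `body` list) can be wrapped in small helpers. This is only a sketch: `build_url` and `extract_slugs` are names made up for illustration, the query parameters are copied from the session above, and it assumes the response keeps the shape shown there (a dict whose `body` key holds a list of posts, each with a `slug`). Whether other `page` values or category IDs behave the same way is an assumption I have not verified.

```python
from urllib.parse import urlencode

# Endpoint taken from the shell session above.
BASE_URL = "https://techcrunch.com/wp-json/tc/v1/magazine"


def build_url(page, category=20429):
    """Return the endpoint URL for one page of a category (hypothetical helper)."""
    params = {
        "page": page,
        "_embed": "true",
        "_envelope": "true",
        "categories": category,
        "cachePrevention": 0,
    }
    # urlencode keeps the dict's insertion order, so the query string
    # matches the one used in the session above.
    return f"{BASE_URL}?{urlencode(params)}"


def extract_slugs(envelope):
    """Return the slug of every post in a decoded response envelope.

    Assumes the envelope is a dict with a 'body' list of post dicts.
    A missing 'body' key yields an empty list instead of raising.
    """
    return [post["slug"] for post in envelope.get("body", [])]
```

In the Scrapy shell this would be used as `fetch(build_url(1))` followed by `extract_slugs(response.json())`. If the plain request is rejected, pass the headers from the session above through `scrapy.Request` as before.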