Tags: python, scrapy

Scrapy's LinkExtractor does not find all links on webpage


Currently I am trying to program my own web crawler for real estate search in Vienna. For that I use the CrawlSpider class from Scrapy. Using Scrapy's shell I found out that the LinkExtractor does not find all links on the webpage. How could I solve that?

Thanks, jamfleck

PS: Sorry if my explanation lacks details, I am quite new to Stack Overflow.

### I open the shell in the Anaconda command prompt.
### The URL is an overview page of several house listings.
scrapy shell https://www.willhaben.at/iad/immobilien/haus-kaufen/wien 

### Import of the scrapy linkextractor:
from scrapy.linkextractors import LinkExtractor

### I create the linkextractor to search for listings.
### I found out that all listings have a 'd/haus-kaufen/wien' in their url
le = LinkExtractor(allow='d/haus-kaufen/wien')

### Extract all the links:
links = le.extract_links(response)

### Print out the links
for l in links:
    print(l.url)

Output:

https://www.willhaben.at/iad/immobilien/d/haus-kaufen/wien/wien-1110-simmering/einfamilienhaus-pool-140m2-keller-und-ca-243m2-garten-kamin-fussbodenheizung-2018-2019-eg-saniert-in-wenigen-minuten-simmering-u3-601664118/
https://www.willhaben.at/iad/immobilien/d/haus-kaufen/wien/wien-1220-donaustadt/tausch-reihenhaus-gegen-wohnung-663241383/
https://www.willhaben.at/iad/immobilien/d/haus-kaufen/wien/wien-1170-hernals/knusperhaeuschen-im-gruenen-diverse-obstbaeume-seele-baumeln-lassen-667128712/
https://www.willhaben.at/iad/immobilien/d/haus-kaufen/wien/wien-1130-hietzing/sonniges-einfamilienhaus-aus-familienbesitz-in-1130-wien-667121768/

In total only 4 links appear, but when I look at the webpage there should be many more, around 20.


Solution

  • Modern websites often use JavaScript to dynamically load content that isn't needed immediately. This speeds up the initial page load, but most scrapers cannot execute JS, so this content is invisible to them.

    The website you linked does exactly that. If you visit the page with a JavaScript blocker such as NoScript, only the first 4-5 listings are visible; the rest is fetched via JavaScript.

    Scrapy's website has a writeup about workarounds for this issue.
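    A quick way to verify what Scrapy actually receives is to inspect the raw response in the same shell session. This is just a sketch using the built-in shell helpers view() and response.text; the search string is the listing path from your LinkExtractor:

    # Open the downloaded HTML in your browser, without any JS executed:
    view(response)

    # Or check how often the listing path occurs in the raw HTML:
    response.text.count('d/haus-kaufen/wien')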

    The interesting thing about this specific site is that it does in fact deliver all entries with the initial request, and only uses JS to build the UI afterwards.

    You can see it in the raw website HTML:

    <script type="application/ld+json">
    {
        "@context": "https://schema.org",
        "@type": "ItemList",
        "itemListElement": [
            {
                "@type": "ListItem",
                "position": 0,
                "url": "/iad/immobilien/d/haus-kaufen/wien/wien-1110-simmering/kleingarten-in-simmering-667272230/"
            },
            ...
        ],
        "numberOfItems": 1236,
        "name": "Haus kaufen in Wien - willhaben",
        "description": "Haus kaufen oder verkaufen in Wien, finden Sie Ihr Einfamilienhaus, Reihenhaus unter 13.382 Häusern auf willhaben"
    }
    </script>
                                        
    

    So in this specific case, you can simply extract this JSON object from the HTML to get the results of the search request.

    A possible XPath: //*[@id="skip-to-content"]/div/script
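    A minimal sketch of that extraction, reusing the shell session from above (the added /text() step and the key names are taken from the JSON shown; adjust them if the markup changes):

    import json

    # Grab the JSON-LD block via the XPath above and parse it:
    raw = response.xpath('//*[@id="skip-to-content"]/div/script/text()').get()
    data = json.loads(raw)

    # Build absolute listing URLs from the relative paths in "itemListElement":
    urls = [response.urljoin(item['url']) for item in data['itemListElement']]
    print(len(urls), urls[:3])  # should list far more than the 4 links found before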

    Hope that helps!