python web-scraping beautifulsoup python-requests apify

BeautifulSoup - html.parser returns dots instead of the string visible on the web page


I'm trying to get the number of actors from https://apify.com/store, which appears in the following HTML:

<div class="ActorStore-statusNbHits">
<span class="ActorStore-statusNbHitsNumber">895</span>results</div>

When I send a GET request and parse the response with BeautifulSoup using:

r = requests.get(base_url)
soup = BeautifulSoup(r.text, "html.parser")
return soup.find("span", class_="ActorStore-statusNbHitsNumber").text

I get three dots (...) instead of the number 895. The parsed element is <span class="ActorStore-statusNbHitsNumber">...</span>

How can I get the number?


Solution

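  • The number is not in the HTML that requests receives: the page fills it in with JavaScript after loading, so BeautifulSoup only sees the ... placeholder. If you inspect the network calls in your browser (press F12) and filter by XHR, you'll see that the data is loaded dynamically by a POST request.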

    You can mimic that request by sending the same JSON data. There's no need for BeautifulSoup; the requests module alone is enough.

    Here is a complete working example:

    import requests
    
    
    data = {
        "query": "",
        "page": 0,
        "hitsPerPage": 24,
        "restrictSearchableAttributes": [],
        "attributesToHighlight": [],
        "attributesToRetrieve": [
            "title",
            "name",
            "username",
            "userFullName",
            "stats",
            "description",
            "pictureUrl",
            "userPictureUrl",
            "notice",
            "currentPricingInfo",
        ],
    }
    response = requests.post(
        "https://ow0o5i3qo7-dsn.algolia.net/1/indexes/prod_PUBLIC_STORE/query?x-algolia-agent=Algolia%20for%20JavaScript%20(4.12.1)%3B%20Browser%20(lite)&x-algolia-api-key=0ecccd09f50396a4dbbe5dbfb17f4525&x-algolia-application-id=OW0O5I3QO7",
        json=data,
    )
    
    
    # "nbHits" holds the total number of matching records (the count shown on the page)
    print(response.json()["nbHits"])
    

    Output:

    895
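
    The request above can also be wrapped in a small helper that sets a timeout and fails loudly on HTTP errors. This is only a sketch: the name fetch_hit_count, the post parameter, and the 10-second default timeout are illustrative assumptions, not part of the original answer. Passing the POST callable in (for example requests.post) keeps the helper easy to try against a stub.

```python
from typing import Any, Callable


def fetch_hit_count(
    url: str,
    data: dict[str, Any],
    post: Callable[..., Any],
    timeout: float = 10.0,
) -> int:
    """POST `data` as JSON to `url` and return the "nbHits" field of the reply.

    `post` is any callable with the calling convention of requests.post
    (a hypothetical parameter, introduced here for illustration).
    """
    response = post(url, json=data, timeout=timeout)
    # Raise on 4xx/5xx instead of trying to parse an error body.
    response.raise_for_status()
    payload = response.json()
    if "nbHits" not in payload:
        raise KeyError("response JSON has no 'nbHits' field")
    return int(payload["nbHits"])
```

    With the data dictionary from the example above and a variable url holding the same URL string, a call would look like fetch_hit_count(url, data, requests.post).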
    

    To see the full JSON response and find the key/value pairs you need, pretty-print it:

    from pprint import pprint
    pprint(response.json(), indent=4)
    

    Partial output:

    {   'exhaustiveNbHits': True,
        'exhaustiveTypo': True,
        'hits': [   {   'currentPricingInfo': None,
                        'description': 'Crawls arbitrary websites using the Chrome '
                                       'browser and extracts data from pages using '
                                       'a provided JavaScript code. The actor '
                                       'supports both recursive crawling and lists '
                                       'of URLs and automatically manages '
                                       'concurrency for maximum performance. This '
                                       "is Apify's basic tool for web crawling and "
                                       'scraping.',
                        'name': 'web-scraper',
                        'objectID': 'moJRLRc85AitArpNN',
                        'pictureUrl': 'https://apify-image-uploads-prod.s3.amazonaws.com/moJRLRc85AitArpNN/Zn8vbWTika7anCQMn-SD-02-02.png',
                        'stats': {   'lastRunStartedAt': '2022-03-06T21:57:00.831Z',
                                     'totalBuilds': 104,
                                     'totalMetamorphs': 102660,
                                     'totalRuns': 68036112,
                                     'totalUsers': 23492,
                                     'totalUsers30Days': 1726,
                                     'totalUsers7Days': 964,
                                     'totalUsers90Days': 3205},