python html web-scraping scrapy documentation

Web scraping redoc web api

I am currently learning web scraping, and one of my tasks is to scrape API documentation that is generated with Redoc (OpenAPI/Swagger-generated API Reference Documentation: https://github.com/Redocly/redoc).

To learn the structure, I went to their GitHub page and clicked on the live demo.

I am using Scrapy, and here is the code I use to simply extract the HTML of the website:

import scrapy

class QuoteSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://redocly.github.io/redoc/'
    ]

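    def parse(self, response):
        # Use the last path segment of the URL ("redoc") in the file name.
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        # response.body holds the raw bytes exactly as downloaded.
        with open(filename, 'wb') as f:
            f.write(response.body)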

The issue is that after the scraper has run its course, a new file is created, as expected, but it is missing a large portion of the HTML (everything inside the container div).

Have any of you had this issue, not specifically with Redoc? If so, how did you solve it? Do you think this documentation generator is configured in a way that prevents it from being scraped?

Thank you!

Solution

Redoc is a React app, which means the actual HTML is built at runtime:

First, the skeleton of the page loads, which also loads the Redoc JavaScript.
Then Redoc downloads the OpenAPI JSON (or YAML) file and renders the actual HTML dynamically based on it.
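
To illustrate the two steps above, here is a minimal sketch of what a non-browser client sees. It assumes the page embeds Redoc through a `<redoc spec-url="...">` element, which is one common way to embed it but not necessarily what every site (or the live demo) does. The `SKELETON` markup, the `example.com` URL, and the helper names are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SpecUrlFinder(HTMLParser):
    """Collects the spec-url attribute of every <redoc> element."""

    def __init__(self):
        super().__init__()
        self.spec_urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "redoc":
            for name, value in attrs:
                if name == "spec-url" and value:
                    self.spec_urls.append(value)


def find_spec_urls(html, base_url):
    """Return absolute URLs of the OpenAPI files referenced by the skeleton."""
    finder = SpecUrlFinder()
    finder.feed(html)
    return [urljoin(base_url, url) for url in finder.spec_urls]


# Hypothetical skeleton: this is roughly all a plain HTTP client receives.
SKELETON = """
<html><body>
  <redoc spec-url="openapi.yaml"></redoc>
  <script src="redoc.standalone.js"></script>
</body></html>
"""

print(find_spec_urls(SKELETON, "http://example.com/docs/"))
```

The skeleton contains only a pointer to the spec and a script tag; none of the rendered documentation is present until the JavaScript runs.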

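This is similar for many apps built with modern JS frameworks (Vue.js, React, Angular). Scrapy only downloads the initial response and does not execute JavaScript, which is why your file contains just the skeleton. To scrape such pages, you have to actually load the page in a browser so that all the JavaScript runs.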

I believe the most common way to do this nowadays is to use Puppeteer (there is a Python binding: https://github.com/pyppeteer/pyppeteer/).
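
A rough sketch of how that could look with pyppeteer, assuming it is installed and can download or find a Chromium build. The function names are made up for illustration, and waiting for `networkidle0` is only one possible heuristic for "rendering is done"; a large spec may need a different wait strategy.

```python
import asyncio


def filename_for(url):
    """Mirror the file naming used by the spider in the question."""
    page = url.split("/")[-2]
    return f"quotes-{page}.html"


async def fetch_rendered_html(url):
    """Load the page in headless Chromium and return the HTML after JS has run."""
    # Imported here so this module can be loaded even without pyppeteer installed.
    from pyppeteer import launch

    browser = await launch()
    try:
        page = await browser.newPage()
        # Assumption: the page is fully rendered once the network goes quiet.
        await page.goto(url, waitUntil="networkidle0")
        return await page.content()
    finally:
        await browser.close()


async def save_rendered_page(url):
    """Write the rendered HTML next to where the Scrapy spider would write it."""
    html = await fetch_rendered_html(url)
    with open(filename_for(url), "w", encoding="utf-8") as f:
        f.write(html)
```

Running `asyncio.run(save_rendered_page('http://redocly.github.io/redoc/'))` should then write the rendered page, including the content inside the container div, to `quotes-redoc.html`.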