I am currently learning web scraping, and one of my tasks is to scrape API documentation built with Redoc (OpenAPI/Swagger-generated API Reference Documentation: https://github.com/Redocly/redoc).
To learn the structure, I went to their GitHub repository and opened the live demo.
I am using Scrapy and here's the code that I am using to simply extract the HTML of the website:
    import scrapy

    class QuoteSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://redocly.github.io/redoc/'
        ]

        def parse(self, response):
            # Use the last path segment of the URL ("redoc") in the file name.
            page = response.url.split("/")[-2]
            filename = f'quotes-{page}.html'
            # Save the raw response body exactly as the server returned it.
            with open(filename, 'wb') as f:
                f.write(response.body)
The issue is that, after the scraper has run its course, a new file is created as expected, but it is missing a large portion of the HTML (everything inside the container div).
Has anyone run into this issue, not necessarily with Redoc? If so, how did you solve it? Could it be a configuration of this documentation generator that prevents it from being scraped?
Thank you!
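Redoc is a React app, which means the actual HTML is built at runtime by JavaScript in the browser. Scrapy only downloads the initial response and does not run that JavaScript, so the container div comes back empty.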
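To make that concrete, here is a minimal sketch using only the standard library. The STATIC_HTML string is a hypothetical stand-in for what a plain HTTP client receives from a client-rendered page (it is not the real Redoc markup), and visible_text is a helper name made up for this example. It shows that such a response carries no readable text at all until the script has run.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the response a plain HTTP client gets from a
# client-rendered page: an empty mount point plus a script bundle.
STATIC_HTML = """
<html>
  <body>
    <div id="container"></div>
    <script src="bundle.js"></script>
  </body>
</html>
"""


class VisibleTextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the human-readable text contained in an HTML string."""
    parser = VisibleTextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Running visible_text on the file your spider saved is a quick way to confirm that the content is rendered client-side rather than blocked by some configuration.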
The same is true of many apps built with modern JS frameworks (Vue.js, React, Angular). To scrape them, you have to load the page in an actual browser so that all the JavaScript runs.
I believe the most common way to do this nowadays is to use Puppeteer (there is a Python port: https://github.com/pyppeteer/pyppeteer/).
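A rough sketch of what that could look like with pyppeteer, assuming it is installed (pip install pyppeteer) and can download its bundled Chromium on first run. The helper names filename_for, fetch_rendered_html and main are made up for this example, and waiting for "networkidle0" is just one reasonable guess at when React has finished rendering; you may need a different wait condition for other sites.

```python
import asyncio
from urllib.parse import urlparse

URL = "http://redocly.github.io/redoc/"


def filename_for(url):
    """Mirror the question's naming scheme: quotes-<last path segment>.html."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    page = segments[-1] if segments else parsed.netloc
    return f"quotes-{page}.html"


async def fetch_rendered_html(url):
    """Load the page in headless Chromium and return the rendered HTML."""
    # Imported here so filename_for stays usable without pyppeteer installed.
    from pyppeteer import launch

    browser = await launch()
    try:
        page = await browser.newPage()
        # Assumption: once the network is idle, the app has rendered.
        await page.goto(url, {"waitUntil": "networkidle0"})
        return await page.content()
    finally:
        await browser.close()


def main():
    html = asyncio.run(fetch_rendered_html(URL))
    with open(filename_for(URL), "w", encoding="utf-8") as f:
        f.write(html)
```

Calling main() should write quotes-redoc.html, this time with the container div filled in. If you want to stay inside Scrapy, the same idea applies: something has to execute the JavaScript before you parse the HTML.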