Search code examples
pythonweb-scrapingxmlhttprequest

how to automatically retrieve the request URL (for python) from an XHR request of a page loaded via JavaScript


Here is the URL that I'm trying to scrape: https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm

I'm trying to scrape the webpage using Python which mean I will require the XHR request for this page as it is loaded via JavaScript.

Upon inspection of the Network under Developer Tools, I can see the XHR request: a10-qq320196292019.htm which produces the request URL: https://www.sec.gov/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm

My question is two-fold,

  1. How can I automatically get this request URL if i am only accessing using the URL given initially and,
  2. How do I know that this is THE XHR request I need? This particular URL works for my needs but I noticed that there were many other XHR reqeusts as well. How does one differentiate?

Solution

  • In this case, I don't think you need to go that route. The link you're using is an ixbrl view of the actual html document. The url for the html doc is embedded in that first link. All you have to do is extract it:

    url = 'https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm'
    html_url = url.replace('/ix?doc=','')
    html_url
    

    Output:

    'https://www.sec.gov/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm