web-scraping, xmlhttprequest, google-chrome-devtools, fetch-api, web-storage

how to find source of dynamically loaded content


i want to scrape the entries in this table. it is apparently populated by javascript after the page loads, so rather than scraping (with something like webdriver), i'd like to directly request the data from whatever service the javascript is talking to.

using chrome dev tools' network tab, i thought i'd narrowed it down to an xhr POST to https://www.oregon.gov/oha/ERD/_vti_bin/client.svc/ProcessQuery, but the response shown doesn't look related, and none of the other network activity items seem to be either.

how do i track down exactly what request is populating the table?


Solution

  • HTML5 introduced web storage (localStorage and sessionStorage), which, like cookies, keeps data in the browser. a site can use it as a cache, so after the first load the page may fill the table from storage without making a new request. in chrome dev tools, go to the application tab and, under storage, look through local storage and session storage for a key that holds the data you want. if it's there, clear the storage and refresh with the network tab open, and the xhr or fetch [1] request that got the data will show up. right-click that request and choose copy > copy as cURL to get a command that requests the data directly, with no scraping. you might worry that the service will refuse requests from outside its approved web front end, but cors can't stop you: it is enforced by browsers, not by curl or other http clients. (the server can still require cookies, tokens, or particular headers, but the copied curl command carries the ones the browser sent.)

    [1] fetch is a newer, promise-based alternative to xhr (XMLHttpRequest), available in major browsers since about 2015

    thank you to @sideshowbarker for pointing me to sessionStorage and answering my cors questions.
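
once you have the copied curl command, the same request can be replayed from a script. below is a minimal sketch in python using only the standard library. the url, headers, and payload are hypothetical placeholders (they are not taken from the site in the question), so swap in whatever your own "copy as cURL" output shows. it also assumes the service answers with json, which may not hold for every site.

```python
import json
import urllib.request

# hypothetical endpoint and headers: replace with the values from your own
# "copy as cURL" output. nothing here comes from the real site.
URL = "https://example.com/api/table-data"
HEADERS = {
    "Accept": "application/json",
    "Content-Type": "application/json",
    "User-Agent": "Mozilla/5.0",
}


def build_request(url, payload=None, headers=None):
    """Build a GET (no payload) or POST (json payload) request without sending it."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    method = "GET" if data is None else "POST"
    return urllib.request.Request(url, data=data, headers=headers or {}, method=method)


def fetch_json(request, timeout=30):
    """Send the request and decode the response body as json."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # the payload is a made-up example; use the body from the copied request
    req = build_request(URL, payload={"page": 1}, headers=HEADERS)
    print(req.get_method(), req.full_url)
    # rows = fetch_json(req)  # uncomment once URL and HEADERS are real
```

keeping the request construction separate from the network call makes it easy to compare what the script will send against what the network tab shows before actually hitting the service.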