Tags: python, google-cloud-platform, request-headers, python-requests-html

Some HTML tags missing when scraping with Python requests on a GCP VM


I'm trying to scrape a website. The results are as expected when I run my code on my own local server, but when I deploy it to a GCP VM, some of the HTML tags are missing. I've made sure that the source code is the same both locally and on GCP.

Interestingly, changing the headers makes even more tags go missing. So far, these headers have worked best:

headers = {
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1 Edg/87.0.4280.141",
    "Content-Type": "application/x-www-form-urlencoded",
    "Connection": "keep-alive",
}

Are the missing tags caused by the headers I'm sending, or by something else on the GCP VM?


Solution

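  • To recap the troubleshooting done in the comments:

    • GCP by itself does not filter headers.
    • Depending on the website, scraping results may differ because the requests come from a different IP address; some sites treat cloud-provider IP ranges differently from residential ones.
    • If you see any discrepancies between dumps made locally and on GCP, make sure the code and all dependencies (including their versions) are the same in both environments.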

    You can find more information about scraping from GCP here.
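
One way to check the last point is to take a small "fingerprint" of each environment and of the HTML it received, then compare the two. The following is only a sketch using the standard library: the helper names (`tag_counts`, `fingerprint`, `missing_tags`) and the list of packages to check are assumptions for illustration, not part of the original troubleshooting.

```python
import hashlib
import platform
from collections import Counter
from html.parser import HTMLParser
from importlib import metadata


class TagCounter(HTMLParser):
    """Count every opening (or self-closing) tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_counts(html):
    """Return a Counter mapping tag name -> number of occurrences."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def package_version(name):
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


# The package names below are an assumption; list whatever your scraper uses.
def fingerprint(html, packages=("requests", "requests-html", "lxml")):
    """Summarize the interpreter, dependency versions, and the HTML received."""
    return {
        "python": platform.python_version(),
        "packages": {name: package_version(name) for name in packages},
        "sha256": hashlib.sha256(html.encode("utf-8")).hexdigest(),
        "tags": dict(tag_counts(html)),
    }


def missing_tags(local_counts, remote_counts):
    """Tags that occur fewer times remotely: tag -> (local, remote) counts."""
    return {
        tag: (n, remote_counts.get(tag, 0))
        for tag, n in local_counts.items()
        if remote_counts.get(tag, 0) < n
    }
```

To use it, call `fingerprint(response.text)` on both machines right after the request and save the result. If the Python or package versions differ, align them first. If they match but `missing_tags` still reports differences, the server is most likely returning different HTML to the GCP address.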