google-cloud-platform, google-cloud-functions

Are cloud functions an appropriate solution for running a long webscraping job?


I'm using a GCP cloud function, planning to trigger runs via HTTP requests. My function runs a web scraper to build a dataset, which should take around 30 minutes. I think I saw GCP options for scheduling jobs, but in my case I need to manually renew and pass in an API key on every run, and I thought it would be easier to do that by making HTTP requests to the GCP function endpoint and passing the API key in the request body.

However, the whole job has to finish before my cloud function can send back a response, so the request will always time out on the caller's side. This doesn't stop the job from running, but it seems to imply that cloud functions aren't made for running long jobs. Are GCP cloud functions a bad fit for what I'm trying to do?


Solution

  • No, I would not use cloud functions that take more than a minute or two to run as part of a browser-based user interface. You can run them for 9 minutes (1st gen) or even 60 minutes (2nd gen), and have them spawn other cloud functions that keep running longer, but that's a lot of complexity. While the longer-running cloud functions could be triggered via HTTPS, attaching them to some other event -- like the upload of a file to a storage bucket -- might make more sense. Calling a function that takes 9 minutes to finish directly from a web browser seems inadvisable.
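One way to keep the HTTP-triggered function fast is to acknowledge the request immediately and hand the long work off elsewhere. A minimal sketch of that pattern, independent of any particular functions framework (the handler name and response shape are illustrative, not a real API):

```python
def handle_scrape_request(body: dict) -> tuple[dict, int]:
    """Validate the request, kick off the long job somewhere else,
    and return right away instead of blocking for ~30 minutes."""
    api_key = body.get("api_key")
    if not api_key:
        return {"error": "missing api_key"}, 400
    # Here you would start the long-running work out-of-process,
    # e.g. by creating a worker VM via the Compute Engine API
    # (start_worker(api_key) is a hypothetical helper).
    return {"status": "started; check back later for results"}, 202
```

Returning 202 Accepted tells the browser the job was launched without forcing it to wait for completion.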

    Long-running, occasional processes that work in isolation from each other can each run in their own VM. You can install any collection of software you like into a VM, save it to an image or a snapshot, and use that to start new custom VMs. You can call the Compute Engine API from a Google Cloud Function to create new instances of the custom VM; starting a VM takes maybe 30 seconds. If you prefer, use the success/failure result from the cloud function to tell the user the task has started (or not), and to come back in about an hour to see the result (or try again). Have the custom VM post the results of the scraping somewhere the website can pick them up for display to the end user.
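    Creating the worker VM boils down to an authenticated `instances.insert` call against the Compute Engine API. A sketch of the request body the cloud function would send (the project, zone, image name `scraper-image`, machine type, and metadata key are all placeholders):

```python
def vm_insert_body(project: str, zone: str, api_key: str) -> dict:
    """Request body for the Compute Engine instances.insert call,
    creating a worker from a custom image. All names are placeholders."""
    return {
        "name": "scraper-job",
        "machineType": f"zones/{zone}/machineTypes/e2-small",
        "disks": [{
            "boot": True,
            "autoDelete": True,  # clean up the disk when the VM is deleted
            "initializeParams": {
                # Custom image with the scraper preinstalled (placeholder name)
                "sourceImage": f"projects/{project}/global/images/scraper-image",
            },
        }],
        "networkInterfaces": [{"network": "global/networks/default"}],
        # Pass the per-run API key to the VM via instance metadata;
        # the scraper reads it back from the metadata server after boot.
        "metadata": {"items": [{"key": "scraper-api-key", "value": api_key}]},
    }
```

    The body can be submitted with the `google-cloud-compute` client library or a plain authenticated REST POST from the cloud function.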

    Cloud functions are more expensive per vCPU and per GB of RAM, so on-demand VMs may have a lower operational cost. Spot VMs are cheaper still, but can be preempted, so the job might not finish.

    You can limit VM runtime on creation (so that it's not still running 3 months from now, generating big bills), and set a specific service account for the VM to establish its rights to access other resources -- such as read-only or read-write storage buckets. There's also a metadata field at VM creation that can be useful for communicating job parameters.
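    Those creation-time options are just more fields in the `instances.insert` request body, and the VM reads job parameters back from the metadata server at runtime. A sketch, assuming the newer `scheduling.maxRunDuration` field is available in your project (the one-hour cap, service-account email, and attribute key are illustrative):

```python
import urllib.request

# Fragment to merge into an instances.insert request body: cap the VM's
# lifetime and delete it when the cap is hit, and attach a service account
# whose scopes define what the VM is allowed to touch.
RUNTIME_AND_ACCESS = {
    "scheduling": {
        "maxRunDuration": {"seconds": 3600},   # hard stop after 1 hour
        "instanceTerminationAction": "DELETE",
    },
    "serviceAccounts": [{
        "email": "scraper-sa@my-project.iam.gserviceaccount.com",  # placeholder
        "scopes": ["https://www.googleapis.com/auth/devstorage.read_write"],
    }],
}

def read_job_parameter(key: str) -> str:
    """On the VM, read a job parameter that was set at creation time.
    The metadata server is only reachable from inside Compute Engine."""
    url = ("http://metadata.google.internal/computeMetadata/v1/"
           f"instance/attributes/{key}")
    req = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()
```

    At boot, the worker would call something like `read_job_parameter("scraper-api-key")` to recover the key the cloud function passed in.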

    Google Batch is layered over VMs and provides some additional management, but you might find you don't need it. Google also offers other batch options -- such as batch Jobs in Kubernetes (GKE) and hosted Argo Workflows -- but if you aren't using these already, they are probably too much for anything reasonably simple.