python, google-app-engine, google-cloud-platform, google-compute-engine, google-cloud-run

Execute very long-running tasks using Google Cloud


I have been using Google Cloud for a few weeks now, and my limited GCP knowledge has left me with a big problem.

I have a Python project whose goal is to "scrape" data from a website using its API. The project sends a few tens of thousands of requests per execution, which can take a very long time (a few hours, maybe more).

I have 4 Python scripts in my project, all orchestrated by a bash script.

The execution is as follows:

  • The first script reads a CSV file containing the instructions for the requests, executes the requests, and saves all the results in CSV files
  • The second script reads the previously created CSV files and generates a new CSV instruction file
  • The first script runs again with the new instructions and again saves the results in CSV files
  • The second script checks the results again and generates another instruction file ...
  • ... and so on a few times
  • The third script cleans the data, removes duplicates, and creates a single CSV file
  • The fourth script uploads the final CSV file to a storage bucket

Now I want to get rid of that bash script and automate the execution of those scripts approximately once a week.

The problem here is the execution time. Here is what I have already tested:

Google App Engine: The timeout of a request on GAE is limited to 10 minutes, and my functions can run for a few hours. GAE is not usable here.

Google Compute Engine: My scripts will run for at most 10-15 hours a week, so keeping a Compute Engine instance running around the clock for that would be too pricey.

What could I do to automate the execution of my scripts in a cloud environment? What solutions might I have overlooked that don't require changing my code?

Thank you


Solution

  • A simple way to accomplish this, without having to get rid of the existing bash script that orchestrates everything, would be:

    1. Include the bash script in the startup script for the instance.
    2. At the end of the bash script, include a shutdown command.
    3. Schedule the instance start using Cloud Scheduler. You'll have to make an authenticated call to the GCE API to start the existing instance.
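
For step 3, the call in question is the Compute Engine `instances.start` REST method, sent as an HTTP POST. Below is a minimal Python sketch of the URL and of the HTTP target a Cloud Scheduler job would carry. The project, zone, instance, and service account names are hypothetical placeholders, and the dictionary only mirrors the shape of the job's HTTP target rather than creating the job itself.

```python
# Hypothetical placeholders: replace with your own project, zone, and instance.
PROJECT = "my-project"
ZONE = "europe-west1-b"
INSTANCE = "scraper-vm"


def start_instance_url(project: str, zone: str, instance: str) -> str:
    """Return the Compute Engine REST URL for the instances.start method."""
    return (
        "https://compute.googleapis.com/compute/v1/"
        f"projects/{project}/zones/{zone}/instances/{instance}/start"
    )


def scheduler_http_target(url: str, service_account_email: str) -> dict:
    """Describe the HTTP target of a Cloud Scheduler job that POSTs to `url`.

    The OAuth token is what makes the call authenticated; the service account
    (an assumption here) must be allowed to start the instance.
    """
    return {
        "uri": url,
        "httpMethod": "POST",
        "oauthToken": {"serviceAccountEmail": service_account_email},
    }


if __name__ == "__main__":
    url = start_instance_url(PROJECT, ZONE, INSTANCE)
    # The service account address below is a made-up example.
    print(scheduler_http_target(url, "scheduler@my-project.iam.gserviceaccount.com"))
```

A weekly cron expression such as `0 3 * * 1` (Mondays at 03:00) on the job would match the "once a week" requirement.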

    With that, your instance will start on a schedule, run the startup script (your existing orchestrating script), and shut down once it has finished.
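
If you would eventually like to replace the bash script, the same idea can be sketched in Python. This is only an illustration under stated assumptions: the four file names are hypothetical stand-ins for the scripts in the question, the number of rounds is a guess, and each round is assumed to run the first script followed by the second. The `finally` block powers the VM off even when a step fails, so a crashed run does not keep billing.

```python
import subprocess
import sys

# Hypothetical file names standing in for the four scripts in the question.
FETCH, PLAN, CLEAN, UPLOAD = "script1.py", "script2.py", "script3.py", "script4.py"


def build_steps(rounds: int) -> list:
    """Return the commands in order: (fetch, plan) per round, then clean, upload."""
    steps = []
    for _ in range(rounds):
        steps.append([sys.executable, FETCH])
        steps.append([sys.executable, PLAN])
    steps.append([sys.executable, CLEAN])
    steps.append([sys.executable, UPLOAD])
    return steps


def run_pipeline(rounds: int, run=subprocess.run, shutdown: bool = True) -> None:
    """Run each step, stopping at the first failure, then power off the VM."""
    try:
        for cmd in build_steps(rounds):
            # check=True raises CalledProcessError on a non-zero exit code.
            run(cmd, check=True)
    finally:
        if shutdown:
            # Startup scripts usually run as root, so sudo may be unnecessary.
            run(["sudo", "shutdown", "-h", "now"], check=False)
```

The startup script would then only need to call something like `run_pipeline(rounds=3)`. The `run` parameter exists so the sequence can be tried locally with a fake runner, without launching anything or shutting the machine down.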