google-cloud-platform, google-cloud-dataflow, google-cloud-pubsub, google-cloud-iot

Dataflow resource usage


After following the Dataflow tutorial, I used the Pub/Sub Topic to BigQuery template to parse JSON records into a table. The job has been streaming for 21 days. During that time I have ingested about 5,000 JSON records, each containing 4 fields (around 250 bytes).

After the bill came this month, I started looking into resource usage. The job has used 2,017.52 vCPU hr, 7,565.825 GB hr of memory, and 620,407.918 GB hr of HDD in total.

This seems absurdly high for the tiny amount of data I have been ingesting. Is there a minimum amount of data I should have before using Dataflow? It seems overpowered for small cases. Is there another preferred method for ingesting data from a Pub/Sub topic? Is there a different configuration when setting up a Dataflow job that uses fewer resources?


Solution

  • It seems that the numbers you mentioned correspond to not having customized the job resources. By default, streaming jobs use an n1-standard-4 machine:

    Streaming worker defaults: 4 vCPU, 15 GB memory, 400 GB Persistent Disk.
    4 vCPU x 24 hrs x 21 days = 2,016 vCPU hr
    15 GB x 24 hrs x 21 days = 7,560 GB hr

    If you really need streaming in Dataflow, you will need to pay for resources allocated even if there is nothing to process.

    Options:

    Optimizing Dataflow

    • Considering that the number and size of the JSON records you need to process are really small, you can reduce the cost to approximately 1/4 of the current charge. You just need to set the job to use an n1-standard-1 machine, which has 1 vCPU and 3.75 GB of memory (see the sketch below). Also be careful with the maximum number of workers: unless you are planning to increase the load, one worker may be enough.
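
    As an illustration, here is a minimal sketch of launching the Pub/Sub-to-BigQuery template with those settings through the Dataflow templates REST API and the Python API client. The project, topic, table, bucket and template path below are placeholders, and the exact template parameters can vary with the template version:

        # Sketch: launch the Pub/Sub-to-BigQuery template with a smaller worker
        # type and a single worker (all names/paths below are placeholders).
        from googleapiclient.discovery import build  # pip install google-api-python-client

        dataflow = build("dataflow", "v1b3")  # uses Application Default Credentials

        request = dataflow.projects().locations().templates().launch(
            projectId="my-project",
            location="us-central1",
            gcsPath="gs://dataflow-templates/latest/PubSub_to_BigQuery",
            body={
                "jobName": "pubsub-to-bq-small",
                "parameters": {
                    "inputTopic": "projects/my-project/topics/my-topic",
                    "outputTableSpec": "my-project:my_dataset.my_table",
                },
                "environment": {
                    "machineType": "n1-standard-1",  # instead of the default n1-standard-4
                    "maxWorkers": 1,                 # cap autoscaling at one worker
                    "tempLocation": "gs://my-bucket/temp",
                },
            },
        )
        print(request.execute())

    The same machine type and worker limits can also be set when launching the template from the Cloud Console or the gcloud CLI.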

    Your own way

    • If you don't really need streaming (and at this volume you probably don't), you can just write a function that pulls messages using synchronous pull and writes them to BigQuery, and schedule it according to your needs (see the sketch below).
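
    A minimal sketch of that approach, assuming the messages are UTF-8 JSON objects whose keys match the BigQuery column names; the project, subscription and table names are placeholders:

        # Sketch: pull a batch of Pub/Sub messages and stream them into BigQuery.
        import json

        from google.cloud import bigquery, pubsub_v1

        PROJECT = "my-project"                       # placeholder
        SUBSCRIPTION = "my-subscription"             # placeholder
        TABLE = "my-project.my_dataset.my_table"     # placeholder

        def pull_and_load(max_messages=100):
            subscriber = pubsub_v1.SubscriberClient()
            sub_path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)

            # Synchronous pull: returns at most max_messages currently available.
            response = subscriber.pull(
                request={"subscription": sub_path, "max_messages": max_messages}
            )
            if not response.received_messages:
                return

            rows = [
                json.loads(msg.message.data.decode("utf-8"))
                for msg in response.received_messages
            ]

            # Streaming insert into BigQuery; returns a list of per-row errors.
            errors = bigquery.Client().insert_rows_json(TABLE, rows)
            if errors:
                raise RuntimeError(f"BigQuery insert errors: {errors}")

            # Acknowledge only after the rows were written, so nothing is lost.
            subscriber.acknowledge(
                request={
                    "subscription": sub_path,
                    "ack_ids": [msg.ack_id for msg in response.received_messages],
                }
            )

        if __name__ == "__main__":
            pull_and_load()

    Such a script can then be run on whatever schedule fits your ingestion volume (cron, Cloud Scheduler, etc.).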

    Cloud Functions (my recommendation)

    "Cloud Functions provides a perpetual free tier for compute-time resources, which includes an allocation of both GB-seconds and GHz-seconds. In addition to the 2 million invocations, the free tier provides 400,000 GB-seconds, 200,000 GHz-seconds of compute time and 5GB of Internet egress traffic per month."[1]

    [1] https://cloud.google.com/functions/pricing