python-3.xgoogle-bigquerystreamgoogle-cloud-functionsgoogle-cloud-pubsub

Differences between BigQuery BQ.insert_rows_json and BQ.load_from_json?


I want to stream data into BigQuery and I was thinking in use PubSub + Cloud Functions, since there is no transformation needed (for now, at least) and using Cloud Data Flow feels like a little bit over kill for just inserting rows to a table. I am correct?

The data is streamed from a GCP VM using a Python script into PubSub and it has the following format:

{'SEGMENT':'datetime':'2020-12-05 11:25:05.64684','values':(2568.025,2567.03)}

The BigQuery schema is datetime:timestamp, value_A: float, value_B: float.

My questions with all this are:

a) Do I need to push this into BigQuery as json/dictionary with all values as strings or it has to be with the data type of the table?

b) What's the difference between using BQ.insert_rows_json and BQ.load_table_from_json and which one should I use for this task?


EDIT:

What I'm trying to get is actually market data of some assets. Say around 28 instruments and capture all their ticks. On an average day, there are ~60.k ticks per instrument, so we are talking about ~33.6 M invocations per month. What is needed (for now) is to insert them in a table for further analysis. I'm currently not sure if real streaming should be performed or loads per batch. Since the project is in doing analysis yet, I don't feel that Data Flow is needed, but PubSub should be used since it allows to scale to Data Flow easier when the time comes. This is my first implementation of doing streaming pipelines and I'm using all what I've learned through courses and reading. Please, correct me if I'm having a wrong approach :).

What I would absolutely love to do is, for example, perform another insert to another table when the price difference between one tick and the n'th tick is, for example, 10. For this, should I use Data Flow or the Cloud Function approach is still valid? Because this is like a trigger condition. Basically, the trigger would be something like:

if price difference >= 10:
     process all these ticks
     insert the results in this table

But I'm unsure how to implement this trigger.


Solution

  • In addition to the great answer of Marton (Pentium10)

    a) You can stream a JSON in BigQuery, a VALID json. your example isn't. About the type, there is an automatic coercion/conversion according with your schema. You can see this here

    b) The load job loads file in GCS or a content that you put in the request. The batch is asynchronous and can take seconds or minutes. In addition, you are limited to 1500 load per days and per table -> 1 per minutes works (1440 minutes per day). There is several interesting aspect of the load job.

    1. Firstly, it's free!
    2. Your data are immediately loaded in the correct partition and immediately request-able in the partition
    3. If the load fail, no data are inserted. So, it's easiest to replay a file without having doubled values.

    At the opposite, the streaming job insert in real time the data into BigQuery. It's interesting when you have real time constraint (especially for visualisation, anomalie detections,...). But there is some bad sides

    1. You are limited to 500k rows per seconds (in EU and US), 100k rows in other regions, and 1Gb max per seconds
    2. The data aren't immediately in the partition, they are in a buffer name UNPARTITIONED for a while or up to have this buffer full.. So you have to take into account this specificity when you build and test your real time application.
    3. It's not free. The cheapest region is $0.05 per Gb.

    Now that you are aware of this, ask yourselves about your use case.

    • If you need real time (less than 2 minutes of delay), no doubt, streaming is for you.
    • If you have few Gb per month, streaming is also the easiest solution, for few $
    • If you have a huge volume of data (more than 1Gb per second), BigQuery isn't the good service, consider BigTable (that you can request with BigQuery as a federated table)
    • If you have an important volume of data (1 or 2Gb per minutes) and your use case required data freshness at the minute+, you can consider a special design
    1. Create a PubSub pull subscription
    2. Create a HTTP triggered Cloud Function (or a Cloud Run service) that pull the subscription for 1 minutes and then submit the pulled content to BigQuery as a load job (no file needed, you can post in memory content directly to BigQuery). And then exist gracefully
    3. Create a Cloud Scheduler that trigger your service every minute.

    Edit 1:

    The cost shouldn't drive your use case.

    If, for now, it's only for analytics, you simply imagine to trigger once per days your job to pull the full subscriptions. With your metrics: 60k metrics * 28 instruments * 100 bytes (24 + memory loss), you have only 168Mb. You can store this in Cloud Functions or Cloud Run memory and perform a load job.

    Streaming is really important for real time!

    Dataflow, in streaming mode, will cost you, at least $20 per month (1 small worker of type n1-standard1. Much more than 1.5Gb of streaming insert in BigQuery with Cloud Functions.

    Eventually, about your smart trigger to stream or to batch insert, it's not really possible, you have to redesign the data ingestion if you change your logic. But before all, only if your use case requires this!!