Tags: ruby, google-bigquery, google-api-ruby-client

Bigquery Streaming inserts, persistent or new http connection on every insert?


I am using google-api-ruby-client to stream data into BigQuery. Whenever a request comes in, it is pushed into Redis as a queue, and a Sidekiq worker then inserts it into BigQuery. I believe this involves opening a new HTTPS connection to BigQuery for every insert.

The way I have it set up: events post every 1 second or when the batch size reaches 1 MB (one megabyte), whichever occurs first. This is per worker, so the BigQuery API may receive tens of HTTP posts per second over multiple HTTPS connections.

This is done using the provided API client by Google.

Now the question: for streaming inserts, which is the better approach?

  1. A persistent HTTPS connection. If so, should it be a global connection shared across all requests, or something else?
  2. Opening a new connection per insert, as we are doing now with google-api-ruby-client.
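For context, the flush policy described above (post every second or at 1 MB, whichever comes first) can be sketched as a small buffer class. This is illustrative only; the class and method names are assumptions, not the actual worker code, and the block passed to the constructor stands in for the call that performs the BigQuery insert:

```ruby
# Buffer events and flush either when the payload reaches 1 MB or when a
# second has passed since the last flush, whichever comes first.
class EventBuffer
  MAX_BYTES    = 1_000_000 # 1 MB batch limit
  MAX_INTERVAL = 1.0       # flush at least once per second

  def initialize(&flusher)
    @flusher    = flusher  # callback that performs the actual insert
    @rows       = []
    @bytes      = 0
    @last_flush = Time.now
  end

  def add(row_json)
    @rows << row_json
    @bytes += row_json.bytesize
    flush if @bytes >= MAX_BYTES || (Time.now - @last_flush) >= MAX_INTERVAL
  end

  def flush
    return if @rows.empty?
    @flusher.call(@rows)   # e.g. a tabledata.insertAll request per batch
    @rows       = []
    @bytes      = 0
    @last_flush = Time.now
  end
end
```

With this shape, each Sidekiq worker opens at most one HTTP request per flush rather than one per event.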

Solution

  • I think it's too early to be talking about these optimizations. Other context is also missing, such as whether you have exhausted the kernel's TCP connections, or how many connections are sitting in the TIME_WAIT state, and so on.

    1. Until the worker pool reaches 1,000 connections per second on the same machine, you should stick with the default mode the library offers.

    Beyond that, optimizing anything here would require a lot more context and a deep understanding of how it all works.

    2. On the other hand, you can batch more rows into the same streaming insert request. The limits are:

    Maximum row size: 1 MB
    HTTP request size limit: 10 MB
    Maximum rows per second: 100,000 rows per second, per table
    Maximum rows per request: 500
    Maximum bytes per second: 100 MB per second, per table
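    The chunking implied by those limits can be sketched as a plain Ruby function that splits a backlog of JSON-encoded rows into request-sized batches of at most 500 rows and 10 MB each. This is illustrative logic, not the client library's API:

```ruby
# Split rows into batches that respect the streaming insert limits:
# at most 500 rows and at most 10 MB of payload per request.
MAX_ROWS_PER_REQUEST  = 500
MAX_BYTES_PER_REQUEST = 10_000_000

def batch_rows(rows)
  batches = [[]]
  bytes   = 0
  rows.each do |row|
    size = row.bytesize
    # Start a new batch when either limit would be exceeded.
    if batches.last.size >= MAX_ROWS_PER_REQUEST || bytes + size > MAX_BYTES_PER_REQUEST
      batches << []
      bytes = 0
    end
    batches.last << row
    bytes += size
  end
  batches
end
```

    Each resulting batch would then be one streaming insert request, so a backlog of thousands of small rows costs only a handful of HTTP calls instead of one per row. (A single row never exceeds the 10 MB check in practice, since the per-row limit is 1 MB.)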

    3. Read my other recommendations in Google BigQuery: Slow streaming inserts performance.

    To better understand the complex situation when ports are exhausted, here is some context.

    Let's say a machine has a pool of 30,000 ports and sees 500 new connections per second (typical):

    after 1 second you have 29,500 free ports
    after 10 seconds you have 25,000
    after 30 seconds you have 15,000
    at 59 seconds you are down to 500
    at 60 seconds the first second's connections leave TIME_WAIT, so you get
    500 back and usage holds steady at 29,500. Everyone is happy.

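    The arithmetic above can be written out as a small Ruby function. The pool size, rate, and TIME_WAIT length are the walkthrough's numbers, and the counting follows its convention that at each second boundary the batch opened TIME_WAIT seconds ago is released before new connections are tallied:

```ruby
POOL      = 30_000  # local ports available
RATE      = 500     # new connections per second
TIME_WAIT = 60      # seconds a closed connection keeps holding its port

# Free ports after `seconds` of steady traffic: only the most recent
# (TIME_WAIT - 1) full seconds of connections still hold their ports.
def free_ports_after(seconds)
  in_use = [seconds, TIME_WAIT - 1].min * RATE
  POOL - in_use
end
```

    At 500 connections per second the pool never quite empties, which is why bumping the rate to 550 per second, as described next, tips the machine into exhaustion.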

    Now say that you're seeing an average of 550 connections a second. Suddenly there aren't any available ports to use.

    So your first option is to bump up the range of allowed local ports. That's easy enough, but even if you open it up as far as you can, from 1025 to 65535, that's still only about 64,000 ports; with a 60-second TCP_TIMEWAIT_LEN, you can sustain an average of roughly 1,000 connections a second, and still no persistent connections are in use.

    Port exhaustion is discussed in more depth here: http://www.gossamer-threads.com/lists/nanog/users/158655