Search code examples
performancegoogle-bigquerynumeric

Is Numeric with specified precision/scale more efficient in BigQuery


I am finding that a lot of the time I only need int32 or even int16 in many cases, however BigQuery doesn't support any variations of INT other than Int64 (although has aliases that make it appear they support things like SmallInt).

I was wondering if anyone had insight on whether using Numeric with Scale set to 0 and precision set to 10 to represent i32 values would be more effective on storage and/or computation costs. I am not sure if BQ effectively utilizes the scale/precision parameters, but I am not sure why they would offer them if they didn't intend on utilizing them to maximize the efficiency of storage.

Reviewed google documentation, I couldn't find anything on optimizing table schemas, just optimizing queries. If you have any credible documentation pointing to most efficient schemas, please share!


Solution

  • As long you do not work with fixed slots, the computation costs depend only on how much of data from storage is queried. Therefore, storing data effective is a good idea.

    The smallest numeric value is int64. All other numeric format need more space. For numeric data types, "Applying restrictions with precision and scale does not impact the storage size of the underlying data type."

    However, having non signed int32 values, these can be saved more efficient in a string. The largest value of an int16 is 65.536, this string has a length of 5 bytes. Adding two bytes for encoding the string length, we end up by 7 bytes for one entry. One byte less than the int64 data type.

    Of course, two non signed int32 values (or four int16) could be converted into one int64 by a*65536+b. However, no one would understand this data in the table anymore.

    To understand the storage in detail, we generate a dummy table having several columns:

    CREATE OR REPLACE TABLE
      Test.aaa AS
    SELECT
      "65000" AS str,
      65000 AS val,
      x,
      CAST(64000 AS numeric ) AS num,
      CAST(64000 AS BIGNUMERIC ) AS bignum,
    FROM
      UNNEST(GENERATE_ARRAY(1,10000)) x
    

    Building a query for each column

    SELECT
      bignum
    FROM
      `Test.aaa`
    

    shows us the amount of storage data in that column.

    column data type data type size query amount
    str string 7 68.36 kB 
    val int64 8 78.13 kB 
    x int64 8 78.13 kB 
    num numeric 16 156.25 kB 
    bignum bignum 32 312.5 kB 

    Since the table has 10.000 rows, the data type size times 10.000 divided by 1024 gives the query amount.