Search code examples
sqljoingoogle-bigquerycommon-table-expressionbitcoin

Standard SQL: rewrite an explicit cross join to WITH clause


Consider a table transactions which has two JSONB fields outputs and inputs. The question is how can one rewrite this query using a WITH clause?

-- Note: This query will process 111.85 MB when run.
SELECT
    transactions.hash AS CREATED_TX_HASH,
    transactions.block_number AS CREATED_BLOCK_ID,
    transactions.block_timestamp AS CREATED_BLOCK_TIME,
    outputs.index AS CREATED_INDEX,
    outputs.value / 1e8 AS OUTPUT_VALUE_BTC,
    transactions.hash AS SPENT_CREATED_TX_HASH,
    transactions.block_number AS SPENDING_BLOCK_ID,
    transactions.block_timestamp AS SPENDING_BLOCK_TIME,
    inputs.index AS SPENT_CREATED_INDEX,
    inputs.spent_transaction_hash as SPENDING_TX_HASH,
    inputs.spent_output_index AS SPENDING_INDEX,
    inputs.value / 1e8 AS INPUT_VALUE_BTC
FROM `bigquery-public-data.crypto_bitcoin.transactions` as transactions
CROSS JOIN
    transactions.outputs as outputs
CROSS JOIN
    transactions.inputs as inputs
-- FROM `bigquery-public-data.crypto_bitcoin.transactions` as transactions,
--     transactions.outputs as outputs,
--     transactions.inputs as inputs   
WHERE transactions.block_timestamp_month < '2009-02-01' 
ORDER BY 3

What I need is to create CTEs in order to keep temporary result sets as below:

WITH outputs AS (
  SELECT
      transactions.hash AS CREATED_TX_HASH,
      transactions.block_number AS CREATED_BLOCK_ID,
      transactions.block_timestamp AS CREATED_BLOCK_TIME,
      outputs.index AS CREATED_INDEX,
      outputs.value / 1e8 AS OUTPUT_VALUE_BTC
  FROM `bigquery-public-data.crypto_bitcoin.transactions` as transactions,
      transactions.outputs as outputs
  WHERE transactions.block_timestamp_month < '2009-02-01'  

), inputs AS (

  SELECT
    transactions.hash AS SPENT_CREATED_TX_HASH,
    transactions.block_number AS SPENDING_BLOCK_ID,
    transactions.block_timestamp AS SPENDING_BLOCK_TIME,
    inputs.index AS SPENT_CREATED_INDEX,
    inputs.spent_transaction_hash as SPENDING_TX_HASH,
    inputs.spent_output_index AS SPENDING_INDEX,
    inputs.value / 1e8 AS INPUT_VALUE_BTC
  FROM `bigquery-public-data.crypto_bitcoin.transactions` as transactions,
      transactions.inputs as inputs
  WHERE transactions.block_timestamp_month < '2009-02-01'
)

But I do not know which SELECT statement on these two CTEs produces the same result as the original query above.


Solution

  • You'll need to join them by the CREATED_BLOCK_ID and the SPENDING_BLOCK_ID, Additionally I used the ROW_NUMBER statement to avoid duplicated values.

    Below Query should work fine for you:

        WITH outputs AS (
      SELECT
          transactions.hash AS CREATED_TX_HASH,
          transactions.block_number AS CREATED_BLOCK_ID,
          transactions.block_timestamp AS CREATED_BLOCK_TIME,
          outputs.index AS CREATED_INDEX,
          outputs.value / 1e8 AS OUTPUT_VALUE_BTC
      FROM `bigquery-public-data.crypto_bitcoin.transactions` as transactions,
          transactions.outputs as outputs
      WHERE transactions.block_timestamp_month < '2009-02-01'  
    
    ), inputs AS (
    
      SELECT
        transactions.hash AS SPENT_CREATED_TX_HASH,
        transactions.block_number AS SPENDING_BLOCK_ID,
        transactions.block_timestamp AS SPENDING_BLOCK_TIME,
        inputs.index AS SPENT_CREATED_INDEX,
        inputs.spent_transaction_hash as SPENDING_TX_HASH,
        inputs.spent_output_index AS SPENDING_INDEX,
        inputs.value / 1e8 AS INPUT_VALUE_BTC
      FROM `bigquery-public-data.crypto_bitcoin.transactions` as transactions,
          transactions.inputs as inputs
      WHERE transactions.block_timestamp_month < '2009-02-01'
    )
    SELECT * from 
    (
      SELECT * , 
      ROW_NUMBER() OVER(PARTITION BY CREATED_BLOCK_ID, CREATED_INDEX, SPENDING_BLOCK_ID, SPENT_CREATED_INDEX, CREATED_TX_HASH, SPENT_CREATED_TX_HASH
                        ORDER BY CREATED_BLOCK_TIME DESC) as last
       from outputs o join inputs i 
    on  o.CREATED_BLOCK_ID=SPENDING_BLOCK_ID 
    order by o.CREATED_BLOCK_ID, o.CREATED_BLOCK_TIME, o.CREATED_INDEX, o.CREATED_TX_HASH  
    )
    WHERE last = 1 AND CREATED_TX_HASH = SPENT_CREATED_TX_HASH
    

    The output looks like:

    enter image description here

    Finally I would recommend you to use the CROSS JOIN query since this function has a better performance than the use of subqueries using the WITH clause.