Search code examples
sqlgoogle-bigquerybigquery-udf

How to check if two table include the same information in Big Query?


Assume we have a table table_a with two string and one int fields:

CREATE TABLE `YOUR_DATABASE.table_a` (
  `A1` STRING,
  `A2` INT64,
  `A3` STRING,
);

table_a includes some data:

INSERT `YOUR_DATABASE.table_a` (A1, A2, A3)
VALUES("X111", 10, "2022-02-07 08:19:00"),
      ("Y222", 15, "2017-05-08 10:19:00"),
      ("X111", 0,"2011-08-02 02:30:00"),
      ("X111", 12, "2019-12-20 11:25:00"),
      ("Y222", 258, "2001-10-14 07:00:00"),
      ("Z333", 4, "2015-03-08 03:25:12")

We have another table, table_b, which is kind of identical to table_a. In fact, table_b has the exact same fields and values as in table_a and it also has an extra DATETIME field. This extra DATETIME field, B4, is essentially same as A3 but converted to a standard DATETIME format.

CREATE TABLE `YOUR_DATABASE.table_b` (
  `B1` STRING,
  `B2` INT64,
  `B3` STRING,
  `B4` DATETIME,
);

And the values in table_b are:

INSERT `YOUR_DATABASE.table_b` (B1, B2, B3, B4)
VALUES("X111", 10, "2022-02-07 08:19:00", DATETIME(PARSE_TIMESTAMP("%Y-%m-%y %H:%M:%S","2022-02-07 08:19:00"),"America/Los_Angeles")),
      ("Y222", 15, "2017-05-08 10:19:00", DATETIME(PARSE_TIMESTAMP("%Y-%m-%y %H:%M:%S","2017-05-08 10:19:00"),"America/Los_Angeles")),
      ("X111", 0, "2011-08-02 02:30:00", DATETIME(PARSE_TIMESTAMP("%Y-%m-%y %H:%M:%S","2011-08-02 02:30:00"),"America/Los_Angeles")),
      ("X111", 12, "2019-12-20 11:25:00", DATETIME(PARSE_TIMESTAMP("%Y-%m-%y %H:%M:%S","2019-12-20 11:25:00"),"America/Los_Angeles")),
      ("Y222", 258, "2001-10-14 07:00:00", DATETIME(PARSE_TIMESTAMP("%Y-%m-%y %H:%M:%S","2001-10-14 07:00:00"),"America/Los_Angeles")),
      ("Z333", 4, "2015-03-08 03:25:12", DATETIME(PARSE_TIMESTAMP("%Y-%m-%y %H:%M:%S","2015-03-08 03:25:12"),"America/Los_Angeles"))

Note, table_a and table_b may or may not have a primary key. How can I check if table_a and table_b include the same information? I have a couple of table_a in my database and I have created equivalent table_b for each of them; now I want to double check and make sure the tables include same information.


Solution

  • Try the following:

    SELECT
        *
    FROM table_a
    FULL OUTER JOIN table_b
     on FARM_FINGERPRINT(ARRAY_TO_STRING([A1, CAST(A2 AS STRING), A3],"")) = FARM_FINGERPRINT(ARRAY_TO_STRING([B1, CAST(B2 AS STRING), B3],""))
    WHERE A1 IS NULL 
    OR B1 IS NULL
    

    This process concatenates and hashes the row from each table then returns where there is not a match on each side.

    With the sample data provided it yields: enter image description here