Search code examples
google-cloud-platformgoogle-cloud-spanner

How to easily get most recent version of data for all keys?


I'm trying to create some tables with the following characteristics:

  • Stores the all versions of data for a particular key
  • Able to easily get the most recent versions of data for all keys

My schema is as follows:

CREATE TABLE WebSitesStatus(
  -- Host name of the site in question.
  Host String(40) NOT NULL,
  -- When the information on the site was last updated.
  UpdateDate Date NOT NULL,

  HttpsWorks BOOL NOT NULL,
  ModernTls BOOL NOT NULL,

) PRIMARY KEY(Host, UpdateDate DESC);

So an example subset of the table might be:

+--------------+------------+------------+-----------+
| Host         | UpdateDate | HttpsWorks | ModernTls |
+--------------+------------+------------+-----------+
| foobar.com   | 2016-10-14 | true       | false     |
| google.com   | 2016-10-14 | true       | true      |
| google.com   | 2016-04-23 | false      | false     |
| nytimes.com  | 2016-10-14 | false      | false     |
| nytimes.com  | 2016-05-25 | true       | true      |
| nytimes.com  | 2016-04-25 | true       | false     |
| nytimes.com  | 2016-04-23 | true       | false     |
| paypal.com   | 2016-10-14 | true       | false     | 
+--------------+------------+------------+-----------+

I'd like to find a way to get back

+--------------+------------+------------+-----------+
| Host         | UpdateDate | HttpsWorks | ModernTls |
+--------------+------------+------------+-----------+
| foobar.com   | 2016-10-14 | true       | false     |
| google.com   | 2016-10-14 | true       | true      |
| nytimes.com  | 2016-10-14 | false      | false     |
| paypal.com   | 2016-10-14 | true       | false     |
+--------------+------------+------------+-----------+

but the structure isn't allowing me to easily do this. Is there a way to do this query easily?


Solution

  • This query:

    SELECT t.Host, MAX(t.UpdateDate) AS UpdateDate
    FROM WebSitesStatus AS t
    GROUP BY t.Host
    

    returns just the primary keys of the rows you want:

    +--------------+------------+
    | Host         | UpdateDate |
    +--------------+------------+
    | foobar.com   | 2016-10-14 |
    | google.com   | 2016-10-14 |
    | nytimes.com  | 2016-10-14 |
    | paypal.com   | 2016-10-14 |
    +--------------+------------+
    

    To get the other columns you must JOIN this back to the same table, like this:

    SELECT p.Host, p.UpdateDate, q.HttpsWorks, q.ModernTls,
    FROM (SELECT t.Host, MAX(t.UpdateDate) AS UpdateDate
          FROM WebSitesStatus AS t 
          GROUP BY t.Host) AS p
    JOIN WebSitesStatus AS q
    ON p.Host = q.Host AND p.UpdateDate = q.UpdateDate;
    

    That yields this data, which is what you wanted:

    +--------------+------------+------------+-----------+
    | Host         | UpdateDate | HttpsWorks | ModernTls 
    +--------------+------------+------------+-----------+
    | amazon.co.uk | 2016-10-14 | true       | false     |
    | amazon.com   | 2016-10-14 | true       | true      |
    | nytimes.com  | 2016-10-14 | false      | false     |
    | paypal.com   | 2016-10-14 | true       | false     |
    +--------------+------------+------------+-----------+
    

    This is somewhat inefficient, because it scans your table twice: once to get the primary key values and then again to fetch all the other values. But it's probably the cleanest approach.