Search code examples
postgresqlselectpostgresql-13

How can I merge rows of a table that records history of changes in order to obtain what a row looked like at a specific moment?


I have the following 2 SELECTs:

SELECT * FROM public.app_user WHERE id = 'e31b55bf';
+--------+----+-----------+-----+-----+--------------------------+
|id      |name|email      |role |bio  |created_at                |
+--------+----+-----------+-----+-----+--------------------------+
|e31b55bf|Jon |jon@app.com|admin|Hello|2022-01-01 00:00:00.000000|
+--------+----+-----------+-----+-----+--------------------------+

SELECT * FROM history.app_user WHERE id = 'e31b55bf';
+--------+----+--------------+--------+----+--------------------------+
|id      |name|email         |role    |bio |updated_at                |
+--------+----+--------------+--------+----+--------------------------+
|e31b55bf|ASDF|test          |NULL    |NULL|2022-01-02 00:00:00.000000|
|e31b55bf|Test|test@gmail.com|basic   |NULL|2022-01-03 00:00:00.000000|
|e31b55bf|NULL|NULL          |standard|asdf|2022-01-04 00:00:00.000000|
|e31b55bf|NULL|NULL          |mod     |NULL|2022-01-05 00:00:00.000000|
+--------+----+--------------+--------+----+--------------------------+

public.app_user contains the users of my app and history.app_user contains a record of previous values of the rows of the first. In the example above, the user e31b55bf was a mod instead of an admin before the 5th of January, an standard user with the bio "asdf" before the 4th, a basic user named "Test" with the email "test@gmail.com" before the 3rd...

I want to come up with a SELECT or a FUNCTION that will tell me what the row looked like at a specific point in time. I believe I have already accomplished it, but my solution looks more complex than it should be. It's also tedious to translate it to other tables: say public.project and history.project, for example, where the columns are totally different. I believe a cleaner, easier to read and write solution exists. Can a SQL wizard help me here?

My current solution consists of overriding the values of the current row at public.app_user with the non-null values of this:

SELECT t.uuid,
       t1.name,
       t2.email,
       t3.role,
       t4.bio
FROM (
     SELECT uuid,
            MIN(CASE WHEN name IS NOT NULL THEN updated_at END)       AS name_date,
            MIN(CASE WHEN email IS NOT NULL THEN updated_at END)      AS email_date,
            MIN(CASE WHEN role IS NOT NULL THEN updated_at END)       AS role_date,
            MIN(CASE WHEN bio IS NOT NULL THEN updated_at END)        AS bio_date
     FROM history.app_user
     WHERE updated_at > '2022-01-03 12:00:00.000000' -- Date to check
     GROUP BY uuid
     ) t
     LEFT JOIN history.app_user t1 ON t1.updated_at = t.name_date
     LEFT JOIN history.app_user t2 ON t2.updated_at = t.email_date
     LEFT JOIN history.app_user t3 ON t3.updated_at = t.role_date
     LEFT JOIN history.app_user t4 ON t4.updated_at = t.bio_date

Solution

  • Solution 1 : window function

    SELECT DISTINCT ON (uuid)
         , uuid
         , (array_agg(name) FILTER (WHERE name IS NOT NULL) OVER w)[1] AS name
         , (array_agg(email) FILTER (WHERE email IS NOT NULL) OVER w)[1] AS email
         , (array_agg(role) FILTER (WHERE role IS NOT NULL) OVER w)[1] AS role
         , (array_agg(bio) FILTER (WHERE bio IS NOT NULL) OVER w)[1] AS bio 
      FROM history.app_user
     WHERE updated_at > '2022-01-03 12:00:00.000000' -- Date to check
    WINDOW w AS (PARTITION BY uuid ORDER BY updated_at)
    

    array_agg() is an aggregate function which is used as a window function here.

    FILTER (WHERE condition) is used to exclude the NULL values from the selected rows.

    The window is a subset of rows associated to the current row as described in the manual, ie all the existing rows with the same uuid as stated in the PARTITION BY clause. The ORDER BY clause allows to put the earliest non-null value in the first position of the resulting array and which is selected by [1].

    The main issue of using here window functions is that we get as many rows as the ones filtered by the WHERE clause. The DISTINCT ON () clause excludes the redundant rows from the final result.

    The second solution based on the same function used as an aggregate function instead of a window function is more accurate in our case.

    Solution 2 : aggregate function

    SELECT uuid
         , (array_agg(name ORDER BY updated_at) FILTER (WHERE name IS NOT NULL))[1] AS name
         , (array_agg(email ORDER BY updated_at) FILTER (WHERE email IS NOT NULL))[1] AS email
         , (array_agg(role ORDER BY updated_at) FILTER (WHERE role IS NOT NULL))[1] AS role
         , (array_agg(bio ORDER BY updated_at) FILTER (WHERE bio IS NOT NULL))[1] AS bio 
      FROM history.app_user
     WHERE updated_at > '2022-01-03 12:00:00.000000' -- Date to check
     GROUP BY uuid