Search code examples
mysqlsqldatetimegreatest-n-per-groupsql-null

Consolidating a dataset by single field (mysql)


I have a table of transactions ('transactions_2020') that include an email address, transaction details, date, etc. These transactions include address and other PII information.

Multiple transactions per email address are common in the table. I want to create a table of unique email addresses ('individuals') and keep all related PII information. For those cases where there are multiple transactions per email address

I'd like to keep the values for the columns associated with the most recent transaction, but only if those fields are not null. Resulting in a consolidated row in my 'individuals' table that has the best/most recent information, even if that information comes from different transactions. Simple example below (blanks are null):

transactions table

email_address   trans_date  address1    address2    birthdate
email1@none.com 2020-10-01                          2000-01-01
email1@none.com 2020-09-01              Box 123 
email1@none.com 2020-08-01  123 Main        
email2@none.com 2020-12-01  456 Elm                 2000-03-01
email2@none.com 2020-07-01  123 Elm                 2000-02-01
email3@none.com 2020-11-01  123 Maple               2000-05-01
email3@none.com 2020-09-01  123 Maple   Box 123 
            

individual table

email_address   address1    address2    birthdate   
email1@none.com 123 Main    Box 123     2000-01-01  
email2@none.com 456 Elm                 2000-03-01  
email3@none.com 123 Maple   Box 123     2000-05-01  

Solution

  • You want the latest non-null value for both address columns. Here is an approach using window functions:

    select email_address,
        max(case when trans_date = trans_date_address1 then address1 end) as address1,
        max(case when trans_date = trans_date_address2 then address2 end) as address2,
        max(birthdate) as birthdate
    from (
        select t.*,
            max(case when address1 is not null then trans_date end) over(partition by email_address) as trans_date_address1,
            max(case when address1 is not null then trans_date end) over(partition by email_address) as trans_date_address2
        from mytable t
    ) t
    group by email_address
    

    The subquery returns the latest date for which each address is not null. We can then use that information to aggregate in the outer query.

    This requires MySQL 8.0. In earlier versions, I would go for a couple of subqueries:

    select email_address,
        (
            select t1.address1 
            from mytable t1
            where t1.email_address = t.email_address and t1.address1 is not null 
            order by trans_date desc limit 1
        ) as address1,
        (
            select t1.address2
            from mytable t1
            where t1.email_address = t.email_address and t1.address2 is not null 
            order by trans_date desc limit 1
        ) as address2,
        max(birthdate) as birthdate
    from mytable t
    group by email_address