I have a table of transactions ('transactions_2020') that include an email address, transaction details, date, etc. These transactions include address and other PII information.
Multiple transactions per email address are common in the table. I want to create a table of unique email addresses ('individuals') and keep all related PII information. For those cases where there are multiple transactions per email address
I'd like to keep the values for the columns associated with the most recent transaction, but only if those fields are not null. Resulting in a consolidated row in my 'individuals' table that has the best/most recent information, even if that information comes from different transactions. Simple example below (blanks are null):
transactions table
email_address trans_date address1 address2 birthdate
email1@none.com 2020-10-01 2000-01-01
email1@none.com 2020-09-01 Box 123
email1@none.com 2020-08-01 123 Main
email2@none.com 2020-12-01 456 Elm 2000-03-01
email2@none.com 2020-07-01 123 Elm 2000-02-01
email3@none.com 2020-11-01 123 Maple 2000-05-01
email3@none.com 2020-09-01 123 Maple Box 123
individual table
email_address address1 address2 birthdate
email1@none.com 123 Main Box 123 2000-01-01
email2@none.com 456 Elm 2000-03-01
email3@none.com 123 Maple Box 123 2000-05-01
You want the latest non-null
value for both address columns. Here is an approach using window functions:
select email_address,
max(case when trans_date = trans_date_address1 then address1 end) as address1,
max(case when trans_date = trans_date_address2 then address2 end) as address2,
max(birthdate) as birthdate
from (
select t.*,
max(case when address1 is not null then trans_date end) over(partition by email_address) as trans_date_address1,
max(case when address1 is not null then trans_date end) over(partition by email_address) as trans_date_address2
from mytable t
) t
group by email_address
The subquery returns the latest date for which each address is not null
. We can then use that information to aggregate in the outer query.
This requires MySQL 8.0. In earlier versions, I would go for a couple of subqueries:
select email_address,
(
select t1.address1
from mytable t1
where t1.email_address = t.email_address and t1.address1 is not null
order by trans_date desc limit 1
) as address1,
(
select t1.address2
from mytable t1
where t1.email_address = t.email_address and t1.address2 is not null
order by trans_date desc limit 1
) as address2,
max(birthdate) as birthdate
from mytable t
group by email_address