Search code examples
postgresqldistinctamazon-redshiftdistinct-onpostgresql-8.0

Redshift PostgreSQL Distinct ON Operator


I have a data set that I want to parse for to see multi-touch attribution. The data set is made up by leads who responded to a marketing campaign and their marketing source.

Each lead can respond to multiple campaigns and I want to get their first marketing source and their last marketing source in the same table.

I was thinking I could create two tables and use a select statement from both. The first table would attempt to create a table with the most recent marketing source from every person (using email as their unique ID).

create table temp.multitouch1 as (
select distinct on (email) email, date, market_source as last_source 
from sf.campaignmember
where date >= '1/1/2016' ORDER BY DATE DESC);

Then I would create a table with deduped emails but this time for the first source.

create table temp.multitouch2 as (
select distinct on (email) email, date, market_source as first_source 
from sf.campaignmember
where date >= '1/1/2016' ORDER BY DATE ASC);

Finally I wanted to simply select the email and join the first and last market sources to it each in their own column.

select a.email, a.last_source, b.first_source, a.date 
from temp.multitouch1 a
left join temp.multitouch b on b.email = a.email

Since distinct on doesn't work on redshift's postgresql version I was hoping someone had an idea to solve this issue in another way.

EDIT 2/22: For more context I'm dealing with people and campaigns they've responded to. Each record is a "campaign response" and every person can have more than one campaign response with multiple sources. I'm trying make a select statement which would dedupe by person and then have columns for the first campaign/marketing source they've responded to and the last campaign/marketing source they've responded to respectively.

EDIT 2/24: Ideal output is a table with 4 columns: email, last_source, first_source, date.

The first and last source columns would be the same for people with only 1 campaign member record and different for everyone who has more than 1 campaign member record.


Solution

  • I believe you could use row_number() inside case expressions like this:

    SELECT
          email
        , MIN(first_source) AS first_source
        , MIN(date) first_date
        , MAX(last_source) AS last_source
        , MAX(date) AS last_date
    FROM (
          SELECT
                email
              , date
              , CASE
                      WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date ASC) = 1 THEN market_source
                      ELSE NULL
                END AS first_source
              , CASE
                      WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date DESC) = 1 THEN market_source
                      ELSE NULL
                END AS last_source
          FROM sf.campaignmember
          WHERE date >= '2016-01-01'
          ) s
    WHERE first_source IS NOT NULL
          OR last_source IS NOT NULL
    GROUP BY
          email
    

    tested here: SQL Fiddle

    PostgreSQL 9.3 Schema Setup:

    CREATE TABLE campaignmember
        (email varchar(3), date timestamp, market_source varchar(1))
    ;
    
    INSERT INTO campaignmember
        (email, date, market_source)
    VALUES
        ('a@a', '2016-01-02 00:00:00', 'x'),
        ('a@a', '2016-01-03 00:00:00', 'y'),
        ('a@a', '2016-01-04 00:00:00', 'z'),
        ('b@b', '2016-01-02 00:00:00', 'x')
    ;
    

    Query 1:

    SELECT
          email
        , MIN(first_source) AS first_source
        , MIN(date) first_date
        , MAX(last_source) AS last_source
        , MAX(date) AS last_date
    FROM (
          SELECT
                email
              , date
              , CASE
                      WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date ASC) = 1 THEN market_source
                      ELSE NULL
                END AS first_source
              , CASE
                      WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date DESC) = 1 THEN market_source
                      ELSE NULL
                END AS last_source
          FROM campaignmember
          WHERE date >= '2016-01-01'
          ) s
    WHERE first_source IS NOT NULL
          OR last_source IS NOT NULL
    GROUP BY
          email
    

    Results:

    | email | first_source |                first_date | last_source |                 last_date |
    |-------|--------------|---------------------------|-------------|---------------------------|
    |   a@a |            x | January, 02 2016 00:00:00 |           z | January, 04 2016 00:00:00 |
    |   b@b |            x | January, 02 2016 00:00:00 |           x | January, 02 2016 00:00:00 |
    

    & a small extension to the request, count the number of contact points.

    SELECT
          email
        , MIN(first_source) AS first_source
        , MIN(date) first_date
        , MAX(last_source) AS last_source
        , MAX(date) AS last_date
        , MAX(numof) AS Numberof_Contacts 
    FROM (
          SELECT
                email
              , date
              , CASE
                      WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date ASC) = 1 THEN market_source
                      ELSE NULL
                END AS first_source
              , CASE
                      WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date DESC) = 1 THEN market_source
                      ELSE NULL
                END AS last_source
              , COUNT(*) OVER (PARTITION BY email) as numof
          FROM campaignmember
          WHERE date >= '2016-01-01'
          ) s
    WHERE first_source IS NOT NULL
          OR last_source IS NOT NULL
    GROUP BY
          email