Search code examples
sqlpartitionself-join

SQL - check if a date is within 14d in another column and return the most recent date


I have a data set that has the following columns: user_id, A_view_dt, A_conversion_dt B_view_dt.

I wanted to check If A_conversion_dt IS NOT NULL, then see if there's another record in B_view_dt that belongs to the same user_id happened 14d prior to A_conversion_dt, if so return the B_view_dt value. Otherwise return NULL under B_view_dt.

IF A_conversion_dt IS NULL, then I'd like to compare A_view_dt and B_view_dt. If a B_view_dt record exists and happens 14d before then keep the most recent max(B_view_dt).

IF B_view_dt IS NULL then I'd like B_view_dt to return NULL.

I have attached a dummy dataset & desired output. I've been trying using UNION ALL and LEFT JOINs but there's always records being missing from the WHERE clause. Anybody has a better solve? Does using partition in this case help?

Thank you!!!

See attachment here


Solution

  • I doubt there is an efficient way to run this without some preprocessing especially for large dataset.

    Basically you need 2 virtual datasets.

    • Unique user_id and choice of conversion/view date. Let's call this A_dt. In my result, I added cutoff to help with some debugging and make query simpler. Call this a
    • Unique user_id and B_View_dt. Call this b

    Select all data from a then find max B_View_dt within range of cutoff and A_dt.

    Using t-sql, I use CTE to help create virtual tables. See link to test solution - http://sqlfiddle.com/#!18/4db34/5

      ;with a (user_id, A_dt, cutoff) as 
        (   
          select
            user_id,
            isnull(A_Conversion_dt,
            A_view_dt) A_dt,
            dateadd(d,-14, isnull(A_Conversion_dt,A_view_dt)) cutoff   
        from
            t   
        group by
            user_id,
            isnull(A_Conversion_dt,
            A_view_dt) 
        ), 
        b (user_id, B_dt) as 
        (
          select
            user_id,
            B_View_dt B_dt   
        from
            t     
        group by
            user_id,
            B_View_dt 
        ) 
        
        select
            a.*,
            MaxB_dt 
        from
            a 
        outer apply 
        (   
          select
            max(B_dt) MaxB_dt   
          from
                b    
          where
            a.user_id = b.user_id     
            and B_dt between cutoff and A_dt  
        ) ob
    

    Sample Data

    enter image description here

    Result from query

    enter image description here