Search code examples
sqldatesas

Create table that splits records on specific dates using SAS


I wish to take table A and create something like table B, but based on an arbitrary set of split dates contained in table C.

For example, (note it is not always true that start_date = inception_date, and so inception_date must be preserved rather than derived from start_date; this actually represents hundreds of fields that belong with the period)

enter image description here

I'm working in SAS but I'd like to be able to write this using PROC SQL. I think one way to do this would be to create multiple tables for pairs of records from table C (including nulls at the end), and then union them together.

Pseudo-code example:

for each record of table_c, concoct the pairs { (., 01-Jan-2012), (01-Jan-2012, 01-Jul-2012), (01-Jul-2012, 01-Jan-2013), (01-Jan-2013, .) }

The following query may require some null testing around split_date1 and split_date2:

CREATE TABLE subquery1 AS
SELECT 
    a.customer_id
    ,max(a.start_date, x.split_date1) AS start_date
    ,min(a.end_date, x.split_date2 - 1) AS end_date
    ,a.inception_date
FROM table_a AS a
JOIN split_date AS x
;
.... (do for each pair of split dates, and then union all these tables together with some WHERE querying to throw away the nonsensical rows) to produce table_b. The image above indicates which subquery would generate which rows in table_b

Please help me fill in the gaps, or suggest an alternative method.

table_a:

customer_id start_date  end_date    inception_date      
aaa 18-Jun-11   17-Jun-12   18-Jun-11       
aaa 18-Jun-12   17-Jun-13   18-Jun-12       
bbb 13-Jul-11   12-Jul-12   13-Jul-11       
ccc 14-May-11   13-Nov-11   14-Jul-11       
ddd 21-Jun-11   20-Jun-12   21-Jun-11

table_b:

customer_id start_date  end_date    inception_date      subquery
aaa 18-Jun-11   31-Dec-11   18-Jun-11       (1)
aaa 01-Jan-12   17-Jun-12   18-Jun-11       (2)
aaa 18-Jun-12   30-Jun-12   18-Jun-12       (2)
aaa 01-Jul-12   31-Dec-12   18-Jun-12       (3)
aaa 01-Jan-13   17-Jun-13   18-Jun-12       (4)
bbb 13-Jul-11   31-Dec-11   13-Jul-11       (1)
bbb 01-Jan-12   30-Jun-12   13-Jul-11       (2)
bbb 01-Jul-12   12-Jul-12   13-Jul-11       (3)
ccc 14-May-11   13-Nov-11   14-May-11       (1)
ddd 21-Jun-11   31-Dec-11   21-Jun-11       (1)
ddd 01-Jan-12   20-Jun-12   21-Jun-11       (2)

table_c:

split_dates                 
01-Jan-12                   
01-Jul-12                   
01-Jan-13   

Solution

  • Here's a hybrid SQL/datastep approach - but it is shorter! Input the data (taken from the answer given by @Joe):-

    data table_a;
      informat start_date end_date date9.;
      format start_date end_date date9.;
      input customer_id $ start_date end_date;
      datalines;
      aaa 18JUN2011 17JUN2012
      aaa 18JUN2012 17JUN2013
      bbb 13JUL2011 12JUL2012
      ccc 14MAY2011 13NOV2011
      ddd 21JUN2011 20JUN2012
      ;;;;
    run;
    
    data table_c;
      informat split_dates date9.;
      format split_dates date9.;
      input split_dates;
      datalines;
      01JAN2012
      01JUL2012
      01JAN2013
      ;;;;
    run;
    

    The following copies the split dates to a macro variable (SQL!) and then loops through table_a using this macro (datastep!):-

    **  Output the split dates to a macro variable;
    proc sql noprint;
      select split_dates format=8. into: c_dates separated by ',' from table_c order by split_dates;
    quit;
    
    **  For each period in table_a, look to see if each split date is within it,;
    **  outputting a row if so;
    data final_out(drop=dt old_end_date);
      set table_a(rename=(end_date = old_end_date));
    
      format start_date end_date inception_date date11.;
      inception_date = start_date;
    
      do dt = &c_dates;
    
        if start_date <= dt <= old_end_date then do;
          end_date = dt - 1;
          output;
          start_date = dt;
        end;
    
      end;
    
      **  For the last row per table_a entry;
      end_date = old_end_date;
      output;
    run;
    

    And if you know the split dates beforehand, you could hard code them into the datastep and omit the SQL bit (not recommended mind - hard coding is seldom a good idea).