I wish to take table A and create something like table B, but based on an arbitrary set of split dates contained in table C.
For example, (note it is not always true that start_date = inception_date, and so inception_date must be preserved rather than derived from start_date; this actually represents hundreds of fields that belong with the period)
I'm working in SAS but I'd like to be able to write this using PROC SQL
. I think one way to do this would be to create multiple tables for pairs of records from table C (including nulls at the end), and then union them together.
Pseudo-code example:
for each record of table_c, concoct the pairs { (., 01-Jan-2012), (01-Jan-2012, 01-Jul-2012), (01-Jul-2012, 01-Jan-2013), (01-Jan-2013, .) }
The following query may require some null testing around split_date1
and split_date2
:
CREATE TABLE subquery1 AS
SELECT
a.customer_id
,max(a.start_date, x.split_date1) AS start_date
,min(a.end_date, x.split_date2 - 1) AS end_date
,a.inception_date
FROM table_a AS a
JOIN split_date AS x
;
.... (do for each pair of split dates, and then union all these tables together with some WHERE querying to throw away the nonsensical rows) to produce table_b. The image above indicates which subquery would generate which rows in table_b
Please help me fill in the gaps, or suggest an alternative method.
table_a:
customer_id start_date end_date inception_date
aaa 18-Jun-11 17-Jun-12 18-Jun-11
aaa 18-Jun-12 17-Jun-13 18-Jun-12
bbb 13-Jul-11 12-Jul-12 13-Jul-11
ccc 14-May-11 13-Nov-11 14-Jul-11
ddd 21-Jun-11 20-Jun-12 21-Jun-11
table_b:
customer_id start_date end_date inception_date subquery
aaa 18-Jun-11 31-Dec-11 18-Jun-11 (1)
aaa 01-Jan-12 17-Jun-12 18-Jun-11 (2)
aaa 18-Jun-12 30-Jun-12 18-Jun-12 (2)
aaa 01-Jul-12 31-Dec-12 18-Jun-12 (3)
aaa 01-Jan-13 17-Jun-13 18-Jun-12 (4)
bbb 13-Jul-11 31-Dec-11 13-Jul-11 (1)
bbb 01-Jan-12 30-Jun-12 13-Jul-11 (2)
bbb 01-Jul-12 12-Jul-12 13-Jul-11 (3)
ccc 14-May-11 13-Nov-11 14-May-11 (1)
ddd 21-Jun-11 31-Dec-11 21-Jun-11 (1)
ddd 01-Jan-12 20-Jun-12 21-Jun-11 (2)
table_c:
split_dates
01-Jan-12
01-Jul-12
01-Jan-13
Here's a hybrid SQL/datastep approach - but it is shorter! Input the data (taken from the answer given by @Joe):-
data table_a;
informat start_date end_date date9.;
format start_date end_date date9.;
input customer_id $ start_date end_date;
datalines;
aaa 18JUN2011 17JUN2012
aaa 18JUN2012 17JUN2013
bbb 13JUL2011 12JUL2012
ccc 14MAY2011 13NOV2011
ddd 21JUN2011 20JUN2012
;;;;
run;
data table_c;
informat split_dates date9.;
format split_dates date9.;
input split_dates;
datalines;
01JAN2012
01JUL2012
01JAN2013
;;;;
run;
The following copies the split dates to a macro variable (SQL!) and then loops through table_a using this macro (datastep!):-
** Output the split dates to a macro variable;
proc sql noprint;
select split_dates format=8. into: c_dates separated by ',' from table_c order by split_dates;
quit;
** For each period in table_a, look to see if each split date is within it,;
** outputting a row if so;
data final_out(drop=dt old_end_date);
set table_a(rename=(end_date = old_end_date));
format start_date end_date inception_date date11.;
inception_date = start_date;
do dt = &c_dates;
if start_date <= dt <= old_end_date then do;
end_date = dt - 1;
output;
start_date = dt;
end;
end;
** For the last row per table_a entry;
end_date = old_end_date;
output;
run;
And if you know the split dates beforehand, you could hard code them into the datastep and omit the SQL bit (not recommended mind - hard coding is seldom a good idea).