I have a dataset which has certain variables , now I have to compare those variables among rows to find the duplicate , but it has few extra asks .
- if final_amt matches the final_amt from next row and final_amt_added matches final_amt_added from next row we have to consider them as duplicate and delete only the case where the tran_date is blank.
- Also if we are deleting the duplicate row with the blank tran_date , the actual_amt data from that row should be retained as a new variable act_amt_extra in the undeleted row.
I have tried multiple scenarios using lag and retain but the data is just not coming correctly, any help will be appreciated.
DATA work.sample;
INPUT acct_num test_id tran_date:anydtdte. actual_amt final_amt final_amt_added ;
format tran_date date9.;
DATALINES;
55203610 2542 12-jan-20 30 45 45
16124130 8062 . 56 78 78
16124130 8062 14-dec-19 8 78 78
80479512 2062 19-mar-19 32 32 32
70321918 2062 20-dec-19 1 93 54
17312410 6712 . 45 90 90
17312410 6712 15-jun-18 0 90 90
74623123 2092 17-aug-18 34 87 87
24245321 2082 22-jan-17 22 56 67
;
run;
data that I want
data want;
input acct_num test_id tran_date:anydtdte. actual_amt final_amt final_amt_added act_amt_extra;
format tran_date date9.;
DATALINES;
55203610 2542 12-jan-20 30 45 45
16124130 8062 14-dec-19 8 78 78 56
80479512 2062 19-mar-19 32 32 32
70321918 2062 20-dec-19 1 93 54
17312410 6712 15-jun-18 0 90 90 45
74623123 2092 17-aug-18 34 87 87
24245321 2082 22-jan-17 22 56 67
;
run;
This may not be the most elegant way of handling it, but it will work so long as you only have a single duplicate as shown in your example.
The data we are most interested in is when the date is not missing. First, sort your data by acct_num test_id
and descending tran_date
:
proc sort data=sample;
by acct_num test_id descending tran_date;
run;
This ensures that the data we care about is first:
acct_num test_id tran_date actual_amt final_amt final_amt_added
16124130 8062 14DEC2019 8 78 78
16124130 8062 . 56 78 78
We want to output only at the last combination of acct_num test_id
. When there are duplicates, we want to pull forward the following values:
tran_date
actual_amt
Note that when there are duplicates and we want to pull values forward, we know the following:
last.test_id
is truefirst.test_id
is falsetran_date
is missingfinal_amt
matches the prior row's valuefinal_amt_added
matches the prior row's valueWhen that case occurs, we'll run our logic. Otherwise, we'll leave the row as-is and output.
data want;
set sample;
by acct_num test_id descending tran_date;
/* Store prior values */
lag_actual_amt = lag(actual_amt);
lag_final_amt_added = lag(final_amt_added);
lag_final_amt = lag(final_amt);
lag_tran_date = lag(tran_date);
/* Bring forward data if conditions are met */
if( missing(tran_date)
AND last.test_id
AND NOT first.test_id
AND final_amt = lag_final_amt
AND final_amt_added = lag_final_amt_added
)
then do;
act_amt_extra = actual_amt;
/* Bring forward the prior values */
tran_date = lag_tran_date;
actual_amt = lag_actual_amt;
end;
/* Only output for the last combination of acct_num, test_id */
if(last.test_id);
drop lag:;
run;