Search code examples
sasenterprise-guide

All the records within an hour of each other


I have a data set that has ID, datetime + bunch of value fields.

The idea is that the records are within one hour of each other are one session. There can only be one session every 24 hours. (Time is measured from the start of the first record)

The day() approach does not work as one record can be 23:55 PM and the next one could be 12:01 AM the next day and it would be the same session.

I've added rowid and ran the following:

data testing;
set testing;
by subscriber_no;
prev_dt = lag(record_ts);
prev_row = lag(rowid);
time_from_last = intck("Second",record_ts,prev_dt);
if intck("Second",record_ts,prev_dt) >  -60*60 and intck("Second",record_ts,prev_dt) < 0 then
    same_session = 'yes';
else same_session = 'no';
if intck("Second",record_ts,prev_dt) >  -60*60 and intck("Second",record_ts,prev_dt) < 0 then
    rowid  =  prev_row;
else rowid = rowid;
format prev_dt datetime19.;
output;
run;

Input

ID  record_TS   rowid
52  17MAY2017:06:24:28  4
52  17MAY2017:07:16:12  5
91  05APR2017:07:04:55  6
91  05APR2017:07:23:37  7
91  05APR2017:08:04:52  8
91  05MAY2017:08:56:23  9

input file is sorted by ID and record TS.

The output was

ID  record_TS   rowid   prev_dt prev_row    time_from_last  same_session    
52  17MAY2017:06:24:28  4   28APR2017:08:51:25  3   -1632783    no  
52  17MAY2017:07:16:12  4   17MAY2017:06:24:28  4   -3104   yes 
91  05APR2017:07:04:55  6   17MAY2017:07:16:12  5   3629477 no  
91  05APR2017:07:23:37  6   05APR2017:07:04:55  6   -1122   yes 
91  05APR2017:08:04:52  7   05APR2017:07:23:37  7   -2475   yes This needs to be 6
91  05MAY2017:08:56:23  9   05APR2017:08:04:52  8   -2595091    no  

Second row from the bottom - rowid comes out 7, while I need it to come be 6.

Basically I need to change to the current rowid saved before the script moves to assess the next one.

Thank you Ben

I've achieved what I needed with

proc sql;
    create table testing2 as
        select distinct t1.*, min(t2.record_TS) format datetime19. as from_time, max(t2.record_TS) format datetime19. as to_time
        from testing t1
        join testing t2 on t1.id_val= t2.id_val
        and intck("Second",t1.record_ts,t2.record_ts) between -3600 and 3600
    group by t1.id_val, t1.record_ts
order by t1.id_val, t1.record_ts
;

quit;

But I'm still wondering if there is a way to commit changes to current row before moving to assess the next row.


Solution

  • I think your logic is just:

    1. Grab record_TS datetime of the first record for each ID
    2. For subsequent records, if their record_TS is within an hour of the first record's, recode it to be the same rowID as first record.

    If that's the case, you can use RETAIN to keep track of the first record_TS and rowID for each ID. This should be easier than lag(), and allows there to be multiple records in a single session. Below seems to work:

    data have;
      input ID  record_TS  datetime. rowid;
      format record_TS datetime.;
      cards;
    52 17MAY2017:06:24:28 4
    52 17MAY2017:07:16:12 5
    91 05APR2017:07:04:55 6
    91 05APR2017:07:23:37 7
    91 05APR2017:08:04:52 8
    91 05MAY2017:08:56:23 9
    ;
    run;
    
    data want;
      set have;
      by ID Record_TS;
      retain SessionStart SessionRowID;
      if first.ID then do;
        SessionStart=Record_TS;
        SessionRowID=RowID;
      end;
      else if (record_TS-SessionStart)<(60*60) then RowID=SessionRowID;
      drop SessionStart SessionRowID;
    run;
    

    Outputs:

    ID       record_TS        rowid
    
    52    17MAY17:06:24:28      4
    52    17MAY17:07:16:12      4
    91    05APR17:07:04:55      6
    91    05APR17:07:23:37      6
    91    05APR17:08:04:52      6
    91    05MAY17:08:56:23      9