Search code examples
hashsasretain

Partially retaining value and conditionally replacing observation's value


The data-set looks like this:

Server  IP          Indicator Session_ID   Time
2      1.20.54.221              A           00:00:01
2      1.20.54.221              A           00:01:00
1      1.20.54.221  Site        A           00:02:00
1      1.20.54.221              B           00:05:00
2      1.20.54.221  Site        B           00:08:00
2      1.20.54.221              C           00:10:00
2      1.20.54.221              C           00:15:00
1      1.20.54.221              F           01:00:00
1      1.20.54.221              F           01:05:00
2      1.20.54.221  Site        F           01:08:00

The above data set is read from a log file. Session_id will change when the server changes ( i.e. 1 -> 2 or 2 -> 1 ). There are cases where the server changes right after a change occurred (i.e. 1---->2----->1). Whenever a server is changed, the server would record the last Session_id and return a new Session_id at its second browse of the same server. ( e.g. 3rd observation : Session_id still A, and changed to B for 4th observation) . If the server changes in the way (i.e. 1---->2----->1---->1), it will return A-----> A----->B---> C, in which B is generated by server 2 and C is generated by the second 1

My objective is to determine whether there is subgroup(s) among records. The rule is :

Given the same IP, If the time difference between the current record and last record does not exceed 30 minutes, then the record belongs to the same user.

I have an indicator variable site to determine whether the site has changed. It is marked based on whether the change of server happens and the time difference is less than 30 mins.

Desired data set:

Server  IP          Indicator Session_ID   Time           Difference  Last_site 
2      1.20.54.221              A           00:00:01       .            .
2      1.20.54.221              A           00:01:00       59s          .
1      1.20.54.221  Site        A           00:02:00       1 Min        .
1      1.20.54.221              A           00:05:00       3 Min        Site
2      1.20.54.221  Site        A           00:08:00       3 Min        Site
2      1.20.54.221              A           00:10:00       2 Min        Site
2      1.20.54.221              A           00:15:00       5 Min        Site
1      1.20.54.221              F           01:00:00       45min         . 
1      1.20.54.221              F           01:05:00       5 Min         .
2      1.20.54.221  Site        F           01:08:00       3 Min         .

My implementation:

data log_file;
set log;
retain _Session_id Last_site;,

* Assign value to retain ; 
if indicator = "Site" then _Session_id = Session_id;

* if last_site = Site, its value has to be changed;
last_site=lag(site); 

* Record that should be in another group ;
if difference >30 then Last_Site = "";  

* Replace;
if last_site  not eq "" then session_id = _session_id 

run;

And the problem is, the retained variable will change from A to B at the fifth observation, while I would like to have it stick to the value A until I found a record with time difference greater than 30 mins. ( the process would loop over hundred thousands of IP and therefore efficiency is also concerned.)

Is there a possible way to handle the problem elegantly? Thank you in advance.

[ Edited on 22/6]

I am thinking the possibility of using Hash Object to do the job. I wrote some codes (obviously they does not work and may incur syntax error ).

data test11a;
length mark $ 12 ip $ 32 Session_id $ 200 agent last_agent $ 200; <== the system said there is error with the type of variable and therefore I add this
Declare hash hid;   
hid = _new_ hash( ordered : ' ascending');
hid.definekey('ip');
hid.definedata('Session_id');
hid.definedone();
call missing ( ip, Session_id);
   do while ( not done);
    set test11 end=done; <==== I have the original data-set called test11
    by ip notsorted ; <==== hash object by group;
    if not first.ip or not last.ip then do;
       if mark = "Site" then rc=  hid.add();
      *if mark = "Next_Group" then hid.remove(key : ip); <=== Error
    end;
    if mark not eq "Site" or "Next_Group" then do;
      rc=hid.find();  <==== Find matching iP and if matching ( rc=0)
      * use variable _ session_id to test;
      if rc = 0 then _session_id = Session_id;
    end;
  end;
run;

And the output data set has only two observations. Anyone could help to explain this?


Solution

  • This is a partial answer, as I could not figure out how you construct 'Last_site'. Seems to me what you want is to check whether the difference is over 30 mins/1800 sec, if 'no', the session_id stays the same, if 'yes', it adopts a new session_id. I may have oversimplified your problem, but the outcome seems close:

    data have;
        input (Server  IP          Indicator Session_ID) (:$20.)   Time :time8.;
        format time time8.;
        cards;
    2      1.20.54.221  .            A           00:00:01
    2      1.20.54.221   .           A           00:01:00
    1      1.20.54.221  Site        A           00:02:00
    1      1.20.54.221    .          B           00:05:00
    2      1.20.54.221  Site        B           00:08:00
    2      1.20.54.221     .         C           00:10:00
    2      1.20.54.221      .        C           00:15:00
    1      1.20.54.221       .       F           01:00:00
    1      1.20.54.221        .      F           01:05:00
    2      1.20.54.221  Site        F           01:08:00
    ;
    run;
    
    data want;
        set have;
        by ip notsorted;
        retain _session ' ';
    
        if first.ip then
            _session=session_id;
        difference=dif(time);
    
        if  difference > 1800 then
            _session=session_id;
        drop session_id;
        rename _session=session_id;
    run;