Search code examples
foreachstatamissing-data

Previous observations


I need your help with the following problem. My data is

ack            det                     uniquenr

02mar2015 10:30:19  27feb2015 17:03:25 1

09mar2015 11:56:44  07mar2015 22:05:52 2

16mar2015 09:59:51  09mar2015 11:08:36 2

16mar2015 09:59:51  09mar2015 11:08:36 2

16mar2015 09:59:51  09mar2015 11:08:36 2

16mar2015 09:59:51  09mar2015 11:08:36 2

24feb2015 15:03:00  21feb2015 10:23:00 3

ack and det are datetimes. det must be matched to the earliest possible following ack with the same unique number. As appears in the example above, det of 09mar2015 is wrong. I would like the result to be:

ack            det                     uniquenr

02mar2015 10:30:19  27feb2015 17:03:25 1

09mar2015 11:56:44  07mar2015 22:05:52 2

09mar2015 11:56:44  09mar2015 11:08:36 2

09mar2015 11:56:44  09mar2015 11:08:36 2

09mar2015 11:56:44  09mar2015 11:08:36 2

09mar2015 11:56:44  09mar2015 11:08:36 2

24feb2015 15:03:00  21feb2015 10:23:00 3

I have tried the following code:

sort uniquenr det
bysort uniquenr (det): gen v1= (ack[_n-1]-det)/(1000*60*60)
gen v2= (ack-det)/(1000*60*60)
gen  bedrenabo=1 if v1<v2
order bedrenabo v1 v2
replace bedrenabo=0 if v1==v2
replace bedrenabo=0 if v1<0
replace bedrenabo=0 if v1>v2
gen double newack1=ack[_n-1] if bedrenabo==1
format newack %tc
order Handling newack1 ack det tra
sort uniquenr det
replace ack=newack1 if newack1!=. 

and my result is

    ack            det                     uniquenr newack 

02mar2015 10:30:19  27feb2015 17:03:25 1

09mar2015 11:56:44  07mar2015 22:05:52 2

16mar2015 09:59:51  09mar2015 11:08:36 2 09mar2015 11:56:44

16mar2015 09:59:51  09mar2015 11:08:36 2

16mar2015 09:59:51  09mar2015 11:08:36 2

16mar2015 09:59:51  09mar2015 11:08:36 2

24feb2015 15:03:00  21feb2015 10:23:00 3

The problem seems to arise because there are duplicates in the data and I use [_n-1]. The optimal solution would be to use a replace command which for all observations with the same det and uniquenr replaces all missing values with the known newack - if there is a newack. The command bysort uniquenr det : replace nyack1 = newack1[_n-1] if missing(newack1) gives different amounts of change for each time.

I am not yet familiar with foreach loops, but if that's the way, I am open to it.


Solution

  • Ok, ok, assuming that I understand the question, here is one possible solution.

    Please note, that I've modified you original example to make it more general (otherwise a simple egen will do, but it seems to me that is not what you want).

    // Input the data
    // @Msh, generally it is helpful if you provide somthing like this with your Q.
    clear
    input str18 ackstr str18 detstr uniquenr
    "02mar2015 10:30:19" "27feb2015 17:03:25" 1
    "09mar2015 11:56:44" "07mar2015 22:05:52" 2
    "16mar2015 09:59:51" "09mar2015 11:08:36" 2
    "16mar2015 09:59:51" "09mar2015 11:08:36" 2
    "16mar2015 09:59:51" "09mar2015 11:08:36" 2
    "16mar2015 09:59:51" "09mar2015 11:08:36" 2
    "24feb2015 15:03:00" "21feb2015 10:23:00" 3
    end
    
    gen double ack = clock(ackstr, "DMYhms")
    gen double det = clock(detstr, "DMYhms")
    format ack det %tc
    drop *str
    
    // I will modify the original data to make it more general
    replace ack = tc(08mar2015 00:00:00) in 5
    
    // Add row identifiers
    gen id = _n
    
    // Create all possible combinations of ack and det withing the same uniquenr
    preserve
        keep uniquenr ack
        rename ack new=
        tempfile ack
        save `ack'
    restore
    joinby uniquenr using `ack'
    
    // Drop ack that are smaller than det
    drop if newack < det
    
    // Match the smallest ack to a given det
    ds newack id, not
    collapse (min) newack (last) `r(varlist)', by(id)
    

    You can also do it with a loop, e.g. in this manner:

    // Add row numbers
    gen id = _n
    
    // Prepare newack
    gen double newack = .
    format newack %tc
    
    // Ensure the data is sorted
    sort uniquenr ack
    
    // Compute block boundaries
    gen row = _n
    by uniquenr: gen min = row[1]
    by uniquenr: gen max = row[_N]
    
    // Compute new ack
    forvalues i = 1/`c(N)' {
        forvalues j = `=min[`i']'/`=max[`i']' {
            if `=ack[`j']' >= `=det[`i']' {
                replace newack = ack[`j'] in `i'
                continue, break
            }
        }
    }
    
    // Revert the sorting
    sort id
    

    If you have a large dataset, it might make sense to rewrite the loop in mata, which should be faster.