Search code examples
replacestatastata-macros

Copy values from one observation to another based on unique identifier


I have a dataset with many variables. For a number of observations I want to replace their values (in a large number of variables) with the values of another observation (for the same variables) based on a unique identifier. The unique identifier is not equal to the Stata observation numbers (_n).

This is similar to several other threads, but as far as I can tell, their solutions are not fully transferrable

  1. transfer values from one variable to another in Stata But I have no simple lexical relationship that I could use
  2. https://www.stata.com/statalist/archive/2013-06/msg00056.html But I can't rely on the Stata observation numbers, because the dataset is dynamic and the order and number of observations may change.

Here is a piece of code based on the solution from the second link but adjusted to my problem (I changed the content of the square bracket in the last macro). I am aware that this code does not work because the square bracket here could only contain a Stata obs number (_n). But I think this should illustrate well, what I am trying to achieve:

In this example, I would like to replace the values of the observations whose unique_id is 25 and 38 with the values of the observation whose unique_id is 21. This should be done for all variables in my local varlist.

set obs 50

local vlist v1 v2 v3 v4 v5

foreach v of local vlist { 
    generate `v' = runiform()
}

gen unique_id=_n+20  // this is just to illustrate that the unique_id is not equal to _n


foreach var of local vlist {
replace `var' =`var'[unique_id==21] if unique_id==25 | unique_id==38
}

Solution

  • Here is one way to do it:

    foreach var of local vlist {
        su `var' if unique_id == 21, meanonly 
        replace `var' = r(mean) if unique_id==25 | unique_id==38
    }
    

    And here is another way to do it:

    gen reference = unique_id == 21 
    sort reference 
    
    foreach var of local vlist {
        replace `var' = `var'[_N] if unique_id==25 | unique_id==38
    }
    

    Of course if you looked at the data and worked out that at the moment identifier 21 is in observation 42, say, then using that as subscript that is a direct answer, but not good or robust style.

    See also this paper for a review of some techniques in this territory.

    I'd sympathise with anyone who thought that Stata style looks awkward for this problem. Also, this kind of question seems terribly ad hoc, but then problems like this are real too.