Search code examples
searchstata

Finding an observation's value given values for other variables


A simplified version of my data would look like:

Own_Name    Own_Position    Year    Boss_Name    Boss_Position
John        Director        2017    Tess         Managing Director
Tess        Lead Director   2017    Jim          CEO
John        Lead Director   2018    Jim          CEO
Tess        CFO             2018    Jim          CEO

All data concerning Own_Name, Own_Position and Year is present (so Jim and any others would get their own rows with Own_Name and Boss_Name etc.), but there are many entries where the Boss_Position (but notBoss_Name`) is missing.

Thus, I'm trying to find what the Boss_Position will be, given the Boss_Name and Year. I can do this if I need to pull just one observation (just keep the relevant data where Own_Name matches Boss_Name and Year matches, and use the corresponding Own_Position), but I am not sure what the best way to do this would be where I'm looping through all the missing observations for Boss_Name, since using keep would seem to be very destructive and time consuming.

Ideally, the code would look something like

replace Boss_Position = Own_Position[YearNameMatcher(Boss_Name  Year)] if missing(Boss_Position)

where YearNameMatcher is the function that does what I'm asking for, but I am not sure how to best proceed.

I'm also new to Stata, so I may not be aware of more obvious solutions, though I have tried searching without success.


Solution

  • One solution is to generate a separate and temporary dataset with the unique values of variables Boss_Name-Year and then merge this dataset with the original data. You can try this code with your data:

    snapshot save
    
    keep if Boss_Position!=""
    drop Own*
    duplicates drop Year Boss_name, force
    tempfile boss
    save `boss'
    
    snapshot restore 1
    
    drop Boss_Position
    
    merge m:1 Boss_Name Year using `boss'
    
    snapshot erase _all
    

    This code supposes that the paired values Boss_name and Year are unique. With this approach you don't need to loop through all the missing observations.