Search code examples
appendstata

Appending datasets by matched variables


I have to append three datasets named A, B and C that contain data for various years (for example, 1990, 1991...2014).

The problem is that not all datasets contain all the survey years and therefore the unmatched years need to be dropped manually before appending.

I would like to know if there is any way to append three (or more) datasets that will keep only the matched variables across the datasets (years in this case).


Solution

  • Consider the following toy example:

    clear
    
    input year var
    1995 0
    1996 1
    1997 2
    1998 3 
    1999 4
    2000 5
    end
    
    save data1, replace
    
    clear
    
    input year var
    1995 6
    1996 9
    1998 7 
    1999 8
    2000 9
    end
    
    save data2, replace
    
    clear
    
    input year var
    1995 10
    1996 11
    1997 12
    2000 13
    end
    
    save data3, replace
    

    There is no option that will force append to do what you want, but you can do the following:

    use data1, clear
    
    append using data2 data3
    duplicates tag year, generate(tag)
    
    sort year
    list
    
         +------------------+
         | year   var   tag |
         |------------------|
      1. | 1995     0     2 |
      2. | 1995     6     2 |
      3. | 1995    10     2 |
      4. | 1996     9     2 |
      5. | 1996     1     2 |
         |------------------|
      6. | 1996    11     2 |
      7. | 1997     2     1 |
      8. | 1997    12     1 |
      9. | 1998     7     1 |
     10. | 1998     3     1 |
         |------------------|
     11. | 1999     8     1 |
     12. | 1999     4     1 |
     13. | 2000    13     2 |
     14. | 2000     5     2 |
     15. | 2000     9     2 |
         +------------------+
    
    drop if tag == 1
    list
    
         +------------------+
         | year   var   tag |
         |------------------|
      1. | 1995     0     2 |
      2. | 1995     6     2 |
      3. | 1995    10     2 |
      4. | 1996     9     2 |
      5. | 1996     1     2 |
         |------------------|
      6. | 1996    11     2 |
      7. | 2000    13     2 |
      8. | 2000     5     2 |
      9. | 2000     9     2 |
         +------------------+
    

    You can also further generalize this approach by finding the maximum value of the variable tag and keeping all observations with that value:

    summarize tag
    keep if tag == `r(max)'