Search code examples
selectrandomsubsetstatasample

How to sample an item from a variable


I have a large unbalanced panel dataset, which looks as follows:

clear

input year id income
2003 513 500
2003 517 500
2003 518 100
2003 525 900
2003 528 800
2003 531 0
2003 532 300
2003 534 600
2004 513 1000
2004 517 120
2004 523 300
2004 525 700
2004 528 800
2004 531 200
2004 532 600
2004 534 100
end

I want to randomly sample some people by id. id range has gaps in positive natural numbers (minimum 513 and maximum 287321 but there are some panel dropouts, i.e. 514, 515, 516).

I need to preserve the panel feature of the data. Therefore, if a random id is chosen, any year-id combination has to be kept. I do not need a random sample of the data (neither 10% or nor 10 observations). Rather I am interested in a random id-number from my id-column/variable, stored in a way that I can subsequently use it.

Thus, I am looking for a command like "pick one random value out of the given set of values from column ID". I subsequently want to use this randomly picked id in commands such as:

xtline income if id==X

Which is supposed to show me the income for all year of random person/ID X.


Solution

  • This wasn't well explained, at least at first, but I think you want to select panels randomly. The method below selects first observations randomly and then extends any selection to each entire panel. It doesn't take account of the number of observations in any panel. Flagging a selection with -1 is just a minor device so that selected observations sort early. The magic number 5 -- replace with any magic number of panels -- is the number of panels selected (not a percent), which is what you're asking for.

    clear
    input float(year id income)
    2003 513  500
    2004 513 1000
    2003 517  500
    2004 517  120
    2003 518  100
    2004 523  300
    2003 525  900
    2004 525  700
    2003 528  800
    2004 528  800
    2003 531    0
    2004 531  200
    2003 532  300
    2004 532  600
    2003 534  600
    2004 534  100
    end
    
    list, sepby(id) 
    
         +---------------------+
         | year    id   income |
         |---------------------|
      1. | 2003   513      500 |
      2. | 2004   513     1000 |
         |---------------------|
      3. | 2003   517      500 |
      4. | 2004   517      120 |
         |---------------------|
      5. | 2003   518      100 |
         |---------------------|
      6. | 2004   523      300 |
         |---------------------|
      7. | 2003   525      900 |
      8. | 2004   525      700 |
         |---------------------|
      9. | 2003   528      800 |
     10. | 2004   528      800 |
         |---------------------|
     11. | 2003   531        0 |
     12. | 2004   531      200 |
         |---------------------|
     13. | 2003   532      300 |
     14. | 2004   532      600 |
         |---------------------|
     15. | 2003   534      600 |
     16. | 2004   534      100 |
         +---------------------+
    
    bysort id : gen byte first = -(_n == 1)
    set seed 1776
    gen rnd = runiform()
    sort first rnd
    gen wanted = _n <= 5
    bysort id (wanted) : replace wanted = wanted[_N]
    sort id year
    
    list id year if wanted, sepby(id)
    
         +------------+
         |  id   year |
         |------------|
      7. | 525   2003 |
      8. | 525   2004 |
         |------------|
      9. | 528   2003 |
     10. | 528   2004 |
         |------------|
     11. | 531   2003 |
     12. | 531   2004 |
         |------------|
     13. | 532   2003 |
     14. | 532   2004 |
         |------------|
     15. | 534   2003 |
     16. | 534   2004 |
         +------------+