I have a large unbalanced panel dataset, which looks as follows:
clear
input year id income
2003 513 500
2003 517 500
2003 518 100
2003 525 900
2003 528 800
2003 531 0
2003 532 300
2003 534 600
2004 513 1000
2004 517 120
2004 523 300
2004 525 700
2004 528 800
2004 531 200
2004 532 600
2004 534 100
end
I want to randomly sample some people by id
. id
range has gaps in positive natural numbers (minimum 513
and maximum 287321
but there are some panel dropouts, i.e. 514
, 515
, 516
).
I need to preserve the panel feature of the data. Therefore, if a random id is chosen, any year-id combination has to be kept. I do not need a random sample of the data (neither 10% or nor 10 observations). Rather I am interested in a random id-number from my id-column/variable, stored in a way that I can subsequently use it.
Thus, I am looking for a command like "pick one random value out of the given set of values from column ID". I subsequently want to use this randomly picked id in commands such as:
xtline income if id==X
Which is supposed to show me the income for all year of random person/ID X.
This wasn't well explained, at least at first, but I think you want to select panels randomly. The method below selects first observations randomly and then extends any selection to each entire panel. It doesn't take account of the number of observations in any panel. Flagging a selection with -1 is just a minor device so that selected observations sort early. The magic number 5 -- replace with any magic number of panels -- is the number of panels selected (not a percent), which is what you're asking for.
clear
input float(year id income)
2003 513 500
2004 513 1000
2003 517 500
2004 517 120
2003 518 100
2004 523 300
2003 525 900
2004 525 700
2003 528 800
2004 528 800
2003 531 0
2004 531 200
2003 532 300
2004 532 600
2003 534 600
2004 534 100
end
list, sepby(id)
+---------------------+
| year id income |
|---------------------|
1. | 2003 513 500 |
2. | 2004 513 1000 |
|---------------------|
3. | 2003 517 500 |
4. | 2004 517 120 |
|---------------------|
5. | 2003 518 100 |
|---------------------|
6. | 2004 523 300 |
|---------------------|
7. | 2003 525 900 |
8. | 2004 525 700 |
|---------------------|
9. | 2003 528 800 |
10. | 2004 528 800 |
|---------------------|
11. | 2003 531 0 |
12. | 2004 531 200 |
|---------------------|
13. | 2003 532 300 |
14. | 2004 532 600 |
|---------------------|
15. | 2003 534 600 |
16. | 2004 534 100 |
+---------------------+
bysort id : gen byte first = -(_n == 1)
set seed 1776
gen rnd = runiform()
sort first rnd
gen wanted = _n <= 5
bysort id (wanted) : replace wanted = wanted[_N]
sort id year
list id year if wanted, sepby(id)
+------------+
| id year |
|------------|
7. | 525 2003 |
8. | 525 2004 |
|------------|
9. | 528 2003 |
10. | 528 2004 |
|------------|
11. | 531 2003 |
12. | 531 2004 |
|------------|
13. | 532 2003 |
14. | 532 2004 |
|------------|
15. | 534 2003 |
16. | 534 2004 |
+------------+