I have multiple observations with different exposures and am using Stata/MP 16.1. I would like to group exposure
by id
depending on whether the exposure is mutually exclusive or not. Please see the data example.
The desired variable is groups
which I made manually. Since the dataset contains >100,000 observations, how can I implement by code for the desired variable groups
?
* Example generated by -dataex-. To install: ssc install dataex
clear
input float id str1 exposure long groups
1 "." 2
1 "a" 2
1 "a" 2
2 "a" 1
2 "." 1
2 "b" 1
2 "c" 1
3 "a" 1
3 "c" 1
3 "c" 1
4 "b" 3
4 "b" 3
4 "b" 3
end
label values groups groups
label def groups 1 "not mutually exclusive", modify
label def groups 2 "only a", modify
label def groups 3 "only b", modify
* Example generated by -dataex-. To install: ssc install dataex
clear
input float id str1 exposure long groups
1 "a" 2
1 "a" 2
1 "a" 2
2 "a" 1
2 "a" 1
2 "b" 1
2 "c" 1
3 "a" 1
3 "c" 1
3 "c" 1
4 "b" 3
4 "b" 3
4 "b" 3
end
label values groups groups
label def groups 1 "not mutually exclusive", modify
label def groups 2 "only a", modify
label def groups 3 "only b", modify
bysort id (exposure) : gen wanted = cond(exposure[1] != exposure[_N], 1, cond(exposure[1] == "a", 2, cond(exposure[1] == "b", 3, .)))
label val wanted groups
assert wanted == groups
The logic is that
if there are different values within an id
, assign 1
otherwise, the values are the same; so
assign 2 if the first value is a
(equivalently all values are a
)
assign 3 if the first value is b
(equivalently all values are b
)
otherwise assign missing -- there shouldn't be any such according to your example, but checking is a good idea.
Naturally you can break that down into shorter statements:
bysort id (exposure) : gen wanted = 1 if exposure[1] != exposure[_N]
by id: replace wanted = 2 if exposure[1] == "a"
by id: replace wanted = 3 if exposure[2] == "b"
EDIT Here is some technique more complicated set-ups. Note that Stata doesn't attach any special meaning to "."
.
* Example generated by -dataex-. To install: ssc install dataex
clear
input float id str1 exposure long groups
1 "." 2
1 "a" 2
1 "a" 2
2 "a" 1
2 "." 1
2 "b" 1
2 "c" 1
3 "a" 1
3 "c" 1
3 "c" 1
4 "b" 3
4 "b" 3
4 "b" 3
end
label values groups groups
label def groups 1 "not mutually exclusive", modify
label def groups 2 "only a", modify
label def groups 3 "only b", modify
label def groups 4 "only c", modify
gen OK = exposure != "."
sort OK id exposure
by OK id: gen wanted = 1 if OK & exposure[1] != exposure[_N]
by OK id: replace wanted = 2 if wanted == . & OK & exposure[1] == "a"
by OK id: replace wanted = 3 if wanted == . & OK & exposure[1] == "b"
by OK id: replace wanted = 4 if wanted == . & OK & exposure[1] == "c"
bysort id (exposure OK) : replace wanted = wanted[_N]
drop OK
label val wanted groups
list, sepby(id)
+-----------------------------------------------------------------+
| id exposure groups wanted |
|-----------------------------------------------------------------|
1. | 1 . only a only a |
2. | 1 a only a only a |
3. | 1 a only a only a |
|-----------------------------------------------------------------|
4. | 2 . not mutually exclusive not mutually exclusive |
5. | 2 a not mutually exclusive not mutually exclusive |
6. | 2 b not mutually exclusive not mutually exclusive |
7. | 2 c not mutually exclusive not mutually exclusive |
|-----------------------------------------------------------------|
8. | 3 a not mutually exclusive not mutually exclusive |
9. | 3 c not mutually exclusive not mutually exclusive |
10. | 3 c not mutually exclusive not mutually exclusive |
|-----------------------------------------------------------------|
11. | 4 b only b only b |
12. | 4 b only b only b |
13. | 4 b only b only b |
+-----------------------------------------------------------------+