I have some data that has a string variable (US states
), a corresponding integer variable (enrollment
) and another string.
Unfortunately, some of the cells under the US states
variable have multiple states listed separated by a semi-colon. I'd like to split these up into different rows and then divide the corresponding enrollment equally among those states.
For example, I have:
State Enrollment Severity
CA 100 Low
MA;PA 50 Medium
WA;OR;ID 120 High
And I want to be able to transform this into:
State Enrollment Severity
CA 100 Low
MA 25 Medium
PA 25 Medium
WA 40 High
OR 40 High
ID 40 High
I have tried separating them using the split
command and then (in a convoluted fashion, calculating the corresponding enrollment) but I'm not quite sure how to get them on to new rows even with reshape
.
EDIT:
I would also like the solution to be able to handle duplicate States.
For example:
State Enrollment Severity
CA 100 Low
MA;CA 50 Medium
WA;CA;ID 120 High
Transformed into:
State Enrollment Severity
CA 100 Low
MA 25 Medium
CA 25 Medium
WA 40 High
CA 40 High
ID 40 High
Here's one way to do what you want using your original data:
clear
input str10 State Enrollment str10 Severity
"CA" 100 "Low"
"MA;PA" 50 "Medium"
"WA;OR;ID" 120 "High"
end
generate id = _n
split State, p(;)
drop State
reshape long State, i(State?)
drop State?
keep if State != ""
bysort State (id): egen maxval = max(id)
bysort State (id): generate enrol = Enrollment / maxval
drop Enrollment
rename enrol Enrollment
sort id
drop id _j maxval
order State Enrollment Severity
list, abbreviate(20)
+-------------------------------+
| State Enrollment Severity |
|-------------------------------|
1. | CA 100 Low |
2. | MA 25 Medium |
3. | PA 25 Medium |
4. | OR 40 High |
5. | ID 40 High |
6. | WA 40 High |
+-------------------------------+
EDIT:
Here's one way to do what you want using your revised data:
clear
input str10 State Enrollment str10 Severity
"CA" 100 "Low"
"MA;CA" 50 "Medium"
"WA;CA;ID" 120 "High"
end
generate id = _n
split State, p(;)
drop State
reshape long State, i(id)
keep if State != ""
bysort id: egen maxval = count(id)
bysort id: generate enrol = Enrollment / maxval
drop Enrollment
rename enrol Enrollment
sort id
drop id _j maxval
order State Enrollment Severity
list, abbreviate(20)
+-------------------------------+
| State Enrollment Severity |
|-------------------------------|
1. | CA 100 Low |
2. | MA 25 Medium |
3. | CA 25 Medium |
4. | WA 40 High |
5. | CA 40 High |
6. | ID 40 High |
+-------------------------------+