Search code examples
stata

Splitting string data and corresponding data into new row


I have some data that has a string variable (US states), a corresponding integer variable (enrollment) and another string.

Unfortunately, some of the cells under the US states variable have multiple states listed separated by a semi-colon. I'd like to split these up into different rows and then divide the corresponding enrollment equally among those states.

For example, I have:

State       Enrollment   Severity
CA            100          Low
MA;PA         50           Medium
WA;OR;ID      120          High

And I want to be able to transform this into:

State       Enrollment    Severity
CA             100          Low
MA             25           Medium
PA             25           Medium
WA             40           High
OR             40           High
ID             40           High

I have tried separating them using the split command and then (in a convoluted fashion, calculating the corresponding enrollment) but I'm not quite sure how to get them on to new rows even with reshape.


EDIT:

I would also like the solution to be able to handle duplicate States.

For example:

State       Enrollment   Severity
CA            100          Low
MA;CA         50           Medium
WA;CA;ID      120          High

Transformed into:

State       Enrollment    Severity
CA             100          Low
MA             25           Medium
CA             25           Medium
WA             40           High
CA             40           High
ID             40           High

Solution

  • Here's one way to do what you want using your original data:

    clear 
    input str10 State Enrollment str10 Severity
    "CA" 100 "Low"
    "MA;PA" 50 "Medium"
    "WA;OR;ID" 120 "High"
    end
    
    generate id = _n
    split State, p(;)
    drop State
    reshape long State, i(State?)
    drop State?
    
    keep if State != ""
    bysort State (id): egen maxval = max(id)
    bysort State (id): generate enrol = Enrollment / maxval
    drop Enrollment
    rename enrol Enrollment
    
    sort id
    drop id _j maxval
    order State Enrollment Severity
    
    list, abbreviate(20)
    
         +-------------------------------+
         | State   Enrollment   Severity |
         |-------------------------------|
      1. |    CA          100        Low |
      2. |    MA           25     Medium |
      3. |    PA           25     Medium |
      4. |    OR           40       High |
      5. |    ID           40       High |
      6. |    WA           40       High |
         +-------------------------------+
    

    EDIT:

    Here's one way to do what you want using your revised data:

    clear
    input str10 State Enrollment str10 Severity
    "CA"            100          "Low"
    "MA;CA"         50           "Medium"
    "WA;CA;ID"      120          "High"
    end
    
    generate id = _n
    split State, p(;)
    drop State
    
    reshape long State, i(id)
    
    keep if State != ""
    bysort id: egen maxval = count(id)
    bysort id: generate enrol = Enrollment / maxval
    drop Enrollment
    rename enrol Enrollment
    
    sort id
    drop id _j maxval
    order State Enrollment Severity
    
    list, abbreviate(20)
    
         +-------------------------------+
         | State   Enrollment   Severity |
         |-------------------------------|
      1. |    CA          100        Low |
      2. |    MA           25     Medium |
      3. |    CA           25     Medium |
      4. |    WA           40       High |
      5. |    CA           40       High |
      6. |    ID           40       High |
         +-------------------------------+