
How to split data and assign it into designated variables?


I have data in Stata on respondents' feelings about their current situation. There are seven types of feeling. The data are stored in the following format (note that the variable is a string, and one person can give more than one answer):

feeling
4,7
1,3,4
2,5,6,7
1,2,3,4,5,6,7

Since the data is a string, I tried to separate it by

split feeling, parse (,)

and I got the result

feeling1  feeling2  feeling3  feeling4  feeling5  feeling6  feeling7
4         7
1         3         4
2         5         6         7
1         2         3         4         5         6         7

However, this is not the result I want. Each number should go into the variable with the matching suffix, for instance:

feeling1  feeling2  feeling3  feeling4  feeling5  feeling6  feeling7
                              4                             7
1                   3         4
          2                             5         6         7
1         2         3         4         5         6         7

I am not sure whether there is a built-in command or function for this kind of problem. I am thinking of using forval to loop over every value in each variable and move it into the correct variable.


Solution

  • A loop over the distinct values is enough here. I first give your example in the form that the Stata tag wiki explains is more helpful, and then give code that creates the variables you want as numeric variables.

    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str13 feeling
    "4,7"          
    "1,3,4"        
    "2,5,6,7"      
    "1,2,3,4,5,6,7"
    end
    
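    * wanted1-wanted7 hold the feeling number if present, missing otherwise
    * better1-better7 are 0/1 indicators for whether each feeling is present
    forval j = 1/7 {
        gen wanted`j' = `j' if strpos(feeling, "`j'")
        gen better`j' = strpos(feeling, "`j'") > 0
    }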
    
    l feeling wanted1-better3
    
         +---------------------------------------------------------------------------+
         |       feeling   wanted1   better1   wanted2   better2   wanted3   better3 |
         |---------------------------------------------------------------------------|
      1. |           4,7         .         0         .         0         .         0 |
      2. |         1,3,4         1         1         .         0         3         1 |
      3. |       2,5,6,7         .         0         2         1         .         0 |
      4. | 1,2,3,4,5,6,7         1         1         2         1         3         1 |
         +---------------------------------------------------------------------------+
    

    If you wanted string variables instead, replace the first gen line inside the loop with

    gen wanted`j' = "`j'" if strpos(feeling, "`j'")
    

    Had the number of feelings been 10 or more, you would have needed more careful code because, for example, a search for "1" would also find it within "10".
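
    One way to guard against that is sketched below with made-up data in which the codes run from 1 to 12 (an assumption, not part of the question). Wrapping both the string and the search term in commas means that only whole codes match. The sketch assumes the codes are separated by commas with no spaces.

    ```stata
    * Hypothetical example: codes 1 to 12, comma-separated, no spaces
    clear
    input str20 feeling
    "4,7"
    "1,10"
    "10,12"
    "2,11,12"
    end

    forval j = 1/12 {
        * a search for ",1," succeeds in ",1,10," but fails in ",10,12,"
        gen byte better`j' = strpos("," + feeling + ",", ",`j',") > 0
    }

    list feeling better1 better10 better11 better12
    ```

    If the real data contained spaces after the commas, they would need to be removed first, for example with subinstr(), before this comparison would work.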

    Indicator (some say dummy) variables with distinct values 1 or 0 are immensely more valuable for most analysis of this kind of data.
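
    As a small illustration of why, here is a sketch that uses the example data from the question. The mean of a 0/1 indicator is the proportion of respondents reporting that feeling, and summing the indicators across a row counts the feelings each respondent reported. The variable name nfeelings is made up for this example.

    ```stata
    clear
    input str13 feeling
    "4,7"
    "1,3,4"
    "2,5,6,7"
    "1,2,3,4,5,6,7"
    end

    forval j = 1/7 {
        gen byte better`j' = strpos(feeling, "`j'") > 0
    }

    * means of 0/1 indicators are proportions of respondents
    summarize better*

    * number of feelings reported by each respondent
    egen nfeelings = rowtotal(better*)
    list feeling nfeelings
    ```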

    Note Stata-related sources such as this FAQ, this paper, and this paper.