Search code examples
databasestata

Transforming a dataset in a more compact format (Stata)


initially I was dealing with a dataset that looked something like this:

+------+--------+-----------+-------+
| date |  geo   | variables | value |
+------+--------+-----------+-------+
| 1981 | Canada | var1      | #     |
| 1982 | Canada | var1      | #     |
| 1983 | Canada | var1      | #     |
|  ... | ...    | ...       | ...   |
| 2015 | Canada | var2      | #     |
| 1981 | Canada | var2      | #     |
| 1982 | Canada | var2      | #     |
|  ... | ...    | ...       | ...   |
| 2015 | Canada | var2      | #     |
| 1981 | Quebec | var1      | #     |
| 1982 | Quebec | var1      | #     |
| 1983 | Quebec | var1      | #     |
|  ... | ...    | ...       | ...   |
| 2015 | Quebec | var2      | #     |
| 1981 | Quebec | var2      | #     |
| 1982 | Quebec | var2      | #     |
|  ... | ...    | ...       | ...   |
| 2015 | Quebec | var2      | #     |
+------+--------+-----------+-------+

So I have 35 time periods, two countries and two variables. I would like to transform the table in Stata for it to look like this:

+------+--------+------+------+
| date |  geo   | var1 | var2 |
+------+--------+------+------+
| 1981 | Canada | #    | #    |
| 1982 | Canada | #    | #    |
|  ... | ...    | ...  | ...  |
| 2015 | Canada | #    | #    |
| 1981 | Quebec | #    | #    |
| 1982 | Quebec | #    | #    |
|  ... | ...    | ...  | ...  |
| 2015 | Quebec | #    | #    |
+------+--------+------+------+

However, I'm not having much success with this. I tried to separate the different observations into variables with the command:

separate value, by(variables) generate(var)

Which creates something like this:

+------+--------+------+------+
| date |  geo   | var1 | var2 |
+------+--------+------+------+
| 1981 | Canada | #    | .    |
| 1982 | Canada | #    | .    |
|  ... | ...    | ...  | ...  |
| 2015 | Canada | #    | .    |
| 1981 | Canada | .    | #    |
| 1982 | Canada | .    | #    |
|  ... | ...    | ...  | ...  |
| 2015 | Canada | .    | #    |
| 1981 | Quebec | #    | .    |
| 1982 | Quebec | #    | .    |
|  ... | ...    | ...  | ...  |
| 2015 | Quebec | #    | .    |
| 1981 | Quebec | .    | #    |
| 1982 | Quebec | .    | #    |
|  ... | ...    | ...  | ...  |
| 2015 | Quebec | .    | #    |
+------+--------+------+------+

Which contains a lot of useless missing values.

So, more specifically, I would like something to bring me directly to Table A to B (i.e. without using separate), or a solution to fix Table C into B.

Thanks a lot.


Solution

  • Without sample data, my answer will have to be untested. I think something like the following will get you started in the right direction.

    reshape wide value, i(date geo) j(variables) string
    

    Note that this assumes the contents of your variables variable are suitable for use as variable names. For example, a value of 1potato for variables would be a problem.

    In any event,

    help reshape
    

    should be your first stop.

    Added in response to comment: Here is some data I made up and a demonstration that reshape works for this data. Perhaps you can explain how this data differs from the real data. Your error message suggests that for some combination of date and geo, a particular value of variables occurs more than once.

    . list, sepby(geo)
    
         +----------------------------------+
         | date      geo   variab~s   value |
         |----------------------------------|
      1. | 1981   Canada       var1     111 |
      2. | 1982   Canada       var1     211 |
      3. | 1983   Canada       var1     311 |
      4. | 1981   Canada       var2     112 |
      5. | 1982   Canada       var2     212 |
      6. | 1983   Canada       var2     312 |
         |----------------------------------|
      7. | 1981   Quebec       var1     121 |
      8. | 1982   Quebec       var1     221 |
      9. | 1983   Quebec       var1     321 |
     10. | 1981   Quebec       var2     122 |
     11. | 1982   Quebec       var2     222 |
     12. | 1983   Quebec       var2     322 |
         +----------------------------------+
    
    . reshape wide value, i(geo date) j(variables) string
    (note: j = var1 var2)
    
    Data                               long   ->   wide
    -----------------------------------------------------------------------------
    Number of obs.                       12   ->       6
    Number of variables                   4   ->       4
    j variable (2 values)         variables   ->   (dropped)
    xij variables:
                                      value   ->   valuevar1 valuevar2
    -----------------------------------------------------------------------------
    
    . rename (value*) (*)
    
    . list, sepby(geo)
    
         +-----------------------------+
         | date      geo   var1   var2 |
         |-----------------------------|
      1. | 1981   Canada    111    112 |
      2. | 1982   Canada    211    212 |
      3. | 1983   Canada    311    312 |
         |-----------------------------|
      4. | 1981   Quebec    121    122 |
      5. | 1982   Quebec    221    222 |
      6. | 1983   Quebec    321    322 |
         +-----------------------------+
    
    .