Search code examples
formatstatatranspose

Transposing Data Variable names in Stata


Hello I want to use National Science Foundation dataset but the raw excel file variable names does not transpose the data properly. Does anyone have any sample code on how to transpose the dataset so that it fit the stata format for analysis. Here is the raw excel file so you can understand the problem.

Raw Excel File from NSF website


Solution

  • *EDITED to add view of import

    Keeping in mind Nick's apt point about the nature of SO, I do have some example code that may be helpful here. Without knowing what you mean by "transpose" I cannot give an exact answer, but you can adapt the reshape commands below to your purposes.

    Import and view

    // Import Excel File defining TL and BR of table
        import excel "nsb20197-tabs02-012.xlsx", cellrange(A6:K177) clear
        list in 1/15
    
         +----------------------------------------------------------------------------------------------------------------------------------------+
         |                                A       B       C       D      E                   F       G       H       I                   J      K |
         |----------------------------------------------------------------------------------------------------------------------------------------|
      1. |                Associate's level                                                                                                       |
      2. | American Indian or Alaska Native                                                                                                       |
      3. |                       All fields    6282    4131    2151   65.8                34.2    8935    5697    3238                63.8   36.2 |
      4. |                              S&E     608     386     222   63.5                36.5     942     495     447                52.5   47.5 |
      5. |                      Engineering      17       4      13   23.5                76.5      50       8      42                  16     84 |
         |----------------------------------------------------------------------------------------------------------------------------------------|
      6. |                 Natural sciences     384     220     164   57.3                42.7     453     174     279                38.4   61.6 |
      7. |   Social and behavioral sciences     207     162      45   78.3                21.7     439     313     126                71.3   28.7 |
      8. |                          Non-S&E    5674    3745    1929     66                  34    7993    5202    2791   65.09999999999999   34.9 |
      9. |        Asian or Pacific Islander                                                                                                       |
     10. |                       All fields   27313   15522   11791   56.8                43.2   54809   30916   23893                56.4   43.6 |
         |----------------------------------------------------------------------------------------------------------------------------------------|
     11. |                              S&E    2649    1284    1365   48.5                51.5    7862    3492    4370                44.4   55.6 |
     12. |                      Engineering     160      23     137   14.4   85.59999999999999     574     111     463                19.3   80.7 |
     13. |                 Natural sciences    2010     939    1071   46.7                53.3    4419    1562    2857                35.3   64.7 |
     14. |   Social and behavioral sciences     479     322     157   67.2                32.8    2869    1819    1050                63.4   36.6 |
     15. |                          Non-S&E   24664   14238   10426   57.7                42.3   46947   27424   19523                58.4   41.6 |
    
    
    
    

    Structuring the data

    // Create supercategories
        
        * level = Column A if column A contains the word level (ignoring case)
        gen level = word(A,1) if ustrregexm(A, "level", 1), before(A)
        
        * demographic = Column A if next ob in column A contains the word all fields (ignoring case)
        gen demographic = A if ustrregexm(A[_n+1], "all fields", 1), before(A)
        
        * fill down demographic and level, and drop blank rows
        foreach v of varlist level demographic {
            replace `v' = `v'[_n-1] if missing(`v')
        }
        drop if mi(demographic) | demographic == A | regexm(A, level)
        
    // rename variables 
    
        * rename A
        rename A field
        
        * rename count columns
        local list "B C D E F" 
        local year = 2000
        rename (`list') (all_`year' female_`year' male_`year' perc_female_`year' perc_male_`year' )
        
        local list "G H I J K" 
        local year = 2017
        rename (`list') (all_`year' female_`year' male_`year' perc_female_`year' perc_male_`year' )
        
        * destring 
        destring *_2000 *_2017, replace
    

    Reshaping to long

        * reshape long
        drop perc*
        reshape long all_ male_ female_, i(level demographic field) j(year)
        
        rename *_ degrees*
        reshape long degrees,  i(level demographic field year) j(gender) string
    

    An example of how one might reshape wide by field

        * test reshape wide by field + ensure the variable name is less than 32 characters post reshape 
        replace field = lower(strtoname(field))
        replace field = substr(field, 1, 32 - strlen("degrees") - 1) 
        reshape wide degrees, i(level demographic year gender) j(field) string