Hello I want to use National Science Foundation dataset but the raw excel file variable names does not transpose the data properly. Does anyone have any sample code on how to transpose the dataset so that it fit the stata format for analysis. Here is the raw excel file so you can understand the problem.
*EDITED to add view of import
Keeping in mind Nick's apt point about the nature of SO, I do have some example code that may be helpful here. Without knowing what you mean by "transpose" I cannot give an exact answer, but you can adapt the reshape
commands below to your purposes.
Import and view
// Import Excel File defining TL and BR of table
import excel "nsb20197-tabs02-012.xlsx", cellrange(A6:K177) clear
list in 1/15
+----------------------------------------------------------------------------------------------------------------------------------------+
| A B C D E F G H I J K |
|----------------------------------------------------------------------------------------------------------------------------------------|
1. | Associate's level |
2. | American Indian or Alaska Native |
3. | All fields 6282 4131 2151 65.8 34.2 8935 5697 3238 63.8 36.2 |
4. | S&E 608 386 222 63.5 36.5 942 495 447 52.5 47.5 |
5. | Engineering 17 4 13 23.5 76.5 50 8 42 16 84 |
|----------------------------------------------------------------------------------------------------------------------------------------|
6. | Natural sciences 384 220 164 57.3 42.7 453 174 279 38.4 61.6 |
7. | Social and behavioral sciences 207 162 45 78.3 21.7 439 313 126 71.3 28.7 |
8. | Non-S&E 5674 3745 1929 66 34 7993 5202 2791 65.09999999999999 34.9 |
9. | Asian or Pacific Islander |
10. | All fields 27313 15522 11791 56.8 43.2 54809 30916 23893 56.4 43.6 |
|----------------------------------------------------------------------------------------------------------------------------------------|
11. | S&E 2649 1284 1365 48.5 51.5 7862 3492 4370 44.4 55.6 |
12. | Engineering 160 23 137 14.4 85.59999999999999 574 111 463 19.3 80.7 |
13. | Natural sciences 2010 939 1071 46.7 53.3 4419 1562 2857 35.3 64.7 |
14. | Social and behavioral sciences 479 322 157 67.2 32.8 2869 1819 1050 63.4 36.6 |
15. | Non-S&E 24664 14238 10426 57.7 42.3 46947 27424 19523 58.4 41.6 |
Structuring the data
// Create supercategories
* level = Column A if column A contains the word level (ignoring case)
gen level = word(A,1) if ustrregexm(A, "level", 1), before(A)
* demographic = Column A if next ob in column A contains the word all fields (ignoring case)
gen demographic = A if ustrregexm(A[_n+1], "all fields", 1), before(A)
* fill down demographic and level, and drop blank rows
foreach v of varlist level demographic {
replace `v' = `v'[_n-1] if missing(`v')
}
drop if mi(demographic) | demographic == A | regexm(A, level)
// rename variables
* rename A
rename A field
* rename count columns
local list "B C D E F"
local year = 2000
rename (`list') (all_`year' female_`year' male_`year' perc_female_`year' perc_male_`year' )
local list "G H I J K"
local year = 2017
rename (`list') (all_`year' female_`year' male_`year' perc_female_`year' perc_male_`year' )
* destring
destring *_2000 *_2017, replace
Reshaping to long
* reshape long
drop perc*
reshape long all_ male_ female_, i(level demographic field) j(year)
rename *_ degrees*
reshape long degrees, i(level demographic field year) j(gender) string
An example of how one might reshape wide by field
* test reshape wide by field + ensure the variable name is less than 32 characters post reshape
replace field = lower(strtoname(field))
replace field = substr(field, 1, 32 - strlen("degrees") - 1)
reshape wide degrees, i(level demographic year gender) j(field) string