Stata .dta files include labels/descriptions for each column, which can be viewed in Stata using the describe command. For example, the adults and kids variables in this online dataset have the descriptions "number of adults in household" and "number of children in household", respectively:
clear
use http://www.principlesofeconometrics.com/stata/alcohol.dta
describe
Contains data from http://www.principlesofeconometrics.com/stata/alcohol.dta
  obs:         1,000
 vars:             4                          10 Nov 2007 11:33
 size:         5,000                          (_dta has notes)
-------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------------------------------------------------------------
adults          byte    %8.0g                 number of adults in household
kids            byte    %8.0g                 number of children in household
income          int     %8.0g                 weekly income
consume         byte    %8.0g                 =1 if consume alcohol, =0 otherwise
-------------------------------------------------------------------------------------------------------------------------------------
Sorted by:
Those descriptions do not show up in Pandas, for example with describe():
import pandas as pd
df = pd.read_stata('http://www.principlesofeconometrics.com/stata/alcohol.dta')
df
adults kids income consume
0 2 2 758 1
1 2 3 1785 1
2 3 0 1200 1
.. ... ... ... ...
997 2 0 1383 1
998 2 2 816 0
999 2 2 387 0
df.describe()
adults kids income consume
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 2.012000 0.722000 649.528000 0.766000
std 0.815181 1.078833 460.657826 0.423584
min 1.000000 0.000000 12.000000 0.000000
25% 2.000000 0.000000 295.000000 1.000000
50% 2.000000 0.000000 562.500000 1.000000
75% 2.000000 1.000000 887.500000 1.000000
max 6.000000 5.000000 3846.000000 1.000000
Is there a way to view this information after loading it into a Pandas DataFrame using read_stata()?
Using Stata's toy dataset auto as an example:
sysuse auto, clear
describe
Contains data from auto.dta
  obs:            74                          1978 Automobile Data
 vars:            12                          13 Apr 2014 17:45
 size:         3,182                          (_dta has notes)
-------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------------------------------------------------------------
make            str18   %-18s                 Make and Model
price           int     %8.0gc                Price
mpg             int     %8.0g                 Mileage (mpg)
rep78           int     %8.0g                 Repair Record 1978
headroom        float   %6.1f                 Headroom (in.)
trunk           int     %8.0g                 Trunk space (cu. ft.)
weight          int     %8.0gc                Weight (lbs.)
length          int     %8.0g                 Length (in.)
turn            int     %8.0g                 Turn Circle (ft.)
displacement    int     %8.0g                 Displacement (cu. in.)
gear_ratio      float   %6.2f                 Gear Ratio
foreign         byte    %8.0g      origin     Car type
-------------------------------------------------------------------------------------------------------------------------------------
Sorted by: foreign
The following works for me:
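import pandas as pd
# iterator=True returns a StataReader object instead of a DataFrame
data = pd.read_stata('auto.dta', iterator=True)
# variable_labels() returns a dict mapping each column name to its Stata label
labels = data.variable_labels()
labels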
Out[5]:
{'make': 'Make and Model',
'price': 'Price',
'mpg': 'Mileage (mpg)',
'rep78': 'Repair Record 1978',
'headroom': 'Headroom (in.)',
'trunk': 'Trunk space (cu. ft.)',
'weight': 'Weight (lbs.)',
'length': 'Length (in.)',
'turn': 'Turn Circle (ft.) ',
'displacement': 'Displacement (cu. in.)',
'gear_ratio': 'Gear Ratio',
'foreign': 'Car type'}
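If you want the DataFrame and the labels from a single pass over the file, the same reader can supply both. The following is only a minimal sketch: to stay runnable without Stata or network access it first writes a tiny stand-in file, so the `sample.dta` path, the three-row `sample` frame, and the `overview` table are assumptions made for illustration, not part of the datasets above. Using the reader in a `with` block is a choice made here so the file is closed afterwards.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for a real .dta file (illustration only).
sample = pd.DataFrame({"adults": [2, 2, 3], "kids": [2, 3, 0]})
sample_labels = {
    "adults": "number of adults in household",
    "kids": "number of children in household",
}
path = os.path.join(tempfile.mkdtemp(), "sample.dta")
sample.to_stata(path, write_index=False, variable_labels=sample_labels)

# Read the data and the variable labels from the same StataReader.
with pd.read_stata(path, iterator=True) as reader:
    df = reader.read()
    labels = reader.variable_labels()

# Put each label next to its column's dtype, loosely like Stata's describe.
overview = pd.DataFrame(
    {"dtype": df.dtypes.astype(str), "label": pd.Series(labels)}
)
print(overview)
```

With a real file, replace `path` with its location (for example `'auto.dta'`) and skip the lines that create the stand-in file.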