
View Stata variable labels in Pandas


Stata .dta files include a label (description) for each variable, which Stata displays with the describe command. For example, the adults and kids variables in this online dataset have the descriptions "number of adults in household" and "number of children in household", respectively:

clear
use http://www.principlesofeconometrics.com/stata/alcohol.dta

describe

Contains data from http://www.principlesofeconometrics.com/stata/alcohol.dta
  obs:         1,000                          
 vars:             4                          10 Nov 2007 11:33
 size:         5,000                          (_dta has notes)
-------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------------------------------------------------------------
adults          byte    %8.0g                 number of adults in household
kids            byte    %8.0g                 number of children in household
income          int     %8.0g                 weekly income
consume         byte    %8.0g                 =1 if consume alcohol, =0 otherwise
-------------------------------------------------------------------------------------------------------------------------------------
Sorted by: 

Those descriptions do not show up in Pandas, either when displaying the DataFrame or when calling describe():

import pandas as pd
df = pd.read_stata('http://www.principlesofeconometrics.com/stata/alcohol.dta')
df

     adults  kids  income  consume
0         2     2     758        1
1         2     3    1785        1
2         3     0    1200        1
..      ...   ...     ...      ...
997       2     0    1383        1
998       2     2     816        0
999       2     2     387        0

df.describe()

            adults         kids       income      consume
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      2.012000     0.722000   649.528000     0.766000
std       0.815181     1.078833   460.657826     0.423584
min       1.000000     0.000000    12.000000     0.000000
25%       2.000000     0.000000   295.000000     1.000000
50%       2.000000     0.000000   562.500000     1.000000
75%       2.000000     1.000000   887.500000     1.000000
max       6.000000     5.000000  3846.000000     1.000000

Is there a way to view these variable labels after loading the file into a Pandas DataFrame with read_stata()?


Solution

  • Using Stata's toy dataset auto as an example:

    sysuse auto, clear
    
    describe
    
    Contains data from auto.dta
      obs:            74                          1978 Automobile Data
     vars:            12                          13 Apr 2014 17:45
     size:         3,182                          (_dta has notes)
    -------------------------------------------------------------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    -------------------------------------------------------------------------------------------------------------------------------------
    make            str18   %-18s                 Make and Model
    price           int     %8.0gc                Price
    mpg             int     %8.0g                 Mileage (mpg)
    rep78           int     %8.0g                 Repair Record 1978
    headroom        float   %6.1f                 Headroom (in.)
    trunk           int     %8.0g                 Trunk space (cu. ft.)
    weight          int     %8.0gc                Weight (lbs.)
    length          int     %8.0g                 Length (in.)
    turn            int     %8.0g                 Turn Circle (ft.)
    displacement    int     %8.0g                 Displacement (cu. in.)
    gear_ratio      float   %6.2f                 Gear Ratio
    foreign         byte    %8.0g      origin     Car type
    -------------------------------------------------------------------------------------------------------------------------------------
    Sorted by: foreign
    

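    The following works for me: passing iterator=True makes read_stata() return a StataReader instead of a DataFrame, and its variable_labels() method returns the labels as a dict: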

    import pandas as pd
    data = pd.read_stata('auto.dta', iterator=True)
    labels = data.variable_labels()
    labels
    
    Out[5]: 
    {'make': 'Make and Model',
     'price': 'Price',
     'mpg': 'Mileage (mpg)',
     'rep78': 'Repair Record 1978',
     'headroom': 'Headroom (in.)',
     'trunk': 'Trunk space (cu. ft.)',
     'weight': 'Weight (lbs.)',
     'length': 'Length (in.)',
     'turn': 'Turn Circle (ft.) ',
     'displacement': 'Displacement (cu. in.)',
     'gear_ratio': 'Gear Ratio',
     'foreign': 'Car type'}
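
The same approach should carry over to the dataset in the question. Below is a minimal sketch of one way to get both the labels and the data from a single reader. To keep it runnable offline, it first writes a small stand-in file with to_stata(); the file name example.dta, the three sample rows, and the label_table name are assumptions made for illustration and are not part of the original dataset or answer.

```python
import os
import tempfile

import pandas as pd

# Assumption: a tiny stand-in for alcohol.dta, written locally so the
# sketch does not depend on network access.
df_out = pd.DataFrame({"adults": [2, 2, 3], "kids": [2, 3, 0]})
path = os.path.join(tempfile.mkdtemp(), "example.dta")
df_out.to_stata(
    path,
    write_index=False,
    variable_labels={
        "adults": "number of adults in household",
        "kids": "number of children in household",
    },
)

# iterator=True returns a StataReader rather than a DataFrame.
# Using it as a context manager closes the file when the block ends.
with pd.read_stata(path, iterator=True) as reader:
    labels = reader.variable_labels()  # dict: column name -> variable label
    df = reader.read()                 # the data as a DataFrame

# Put the labels in a Series for a describe-like view next to the data.
label_table = pd.Series(labels, name="variable label")
print(label_table)
```

With the real file, replacing path by the URL or local path of the .dta file should be the only change needed. Keeping the labels in a separate dict or Series is a deliberate choice here, since a plain DataFrame has no dedicated slot for per-column descriptions.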