Search code examples
pythonpandasstatalabel

Stata to Pandas: even if there are repeated Value Labels?


I try to open a .dta as DataFrame. But an Error appears:

"ValueError: Value labels for column ... are not unique. The repeated labels are:" followed by labels which appear twice in a column.

I know labeling multiple codes with the exact same value label in stata is not clever (not my fault). After some research I know, pandas will not accept repeated value labels (this IS clever).

But I can't figure out a (good) solution. Is there:

a. a smooth way to open the data with pandas and just rename the doubles (like "label" to "label(2)") in this process?

here is what the data looks like (value labels in brackets):

  | multilabel    
1 | 11 (oneone or twotwo)
2 | 22 (oneone or twotwo)
3 | 33 (other-label-which-is-unique)

my code so far:

import pandas as pd

#followed by any option that delivers this solution:
dataframe = pd.read_stata('file.dta')

or

b. a fast an easy way to tell stata: just rename all repeated value labels by "label(2)" instead of "label"? and yes, the code so far is also rather boring:

use "file.dta"

*followed by a loop which finds repeated labels and changes them

save "file.dta", replace

And yes, there are to many repeated value labels to go trough it one by one.

And here the Stata-Commands to produce a minimal example:

set obs 1
generate var1 = 1 in 1
set obs 2
replace var1 = 2 in 2
set obs 3
replace var1 = 3 in 3
generate var2 = 11 in 1
replace var2 = 22 in 2
replace var2 = 33 in 3
rename var2 multilabel
label define labelrepeat 11 "oneone or twotwo" 22 "oneone or twotwo"
label values multilabel labelrepeat

I am happy for each suggestion!


Solution

  • If you have a variable with repeated labels, then

    decode multilabel, gen(valuelabel)
    label values multilabel
    

    puts the value labels in a string variable and then undoes the association of multilabel values and the previously attached value labels. I don't know what else you need to do and thus why you do anything else. You now have the same information as before. I don't know whether pandas will ignore the definition of value labels.

    For completeness, here's a way to find out which variables have value labels that aren't in one-to-one correspondence with numeric values.

    * your sandbox, simplified and extended  
    clear 
    set obs 3
    generate var1 = _n 
    generate multilabel = 11 * _n
    label define labelrepeat 11 "oneone or twotwo" 22 "oneone or twotwo"
    label values multilabel labelrepeat
    
    label define var1 1 "frog" 2 "toad" 3 "newt"
    label val var1 var1 
    
    
    * my code 
    local bad 
    ds *, has(vallabel) 
    
    quietly foreach v in `r(varlist)' { 
        tempvar decoded diff 
        decode `v', gen(`decoded') 
        bysort `decoded' (`v') : gen `diff' = `v'[1] != `v'[_N] & !missing(`decoded') 
        count if `diff' 
        if r(N) > 0 local bad `bad' `v' 
        drop `decoded' `diff' 
    } 
    
    di "`bad'"