I try to open a .dta as DataFrame. But an Error appears:
"ValueError: Value labels for column ... are not unique. The repeated labels are:" followed by labels which appear twice in a column.
I know labeling multiple codes with the exact same value label in stata is not clever (not my fault). After some research I know, pandas will not accept repeated value labels (this IS clever).
But I can't figure out a (good) solution. Is there:
a. a smooth way to open the data with pandas and just rename the doubles (like "label" to "label(2)") in this process?
here is what the data looks like (value labels in brackets):
| multilabel
1 | 11 (oneone or twotwo)
2 | 22 (oneone or twotwo)
3 | 33 (other-label-which-is-unique)
my code so far:
import pandas as pd
#followed by any option that delivers this solution:
dataframe = pd.read_stata('file.dta')
or
b. a fast an easy way to tell stata: just rename all repeated value labels by "label(2)" instead of "label"? and yes, the code so far is also rather boring:
use "file.dta"
*followed by a loop which finds repeated labels and changes them
save "file.dta", replace
And yes, there are to many repeated value labels to go trough it one by one.
And here the Stata-Commands to produce a minimal example:
set obs 1
generate var1 = 1 in 1
set obs 2
replace var1 = 2 in 2
set obs 3
replace var1 = 3 in 3
generate var2 = 11 in 1
replace var2 = 22 in 2
replace var2 = 33 in 3
rename var2 multilabel
label define labelrepeat 11 "oneone or twotwo" 22 "oneone or twotwo"
label values multilabel labelrepeat
I am happy for each suggestion!
If you have a variable with repeated labels, then
decode multilabel, gen(valuelabel)
label values multilabel
puts the value labels in a string variable and then undoes the association of multilabel
values and the previously attached value labels. I don't know what else you need to do and thus why you do anything else. You now have the same information as before. I don't know whether pandas will ignore the definition of value labels.
For completeness, here's a way to find out which variables have value labels that aren't in one-to-one correspondence with numeric values.
* your sandbox, simplified and extended
clear
set obs 3
generate var1 = _n
generate multilabel = 11 * _n
label define labelrepeat 11 "oneone or twotwo" 22 "oneone or twotwo"
label values multilabel labelrepeat
label define var1 1 "frog" 2 "toad" 3 "newt"
label val var1 var1
* my code
local bad
ds *, has(vallabel)
quietly foreach v in `r(varlist)' {
tempvar decoded diff
decode `v', gen(`decoded')
bysort `decoded' (`v') : gen `diff' = `v'[1] != `v'[_N] & !missing(`decoded')
count if `diff'
if r(N) > 0 local bad `bad' `v'
drop `decoded' `diff'
}
di "`bad'"