I am writing some code in Stata, but the current iteration is quite inefficient. I know how I would change things if I were to write this in python/pandas to make things better, but am unsure how to do this/if it is possible in Stata. For workflow reasons I need to keep this code in the Stata language.
The basic problem is that I have two categorical variables, cata
and catb
. Each can take on several thousand values. I'd like to reduce the number of categories in catb
by keeping a category only if at least two observations occur in the same cata
. I have used egen group()
to ensure that the categories are numeric, and the maximum value is the number of categories for both variables.
My current code looks like this:
foreach var in cata catb {
qui sum `var'
local `var'_max = r(max)
}
local list
forvalues i=1/`cata_max' {
if mod(`i',10)==0 di `i' " of " `cata_max'
forvalues i=j/`catb_max' {
qui count if cata==`i' & catb==`j'
if r(N)>1 local list `list' `j'
}
}
gen mask=0
foreach i in `list' {
replace mask=1 if catb==`i'
}
replace catb=-1 if mask==0
I believe this code runs and will work, but I'm estimating completion in ~16 hours for data with ~100k observations.
The problematic piece of this is the nested loop, as I am looping over all possible values of catb
, even though only about 10ish show up (cata
average size is 20-30). I would like to select only the relevant portion of catb
to loop over in the nested loop, rather than all possible values of catb
. In pandas/python I'd do something like:
val_list=[]
for i in df['cata'].unique():
df_temp=df[df['cata']==i]
rel_vals=df_temp['catb'].unique()
for j in rel_vals:
if len(df_temp[df_temp['catb']])>1:
val_list+=[j]
Which, while not great code, illustrates the idea of looping over only the relevant values of catb
. This should reduce the time taken in the second loop by ~300X, which will yield a substantial speedup even if there's some overhead to finding the unique values.
How do I get only the unique values of a variable within this loop structure? My searches yield only posts like this, which is a roundabout way to calculate unique values of a variable (and does not lend itself well to unique values of a subset of the data as in my situation).
Example:
index cata catb catb2
0 A 1 -1
1 B 2 2
2 C 1 -1
3 A 2 2
4 B 2 2
5 C 2 2
Explanation: catb2
is the new version of catb
, which replaces category "1" because there is no cata
in which multiple "1"s appear. Category "2" remains untouched, because it shows up twice in cata
B.
If I understand this correctly, you want to drop
combinations of cata
and catb
that occur just once:
bysort cata catb: drop if _N == 1