Search code examples
countnested-loopsstatacoding-efficiency

Number of unique values taken by a variable for a subset of the data, within a loop


I am writing some code in Stata, but the current iteration is quite inefficient. I know how I would change things if I were to write this in python/pandas to make things better, but am unsure how to do this/if it is possible in Stata. For workflow reasons I need to keep this code in the Stata language.

The basic problem is that I have two categorical variables, cata and catb. Each can take on several thousand values. I'd like to reduce the number of categories in catb by keeping a category only if at least two observations occur in the same cata. I have used egen group() to ensure that the categories are numeric, and the maximum value is the number of categories for both variables.

My current code looks like this:

foreach var in cata catb {
    qui sum `var'
    local `var'_max = r(max)
}
local list
forvalues i=1/`cata_max' {
    if mod(`i',10)==0 di `i' " of " `cata_max'
    forvalues i=j/`catb_max' {
        qui count if cata==`i' & catb==`j'
        if r(N)>1 local list `list' `j'
    }
}
gen mask=0
foreach i in `list' {
    replace mask=1 if catb==`i'
}
replace catb=-1 if mask==0

I believe this code runs and will work, but I'm estimating completion in ~16 hours for data with ~100k observations.

The problematic piece of this is the nested loop, as I am looping over all possible values of catb, even though only about 10ish show up (cata average size is 20-30). I would like to select only the relevant portion of catb to loop over in the nested loop, rather than all possible values of catb. In pandas/python I'd do something like:

val_list=[]
for i in df['cata'].unique():
    df_temp=df[df['cata']==i]
    rel_vals=df_temp['catb'].unique()
    for j in rel_vals:
        if len(df_temp[df_temp['catb']])>1:
            val_list+=[j]

Which, while not great code, illustrates the idea of looping over only the relevant values of catb. This should reduce the time taken in the second loop by ~300X, which will yield a substantial speedup even if there's some overhead to finding the unique values.

How do I get only the unique values of a variable within this loop structure? My searches yield only posts like this, which is a roundabout way to calculate unique values of a variable (and does not lend itself well to unique values of a subset of the data as in my situation).

Example:

index   cata    catb    catb2
0       A       1       -1
1       B       2       2
2       C       1       -1
3       A       2       2
4       B       2       2
5       C       2       2

Explanation: catb2 is the new version of catb, which replaces category "1" because there is no cata in which multiple "1"s appear. Category "2" remains untouched, because it shows up twice in cata B.


Solution

  • If I understand this correctly, you want to drop combinations of cata and catb that occur just once:

     bysort cata catb: drop if _N == 1