Shifting disaggregate distribution to match more aggregate level distribution


I have what is essentially an allocation problem.

What I have: I have observations of small geographic areas, like census tracts. For each, I have the count of people in four different age groups. Each tract belongs to a subregion.

Now, I know the small-area distribution is not entirely correct, because I know the correct distribution at a higher level of aggregation (the subregion level), and the finer tract-level data, when summed, show group totals that differ from it.

What I would like to have: I would like to adjust my tract-level, disaggregate distribution across the four groups so it is consistent with a summary-level distribution across those four groups known to be correct, while retaining the signal of the tract-level distribution--i.e. adjust it based on the coarser data, but not throw it out the window.

What I would like to do, then, is shift the tract-level population counts on the margins, meeting the following criteria, the first two being most important (I realize there are tradeoffs in meeting all of these; a small checker for the first two follows the list):

  1. it should match, when aggregated, the subregional totals.
  2. adjustment should not change tract level population.
  3. the existing spatial distribution should not be materially changed, but just be marginally adjusted per the new subregional totals.
  4. adjustments should ideally be equitable--i.e. adjustments should not just be on a few records, but be more distributed within each region.
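
For concreteness, the first two criteria are easy to test mechanically. A hypothetical helper (check_adjustment is my own name, written against frames shaped like the mock data below) might look like this:

def check_adjustment(before, after, targets, tol=1e-6):
    """Criteria 1 and 2: hit the subregional targets, keep tract totals.

    before/after: tract-level frames indexed by (subregion, tract_id);
    targets: subregion-level frame of known group totals.
    """
    # criterion 1: aggregated group counts should match the subregional targets
    agg_gap = (after.groupby(level=0).sum() - targets).abs().to_numpy().max()
    # criterion 2: no tract's total population may change
    tract_gap = (after.sum(axis=1) - before.sum(axis=1)).abs().max()
    return agg_gap <= tol and tract_gap <= tol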

Below follows mock data and placeholder code:

First, the small-area data:

import numpy as np
import pandas as pd

n = 1000
np.random.seed(123)
df_small_area_scale = pd.DataFrame(data={
        'grp1': np.random.randint(10, 250, n),
        'grp2': np.random.randint(10, 250, n),
        'grp3': np.random.randint(10, 250, n),
        'grp4': np.random.randint(10, 250, n),
        'subregion': np.random.choice(['A', 'B', 'C', 'D', 'E'], n),
        'tract_id': range(n)}).set_index(['subregion', 'tract_id'])


df_small_area_scale.head()
                    grp1  grp2  grp3  grp4
subregion tract_id                        
B         0          119    85    11    19
D         1          136   100    46   239
A         2           76    26   198   109
B         3          230   180    84   222
A         4          108   101   222   244

And, aggregating this by subregion, we get:

df_small_area_scale.groupby(level=0).sum()
            grp1   grp2   grp3   grp4
subregion                            
A          27241  27050  27471  26215
B          26507  24696  23315  24857
C          27474  28871  28882  28743
D          26671  26163  25077  27612
E          22739  23077  23797  24473

(And, using the target frame summary_area_scale defined in the next step, let's get the target shares for each subregion in each group:)

summary_area_scale_shares = summary_area_scale.stack().groupby(level=0).apply(lambda x: x/float(x.sum()))
summary_area_scale_shares.head()

subregion      
A          grp1    0.244444
           grp2    0.266667
           grp3    0.244444
           grp4    0.244444
B          grp1    0.255319
dtype: float64

Second, here is what the small-area data should sum to at the subregional level.

These are the subregional "known" distributions, the ones I would like the tract-level data adjusted to, such that when tracts are aggregated they roughly match these regional totals in each group. Specifically, grp4 in subregion A sums to 26,215, but per the target it should be 22,000, so tracts in subregion A should see persons re-classified from grp4 to some of the other groups.

summary_area_scale = pd.DataFrame(data={'grp1':[22000,24000,21000,25000,28000],
                                        'grp2':[24000,22000,26000,20000,28000],
                                        'grp3':[22000,24000,21000,25000,28000],
                                        'grp4':[22000,24000,21000,25000,28000],
                                        'subregion':list('ABCDE')}).set_index('subregion')
summary_area_scale
            grp1   grp2   grp3   grp4
subregion                            
A          22000  24000  22000  22000
B          24000  22000  24000  24000
C          21000  26000  21000  21000
D          25000  20000  25000  25000
E          28000  28000  28000  28000
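
The implied shifts are just the element-wise difference between this target table and the observed subregional sums; a negative entry means that group must shed people:

# people to add (+) or remove (-) per group, per subregion
shifts = summary_area_scale - df_small_area_scale.groupby(level=0).sum()
shifts.loc['A', 'grp4']  # -4215: grp4 in A must go from 26,215 down to 22,000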

One idea is to sample tracts within each subregion and then move people in some proportion to the overall number needing to move from one bin to another, although I am not sure there is a clever way of doing it that meets the criteria above.
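
As a strawman, that sampling idea might look like the sketch below (entirely my own placeholder; sample_and_shift is a made-up name, and it assumes the targets preserve each subregion's total population so that surpluses and deficits cancel--otherwise criteria 1 and 2 conflict). It moves one person at a time from a surplus group to a deficit group, picking tracts with probability proportional to their count in the surplus group, so larger tracts absorb proportionally more of the change (criterion 4):

import numpy as np

def sample_and_shift(df, target_row, seed=0):
    """Move people from surplus groups to deficit groups within one
    subregion, one at a time, never changing a tract's total population."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    diff = target_row - df.sum()        # +: group needs people, -: group must shed
    surplus = -diff[diff < 0]           # counts to remove, per group
    deficit = diff[diff > 0]            # counts to add, per group
    for src, n_out in surplus.items():
        for dst, n_in in deficit.items():
            # each (src, dst) pair carries its proportional share of the flow
            n = int(round(n_out * n_in / deficit.sum()))
            for _ in range(n):
                # sample a tract weighted by its current count in the surplus group
                w = (out[src] / out[src].sum()).to_numpy()
                i = rng.choice(len(out), p=w)
                out.iloc[i, out.columns.get_loc(src)] -= 1
                out.iloc[i, out.columns.get_loc(dst)] += 1
    return out

# per subregion:
# adjusted = df_small_area_scale.groupby(level=0, group_keys=False).apply(
#     lambda g: sample_and_shift(g, summary_area_scale.loc[g.name]))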

What is causing me problems is mainly identifying a way of reallocating people across groups to match the subregional totals while maintaining the record-level totals and not completely throwing away the pre-existing spatial distribution, which I want to keep as a signal (but adjusted to a now-known different overall distribution).

Any ideas on how, in general terms, to make a detailed distribution fit a more aggregate one, beyond just sampling tracts and moving x people from grp4 -> grp3, grp2 -> grp1, and whatever the difference is between the existing and target distributions?

Placeholder code

This function is largely a lookup against a table of regional shares in each group, pushing that distribution to each tract, so it doesn't do anything but set up the data bindings.

def some_redistribution_algorithm(df):
    """
    How many persons need to be moved across groups in each subregion?
    A minimal solution is to just take those shifts and apply them uniformly:
    tracts keep the same counts, but the *distribution* across bins changes slightly.

    Quality criteria for the algorithm:
    - switch population across groups at the tract level such that the
      subregional targets are met
    - tract-level population counts are maintained
    - pre- and post-adjustment spatial distributions are largely unchanged
    - change is not disproportional, dramatically impacting some tracts over others
      (i.e. a tract with 10 grp4 persons losing 8 would lose 80%, while a tract
      with 100 would lose 8%)
    """
    # difference between the target totals and the current subregional sums
    adjustments = summary_area_scale.xs(df.name) - df.sum()

    # average number of persons to move per tract (grp4 as an example)
    size = adjustments.abs().loc['grp4'].astype(np.int64) / df.shape[0]
    print("Processing %s (%s tracts), beg. pop: %s, avg pop to move (here grp4) %s"
          % (df.name, df.shape[0], df.sum().loc['grp4'].astype(np.int64), size))
    print('Average pop per tract:')
    print(df.sum() / df.shape[0])

    # tract-level distribution, if all tracts had the same distribution
    # within each subregion (placeholder)
    return df.mul(summary_area_scale_shares.unstack().xs(df.name), axis=1)

    # samplerows = np.random.choice(a=df.index, size=size)
    # df.loc[samplerows, :] = df.loc[samplerows, :]  # , p=df.totalshare.tolist()), :]

df_small_area_scale.groupby(level=0).apply(some_redistribution_algorithm)

Solution

  • If I understand your question correctly, I think iterative proportional fitting might be what you're looking for. If I may, let me state a similar problem I ran into recently:

    I know the age distribution at the metropolitan level, and I know the total number of people in each tract, but because of how the census works, I only think I know the distribution by age in each tract--I'm not sure.

    I want to meet the total population within each tract (the row marginals), I want to meet the age distribution at the metropolitan level (the column marginals), and I want to "seed" IPF with the distribution in each tract, which is my best guess at the answer. Of course the guess doesn't work (the numbers won't add up), so you deviate from it just enough to meet the marginals. That is exactly the purpose of iterative proportional fitting.

    Perhaps not bulletproof but the code (in Python / numpy) I used was this:

    # this should be fairly self explanatory if you know IPF:
    # seed_matrix is your best guess at the cell totals, col_marginals are
    # the observed column marginals, and row_marginals the same for rows
    
    def simple_ipf(seed_matrix, col_marginals, row_marginals, tolerance=1, cnt=0):
        assert np.absolute(row_marginals.sum() - col_marginals.sum()) < 5.0
    
        # first normalize on columns
        ratios = col_marginals / seed_matrix.sum(axis=0)
        seed_matrix *= ratios
        closeness = np.absolute(row_marginals - seed_matrix.sum(axis=1)).sum()
        assert np.absolute(col_marginals - seed_matrix.sum(axis=0)).sum() < .01
        # print "row closeness", closeness
        if closeness < tolerance:
            return seed_matrix
    
        # then normalize on rows
        ratios = row_marginals / seed_matrix.sum(axis=1)
        ratios[row_marginals == 0] = 0  # rows with zero marginal: avoid NaN from 0/0
        seed_matrix = seed_matrix * ratios.reshape((ratios.size, 1))
        assert np.absolute(row_marginals - seed_matrix.sum(axis=1)).sum() < .01
        closeness = np.absolute(col_marginals - seed_matrix.sum(axis=0)).sum()
        # print "col closeness", closeness
        if closeness < tolerance:
            return seed_matrix
    
        if cnt >= 50:
            return seed_matrix
    
        return simple_ipf(seed_matrix, col_marginals, row_marginals,
                          tolerance, cnt+1)
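
    Wiring this up to the question's frames is then a per-subregion call. This part is my own sketch, not battle-tested: fit_subregion is a made-up name, and since the mock targets above don't preserve each subregion's total population, I rescale them so the row and column marginals agree, which the assert at the top of simple_ipf requires:

    import pandas as pd

    def fit_subregion(g):
        # seed with the observed tract-level counts -- the "signal" to keep
        seed = g.to_numpy().astype(float)
        row_marginals = seed.sum(axis=1)             # tract totals (criterion 2)
        col_marginals = summary_area_scale.loc[g.name].to_numpy().astype(float)
        # rescale targets so the marginals agree; without this the mock data
        # trips simple_ipf's consistency assert
        col_marginals *= row_marginals.sum() / col_marginals.sum()
        fitted = simple_ipf(seed, col_marginals, row_marginals)
        return pd.DataFrame(fitted, index=g.index, columns=g.columns)

    adjusted = df_small_area_scale.groupby(level=0, group_keys=False).apply(fit_subregion)

    The fitted counts come back fractional; if you need integer persons, a final rounding pass that repairs the marginals is still required.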