I have what is essentially an allocation problem.
What I have: observations of small geographic areas, like census tracts. For each tract I have the count of people in four age groups, and each tract belongs to a subregion.
I know the small-area distribution is not entirely correct, because I know the correct distribution at a higher level of aggregation (the subregion level), and the finer tract-level data, when summed, show different group totals.
What I would like to have: a tract-level, disaggregate distribution across the four groups that is consistent with the summary-level distribution known to be correct, while retaining the signal of the original tract-level distribution; that is, adjust it based on the coarser data, but don't throw it out the window.
What I would like to do, then, is shift the tract-level population counts on the margins, meeting the quality criteria listed in the function docstring further down, with the first two being most important (I realize there are tradeoffs in meeting all of them).
Mock data and placeholder code follow:
import numpy as np
import pandas as pd

n = 1000
np.random.seed(123)
df_small_area_scale = pd.DataFrame(data={
    'grp1': np.random.randint(10, 250, n),
    'grp2': np.random.randint(10, 250, n),
    'grp3': np.random.randint(10, 250, n),
    'grp4': np.random.randint(10, 250, n),
    'subregion': np.random.choice(['A', 'B', 'C', 'D', 'E'], n),
    'tract_id': range(n)}).set_index(['subregion', 'tract_id'])
df_small_area_scale.head()
                    grp1  grp2  grp3  grp4
subregion tract_id
B         0          119    85    11    19
D         1          136   100    46   239
A         2           76    26   198   109
B         3          230   180    84   222
A         4          108   101   222   244
Aggregating this by subregion, we get this:
df_small_area_scale.groupby(level=0).sum()
            grp1   grp2   grp3   grp4
subregion
A          27241  27050  27471  26215
B          26507  24696  23315  24857
C          27474  28871  28882  28743
D          26671  26163  25077  27612
E          22739  23077  23797  24473
The targets come from a summary table of known subregional totals:

summary_area_scale = pd.DataFrame(data={'grp1': [22000, 24000, 21000, 25000, 28000],
                                        'grp2': [24000, 22000, 26000, 20000, 28000],
                                        'grp3': [22000, 24000, 21000, 25000, 28000],
                                        'grp4': [22000, 24000, 21000, 25000, 28000],
                                        'subregion': list('ABCDE')}).set_index('subregion')
summary_area_scale

            grp1   grp2   grp3   grp4
subregion
A          22000  24000  22000  22000
B          24000  22000  24000  24000
C          21000  26000  21000  21000
D          25000  20000  25000  25000
E          28000  28000  28000  28000

And let's get the target shares for each subregion in each group:

summary_area_scale_shares = summary_area_scale.stack().groupby(level=0).apply(lambda x: x / float(x.sum()))
summary_area_scale_shares.head()

subregion
A          grp1    0.244444
           grp2    0.266667
           grp3    0.244444
           grp4    0.244444
B          grp1    0.255319
dtype: float64
These are the "known" subregional distributions, and it is these I would like the tract-level data to be adjusted to, such that when tracts are aggregated they roughly match these regional totals in each group. Specifically, grp4 in subregion A sums to 26,215, but per the target it should be 22,000, so tracts in subregion A should see persons reclassified from grp4 to some of the other groups.
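For reference, the size of the shift needed in each group can be computed directly. A small sketch (required_shift is a name I'm introducing here, not from the original post):

# Gap between target and current totals, per subregion and group
# (positive = the group must gain people, negative = lose them)
required_shift = summary_area_scale - df_small_area_scale.groupby(level=0).sum()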
One idea is to sample tracts within each subregion and then move people in proportion to the overall number of people needing to be moved from one bin to another, although I am not sure there is a clever way of doing it that meets the criteria above (a rough sketch of this proportional idea appears after the placeholder code below).
What is causing me problems is mainly identifying a way of reallocating people across groups to match the subregional totals while maintaining the tract-level totals and not completely throwing away the pre-existing spatial distribution, which I want to keep as a signal (but adjusted to a now-known different overall distribution).
Any ideas, in general terms, on how to make a detailed distribution fit a more aggregate one, beyond just sampling tracts and moving x people from grp4 -> grp3, grp2 -> grp1, and whatever the difference is between the existing and the target distributions?
The function below is largely a lookup into a table of regional shares in each group, pushing that distribution to each tract; it doesn't do anything yet but set up the data bindings.
def some_redistribution_algorithm(df):
    """
    How many persons need to be moved across groups in each subregion?
    A minimal solution is to just take those shifts and apply them uniformly:
    tracts keep the same counts, but the *distribution* across bins changes slightly.

    Quality criteria for the algorithm: switch population at the tract level such that
    - tract-level population counts are maintained
    - the pre- and post-adjustment spatial distribution is largely unchanged
    - the change is not disproportional, dramatically impacting some tracts over others
      (i.e. a tract with 10 grp4 persons losing 8 would lose 80%, while a tract
      with 100 grp4 persons would lose 8%)
    """
    # Shift needed per group: target minus current subregional total
    adjustments = summary_area_scale.xs(df.name) - df.sum()
    size = adjustments.abs().loc['grp4'].astype(np.int64) / df.shape[0]
    print("Processing %s (%s tracts), beg. pop: %s, avg pop to move (here grp4) %s" % (
        df.name, df.shape[0], df.sum().loc['grp4'].astype(np.int64), size))
    print('Average pop per tract:')
    print(df.sum() / df.shape[0])
    # Tract-level distribution if all tracts had the same distribution
    # within each subregion (placeholder)
    return df_small_area_scale.xs(df.name).mul(summary_area_scale_shares.unstack().xs(df.name), axis=1)
    # samplerows = np.random.choice(a=df.index, size=size)
    # df.loc[samplerows, :] = df.loc[samplerows, :]  # , p=df.totalshare.tolist()), :]

df_small_area_scale.groupby(level=0).apply(some_redistribution_algorithm)
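As a rough sketch of the proportional idea mentioned above (my own illustration, not a finished solution): scale each tract's counts by the ratio of target to current group shares, then rescale each row so the tract total is unchanged. A single pass won't hit the targets exactly; alternating the two steps repeatedly is essentially iterative proportional fitting.

def proportional_shift(df):
    # Target and current group shares for this subregion
    target_shares = summary_area_scale_shares.unstack().xs(df.name)
    current_shares = df.sum() / float(df.sum().sum())
    # Pull every tract toward the target shares...
    adjusted = df.mul(target_shares / current_shares, axis=1)
    # ...then rescale each row so the tract-level total is preserved
    return adjusted.mul(df.sum(axis=1) / adjusted.sum(axis=1), axis=0)

shifted = df_small_area_scale.groupby(level=0).apply(proportional_shift)

The result is fractional; rounding back to integers while preserving row totals would be a separate step.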
If I understand your question correctly, I think iterative proportional fitting (IPF) might be what you're looking for. If I may, I'll state a similar problem I had recently:
I know the age distribution at the metropolitan level, and I know the total number of people in each tract, and because of how the census works I think I know the distribution by age in each tract, but I'm not sure.
I want to meet the total population within each tract (the row marginals) and the age distribution at the metropolitan level (the column marginals), and I want to "seed" IPF with the distribution in each tract, which is my best guess at the answer. Of course the seed doesn't work (the numbers won't add up), so you deviate from that guess just enough to meet the marginals. That is exactly the purpose of iterative proportional fitting.
Perhaps not bulletproof, but the code (in Python/NumPy) I used was this:
import numpy as np

# This should be fairly self-explanatory if you know IPF.
# seed_matrix is your best guess at the cell values, col_marginals are the
# observed column marginals, and row_marginals is the same for rows.
def simple_ipf(seed_matrix, col_marginals, row_marginals, tolerance=1, cnt=0):
    assert np.absolute(row_marginals.sum() - col_marginals.sum()) < 5.0

    # first normalize on columns
    ratios = col_marginals / seed_matrix.sum(axis=0)
    seed_matrix *= ratios
    closeness = np.absolute(row_marginals - seed_matrix.sum(axis=1)).sum()
    assert np.absolute(col_marginals - seed_matrix.sum(axis=0)).sum() < .01
    # print("row closeness", closeness)
    if closeness < tolerance:
        return seed_matrix

    # then normalize on rows
    ratios = row_marginals / seed_matrix.sum(axis=1)
    ratios[row_marginals == 0] = 0
    seed_matrix = seed_matrix * ratios.reshape((ratios.size, 1))
    assert np.absolute(row_marginals - seed_matrix.sum(axis=1)).sum() < .01
    closeness = np.absolute(col_marginals - seed_matrix.sum(axis=0)).sum()
    # print("col closeness", closeness)
    if closeness < tolerance:
        return seed_matrix

    # give up after 50 iterations
    if cnt >= 50:
        return seed_matrix

    return simple_ipf(seed_matrix, col_marginals, row_marginals,
                      tolerance, cnt + 1)
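To connect this back to the question's data, here is a sketch of how simple_ipf might be applied per subregion. Note the assumptions: fit_subregion is my own hypothetical wrapper, and since the posted targets don't sum to the current subregion populations, I rescale the target shares to each subregion's actual total so the marginals are consistent.

def fit_subregion(df):
    seed = df.values.astype(float)      # tracts x groups
    row_marginals = seed.sum(axis=1)    # each tract keeps its current total
    total = row_marginals.sum()
    # Assumption: only the target *shares* are meaningful, so scale them
    # to the subregion's actual population for the column marginals
    shares = summary_area_scale.loc[df.name] / float(summary_area_scale.loc[df.name].sum())
    col_marginals = shares.values * total
    fitted = simple_ipf(seed, col_marginals, row_marginals)
    return pd.DataFrame(fitted, index=df.index, columns=df.columns)

adjusted = df_small_area_scale.groupby(level=0).apply(fit_subregion)

Each tract's total is preserved, the subregional group shares match the targets to within the tolerance, and each tract's deviation from its subregion average is retained as far as the marginals allow.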