I have a census frequency distribution and want to calculate the median
please.
import pandas as pd
import math
import numpy as np
geo_code 1 2 3 4 5 6 7
0 815 1026 735 1344 569 2688 741
1228801 - 2457600 305 104 74 177 84 10 40
153601 - 307200 2028 2330 2341 1720 1757 585 1695
19201 - 38400 408 642 505 2002 377 2495 747
1 - 4800 28 38 31 288 54 553 51
2500000 129 67 81 85 69 10 43
307201 - 614400 2044 1903 1775 1611 1833 262 1272
38401 - 76800 613 1202 944 1706 729 1499 862
4801 - 9600 52 56 60 328 43 848 92
614401- 1228800 1254 627 528 773 702 58 229
76801 - 153600 1305 1943 1741 1516 1264 771 1132
9601 - 19200 167 401 237 1048 248 1762 425
00 2 1 0 1 0 0 0
df['new'] = df.index
df[['Upper', 'Lower']] = df['new'].str.split('-', expand=True)
df["Lower"] = df["Lower"].fillna(0)
df['Xi'] = (df['Upper'].astype(float) + df['Lower'].astype(float))/2
print(df.head(2))
geo_code 1 2 3 4 5 6 7 new Upper Lower Xi
0 815 1026 735 1344 569 2688 741 0 0 0 0.0
1228801 - 2457600 305 104 74 177 84 10 40 1228801 - 2457600 1228801 2457600 1843200.5
153601 - 307200 2028 2330 2341 1720 1757 585 1695 153601 - 307200 153601 307200 230400.5
Now the function
to calculate the median would be:
def median_(val, freq):
ord = np.argsort(val)
cdf = np.cumsum(freq[ord])
return val[ord][np.searchsorted(cdf, cdf[-1] // 2)]
where val
is df.Xi
and freq
is a column (1 through 116. culled here for minimum working example)
How do I parse this to df.apply()
so that the result would be a new row? Possibly something like: df.loc['median'] = df.apply(... )
with each median
under its respective column?
You can count mean
of splitted values in index, so possible count new row by all columns:
s = df.index.to_series().str.split('\s*-\s*', expand=True).astype(float).mean(axis=1)
def median_(val, freq):
ord = np.argsort(val)
cdf = np.cumsum(freq[ord])
return val[ord][np.searchsorted(cdf, cdf[-1] // 2)]
df.loc['median'] = df.apply(lambda col: median_(s, col))
print (df)
1 2 3 4 5 6 \
geo_code
0 815.0 1026.0 735.0 1344.0 569.0 2688.0
1228801 - 2457600 305.0 104.0 74.0 177.0 84.0 10.0
153601 - 307200 2028.0 2330.0 2341.0 1720.0 1757.0 585.0
19201 - 38400 408.0 642.0 505.0 2002.0 377.0 2495.0
1 - 4800 28.0 38.0 31.0 288.0 54.0 553.0
2500000 129.0 67.0 81.0 85.0 69.0 10.0
307201 - 614400 2044.0 1903.0 1775.0 1611.0 1833.0 262.0
38401 - 76800 613.0 1202.0 944.0 1706.0 729.0 1499.0
4801 - 9600 52.0 56.0 60.0 328.0 43.0 848.0
614401- 1228800 1254.0 627.0 528.0 773.0 702.0 58.0
76801 - 153600 1305.0 1943.0 1741.0 1516.0 1264.0 771.0
9601 - 19200 167.0 401.0 237.0 1048.0 248.0 1762.0
00 2.0 1.0 0.0 1.0 0.0 0.0
median 230400.5 115200.5 230400.5 57600.5 230400.5 14400.5
7
geo_code
0 741.0
1228801 - 2457600 40.0
153601 - 307200 1695.0
19201 - 38400 747.0
1 - 4800 51.0
2500000 43.0
307201 - 614400 1272.0
38401 - 76800 862.0
4801 - 9600 92.0
614401- 1228800 229.0
76801 - 153600 1132.0
9601 - 19200 425.0
00 0.0
median 115200.5
If need new column Xi
:
df['Xi']=df.index.to_series().str.split('\s*-\s*', expand=True).astype(float).mean(axis=1)
def median_(val, freq):
ord = np.argsort(val)
cdf = np.cumsum(freq[ord])
return val[ord][np.searchsorted(cdf, cdf[-1] // 2)]
df.loc['median'] = df.apply(lambda col: median_(df['Xi'], col))
print (df)
1 2 3 4 5 6 \
geo_code
0 815.0 1026.0 735.0 1344.0 569.0 2688.0
1228801 - 2457600 305.0 104.0 74.0 177.0 84.0 10.0
153601 - 307200 2028.0 2330.0 2341.0 1720.0 1757.0 585.0
19201 - 38400 408.0 642.0 505.0 2002.0 377.0 2495.0
1 - 4800 28.0 38.0 31.0 288.0 54.0 553.0
2500000 129.0 67.0 81.0 85.0 69.0 10.0
307201 - 614400 2044.0 1903.0 1775.0 1611.0 1833.0 262.0
38401 - 76800 613.0 1202.0 944.0 1706.0 729.0 1499.0
4801 - 9600 52.0 56.0 60.0 328.0 43.0 848.0
614401- 1228800 1254.0 627.0 528.0 773.0 702.0 58.0
76801 - 153600 1305.0 1943.0 1741.0 1516.0 1264.0 771.0
9601 - 19200 167.0 401.0 237.0 1048.0 248.0 1762.0
00 2.0 1.0 0.0 1.0 0.0 0.0
median 230400.5 115200.5 230400.5 57600.5 230400.5 14400.5
7 Xi
geo_code
0 741.0 0.0
1228801 - 2457600 40.0 1843200.5
153601 - 307200 1695.0 230400.5
19201 - 38400 747.0 28800.5
1 - 4800 51.0 2400.5
2500000 43.0 2500000.0
307201 - 614400 1272.0 460800.5
38401 - 76800 862.0 57600.5
4801 - 9600 92.0 7200.5
614401- 1228800 229.0 921600.5
76801 - 153600 1132.0 115200.5
9601 - 19200 425.0 14400.5
00 0.0 0.0
median 115200.5 1843200.5