python pandas numpy statistics probability

Calculate Mode, Median and Skewness of frequency dataframe

I have a dataframe like this:

Category   Frequency
1             30000
2             45000
3             32400
4             42200
5             56300
6             98200

How do I calculate the mean, median and skewness of the Categories?

I have tried the following:

df['cum_freq'] = [df["Category"]]*df["Frequnecy"]
mean = df['cum_freq'].mean()
median = df['cum_freq'].median()
skew = df['cum_freq'].skew()

Solution

If the total frequency is small enough to fit into memory, use repeat to generate the data and then you can easily call those methods.

s = df['Category'].repeat(df['Frequency']).reset_index(drop=True)

print(s.mean(), s.var(ddof=1), s.skew(), s.kurtosis())
# 4.13252219664584 3.045585008424625 -0.4512924988072343 -1.1923306818513022

Otherwise, you will need more complicated algebra to calculate the moments, which can be done with the k-statistics Some of the lower moments can be done with other libraries like numpy or statsmodels. But for things like skewness and kurtosis this is done manually from the sums of the de-meaned values (calculated from counts). Since these sums will overflow numpy, we need use normal python.

def moments_from_counts(vals, weights):
    """
    Returns tuple (mean, N-1 variance, skewness, kurtosis) from count data
    """
    vals = [float(x) for x in vals]
    weights = [float(x) for x in weights]

    n = sum(weights)
    mu = sum([x*y for x,y in zip(vals,weights)])/n
    S1 = sum([(x-mu)**1*y for x,y in zip(vals,weights)])
    S2 = sum([(x-mu)**2*y for x,y in zip(vals,weights)])
    S3 = sum([(x-mu)**3*y for x,y in zip(vals,weights)])
    S4 = sum([(x-mu)**4*y for x,y in zip(vals,weights)])

    k1 = S1/n
    k2 = (n*S2-S1**2)/(n*(n-1))
    k3 = (2*S1**3 - 3*n*S1*S2 + n**2*S3)/(n*(n-1)*(n-2))
    k4 = (-6*S1**4 + 12*n*S1**2*S2 - 3*n*(n-1)*S2**2 -4*n*(n+1)*S1*S3 + n**2*(n+1)*S4)/(n*(n-1)*(n-2)*(n-3))
    
    return mu, k2, k3/k2**1.5, k4/k2**2

moments_from_counts(df['Category'], df['Frequency'])
#(4.13252219664584, 3.045585008418879, -0.4512924988072345, -1.1923306818513018)

statsmodels has a nice class that takes care of lower moments, as well as the quantiles.

from statsmodels.stats.weightstats import DescrStatsW

d = DescrStatsW(df['Category'], weights=df['Frequency'])

d.mean
#4.13252219664584

d.var_ddof(1)
#3.045585008418879

the DescrStatsW class also gives you access to the underlying data as an array if you call d.asrepeats()