I have a survey data looks like this:
No. of Respondents | Reported height |
---|---|
40 | <165 |
77 | 166-170 |
218 | 171-175 |
236 | 176-180 |
327 | 181+ |
I want to calculate the average height of the respondents in Excel. What kind of formula should I apply?
Thanks in advance!
This is indeed a very subtle question. Let me first discuss my thoughts and solution before posting the code.
It remains to find the normal distribution with a good fit to your interval-censored data. Given the normality assumption, I compute the maximum likelihood estimator as follows:
The code:
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize
# Input characteristics (as column vectors)
#NumberOfRespondents = np.array([[40],[70],[218],[236],[327]])
NumberOfRespondents = np.array([[40],[77],[218],[236],[327]])
HeightCutOff = np.array([[165],[170],[175],[180]])
# Number of intervals
L = len(HeightCutOff)+1
# Compute probabilities for given mu and sigma
def NormalProbabilities(ParaVector):
mu = ParaVector[0]
sigma = ParaVector[1]
# Compute interval probabilities
ProbList = np.zeros((L, 1))
for i in range(L):
# Probability for (-inf,165)
if i==0:
ProbList[i] = norm.cdf(HeightCutOff[0], loc=mu, scale=sigma)
# Probability for (180,inf)
elif i==(L-1):
ProbList[i] = 1-norm.cdf(HeightCutOff[-1], loc=mu, scale=sigma)
# Other probabilities
else:
ProbList[i] = norm.cdf(HeightCutOff[i], loc=mu, scale=sigma)-norm.cdf(HeightCutOff[i-1], loc=mu, scale=sigma)
# Output
return ProbList
# Compute loglikelihood for given mu and sigma
def minusLogLikelihood(ParaVector):
# Compute interval probabilities
plist = NormalProbabilities(ParaVector)
# Compute log-likelihood
LogLik = np.sum(NumberOfRespondents*np.log(plist))
return -LogLik
# Optimization negative likelihood
ParaStart = [170, 20];
OptimizerBounds = [(-np.inf, np.inf), (0.01, np.inf)]
OptimResult = minimize(minusLogLikelihood, ParaStart, method="Nelder-Mead", bounds=OptimizerBounds)
# Report results
print('\nMean estimate: ', OptimResult.x[0])
print('Std estimate: ', OptimResult.x[1], '\n')
print('Empirical probabilites: ', (NumberOfRespondents/np.sum(NumberOfRespondents)).T)
print('Estimated probabilities: ', (NormalProbabilities(OptimResult.x)).T, '\n')
print('True category counts: ', NumberOfRespondents.T)
print('Estimated category counts: ', (NormalProbabilities(OptimResult.x)*np.sum(NumberOfRespondents)).T)
These are the outputs:
Mean estimate: 177.50905604361458
Std estimate: 7.118910358800334
Empirical probabilites: [[0.04454343 0.0857461 0.24276169 0.26280624 0.36414254]]
Estimated probabilities: [[0.03944537 0.10631209 0.21649315 0.27454448 0.36320491]]
True category counts: [[ 40 77 218 236 327]]
Estimated category counts: [[ 35.42194412 95.46825299 194.41085156 246.54094083 326.1580105 ]]
In conclusion, based on the normality assumption your average height would be 177.5cm. The fit of the data seems OK (but not amazing).
PS1: I don't have access to Excel. I hope the explanations might help you to translate the code to Excel. PS2: There are differences of 1 between the upper bound of an interval and the lower bound of the next interval. As such, I'm not completely sure to which the interval the remaining values should be contributed (say values within 165 and 166). You might want to change the HeightCutfOff array if needed.