Search code examples
excelstatisticsaveragesurvey

How to calculate an average value from categorical data?


I have a survey data looks like this:

No. of Respondents Reported height
40 <165
77 166-170
218 171-175
236 176-180
327 181+

I want to calculate the average height of the respondents in Excel. What kind of formula should I apply?

Thanks in advance!


Solution

  • This is indeed a very subtle question. Let me first discuss my thoughts and solution before posting the code.

    • As expected, there will be quite some respondents with a reported height of more than 180cm (I hope I'm interpreting the question correctly and these are heights in cm). For about a third of the respondents we thus have very little information about their actual height. If we would assume a typical value for all these respondents, then this (somewhat arbitrary) assumption will have a large impact on the outcome. As suggested by @RobertDodier, it's probably better to fit a distribution to the data and let this distribution extrapolate into the tails.
    • The height distribution of a population is often considered to be well-approximated by a normal distribution (especially if different normal distributions are used for male and female heights). I'm not sure whether this additional information is available from your survey data. If yes, you might exploit this info. This info being absent I will consider a single normal distribution.

    It remains to find the normal distribution with a good fit to your interval-censored data. Given the normality assumption, I compute the maximum likelihood estimator as follows:

    1. For a given mean and standard deviation of the normal distribution, we can calculate the probability for each of the intervals. This leads to 5 probabilities.
    2. If the data is a random sample of respondents, then the data follows a categorical distribution and the log-likelihood can be computed. There is only one twist. The five probabilities are jointly specified by the two parameters of the normal distribution (mean and standard deviation).
    3. Let a numerical optimiser minimise the negative log-likelihood.

    The code:

    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import minimize
    
    # Input characteristics (as column vectors)
    #NumberOfRespondents = np.array([[40],[70],[218],[236],[327]])
    NumberOfRespondents = np.array([[40],[77],[218],[236],[327]])
    HeightCutOff = np.array([[165],[170],[175],[180]])
    
    # Number of intervals
    L = len(HeightCutOff)+1
    
    # Compute probabilities for given mu and sigma
    def NormalProbabilities(ParaVector):
        mu = ParaVector[0]
        sigma = ParaVector[1]
    
        # Compute interval probabilities
        ProbList = np.zeros((L, 1))
        for i in range(L):
            # Probability for (-inf,165)
            if i==0:
                ProbList[i] = norm.cdf(HeightCutOff[0], loc=mu, scale=sigma)
            # Probability for (180,inf)
            elif i==(L-1):
                ProbList[i] = 1-norm.cdf(HeightCutOff[-1], loc=mu, scale=sigma)
            # Other probabilities
            else:
                ProbList[i] = norm.cdf(HeightCutOff[i], loc=mu, scale=sigma)-norm.cdf(HeightCutOff[i-1], loc=mu, scale=sigma)
        # Output
        return ProbList
    
    # Compute loglikelihood for given mu and sigma
    def minusLogLikelihood(ParaVector):
        # Compute interval probabilities
        plist = NormalProbabilities(ParaVector)
    
        # Compute log-likelihood
        LogLik = np.sum(NumberOfRespondents*np.log(plist))
        return -LogLik
    
    # Optimization negative likelihood
    ParaStart = [170, 20];
    OptimizerBounds = [(-np.inf, np.inf), (0.01, np.inf)]
    OptimResult = minimize(minusLogLikelihood, ParaStart, method="Nelder-Mead", bounds=OptimizerBounds)
    
    # Report results
    print('\nMean estimate: ', OptimResult.x[0])
    print('Std estimate: ', OptimResult.x[1], '\n')
    
    print('Empirical probabilites: ', (NumberOfRespondents/np.sum(NumberOfRespondents)).T)
    print('Estimated probabilities: ', (NormalProbabilities(OptimResult.x)).T, '\n')
    
    print('True category counts: ', NumberOfRespondents.T)
    print('Estimated category counts: ', (NormalProbabilities(OptimResult.x)*np.sum(NumberOfRespondents)).T)
    

    These are the outputs:

    Mean estimate:  177.50905604361458
    Std estimate:  7.118910358800334 
    
    Empirical probabilites:  [[0.04454343 0.0857461  0.24276169 0.26280624 0.36414254]]
    Estimated probabilities:  [[0.03944537 0.10631209 0.21649315 0.27454448 0.36320491]] 
    
    True category counts:  [[ 40  77 218 236 327]]
    Estimated category counts:  [[ 35.42194412  95.46825299 194.41085156 246.54094083 326.1580105 ]]
    

    In conclusion, based on the normality assumption your average height would be 177.5cm. The fit of the data seems OK (but not amazing).

    PS1: I don't have access to Excel. I hope the explanations might help you to translate the code to Excel. PS2: There are differences of 1 between the upper bound of an interval and the lower bound of the next interval. As such, I'm not completely sure to which the interval the remaining values should be contributed (say values within 165 and 166). You might want to change the HeightCutfOff array if needed.