Search code examples
pythonscientific-computing

Mean of data with different propabilities


I'm currently facing the following problem: During an experiment I gathered the counts of events per channel (detector). Plotting the counts against the channels gives almost a normal distribution. I'd now like to calculate the mean of this data set. The problem is that not all of the events that generated the data occur with the same probability, but the probability for each channel is known.

To me this situation seems very similar to wanting to calculate the mean of a histogram, therefore I would take the middle value of the channels, multiply it by the corresponding value of the channel, sum all the values up and then divide by the total number of channels.

My implementation for this is:

import numpy as np
import matplotlib.pyplot as plt
counts = ... # see at the end of the post for the data set in question
channels = np.arange(1,len(counts)+1)
channel_probability = .... # probability for different parts of channels

mean = sum((channels+1)/2 * counts)/len(counts)

plt.figure()
plt.plot(counts, channels)
plt.stem([mean], [100])
plt.xlabel("channels")
plt.ylabel("counts")
plt.show()

The problem is that this assumes the same probability for all the events... Therefore I tried the naive approach of just multiplying the probability as well

mean = sum((channels+1)/2 * counts * channels_probability)/len(counts)

But this of course only led to completely unreasonable results... So, can someone maybe explain how I would find the mean of such a distribution and how to calculate it?


As mentioned above, here is the something similar to the data set I'm using:

counts = np.array([2.05209753  2.07860064  2.06269877  2.0706497   2.07595033  2.03619567
  2.03619567  2.06269877  2.02029381  2.00439194  2.01499318  1.9937907
  1.98583977  1.99909132  1.99909132  2.00439194  1.98583977  1.98849008
  1.99644101  2.01499318  2.00439194  2.0176435   2.02824474  1.99909132
  2.00174163  2.03354536  2.05474784  2.05474784  2.04944722  2.11305467
  2.07330002  2.13955778  2.18461305  2.19256399  2.21906709  2.25617144
  2.23496895  2.25617144  2.31182796  2.32772982  2.36483417  2.3992882
  2.42844162  2.49734969  2.56890807  2.56095714  2.59541118  2.59541118
  2.63516583  2.68817204  2.6272149   2.66961987  2.6272149   2.66961987
  2.60336211  2.62191428  2.56890807  2.5503559   2.53975466  2.52385279
  2.45229441  2.42844162  2.39133727  2.29592609  2.27737392  2.26147206
  2.21906709  2.14220809  2.17666212  2.09185219  2.03619567  2.02824474
  2.05209753  2.00439194  1.97788884  1.97788884  1.9672876   1.96463729
  1.96993791  1.95403604  1.94608511  1.9434348   1.9434348   1.93548387
  1.93813418  1.9434348   1.94078449  1.93813418  1.94078449  1.9434348])

Solution

  • My assumptions:

    • You have the counts of events for each channel.
    • You know the probability of each channel.

    Suppose you have a nice dice with nine sides, each side has a number.

    numbs = [10, 24, 26, 8, 17, 6, 9, 15, 20]

    Each number has the same probability: 1/9. You may ask, what is the expected value of the dice? Well, with Python it is easy.

    prob_li = []
    for l, prob in zip(numbs, [1/9] * 9):
        prob_li.append(l * prob)
    
    print(sum(prob_li)) # 15
    

    If the probability of each side changes, say something like

    probs = [1/9, 1/9, 1/9, 1/9, 1/9, 1/10, 1/20, 1/20, 11/45]

    the expected value is

    prob_li = [] 
        for l, prob in zip(numbs, probs): 
        prob_li.append(l * prob) 
    
    print(sum(prob_li)) # 16.13
    

    Now suppose that you construct a matrix, and each column has a probability probs[i]

    np.random.seed(4)
    mat = np.random.randint(6, 20, size=(3, 9))
    

    mat is a matrix with a shape (3,9). I would find the expected value as

    result = mat * probs
    print(sum(mat.mean(axis=0) * probs)) #12.82
    print(sum(result.sum(axis=0) * probs)) #38.46
    

    For me, 12.82 has more sense than 38.46. Besides, you said that plotting the counts against the channels gives almost a normal distribution, you would only need to find the mean of each channel and then the expected value.