I'm working with genetic data in which each allele was observed n times among t sequenced chromosomes. In other words, n successes in t trials.
I want to include an estimate of each allele's frequency as a feature in a machine learning algorithm. I can of course get a point estimate with n/t, but I want to represent the confidence of that point estimate -- i.e. something about the likelihood of that estimate.
Now, I believe the negative binomial (or just binomial) distribution would be the right one to use, but
Thanks!
I think everything you need can be calculated with standard statistical methods, without applying machine learning.
The MLE of the parameter p of your binomial distribution Bin(t, p) is just p̂ = n/t, as you correctly suggested. If you want a confidence interval instead of a point estimate, one way to get it is the Wald method:
$$\hat{p} \pm z \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{t}}$$
where z is the 1 - 0.5α quantile of the standard normal distribution. Other intervals, suited to different modelling assumptions, are described under the following link: Binomial confidence intervals.
A 95% CI for p can be calculated as indicated above with z = 1.96.
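To make this concrete, here is a minimal Python sketch of the Wald interval, p̂ ± z·sqrt(p̂(1 - p̂)/t), using only the standard library. The function name `wald_interval` and the clipping of the bounds to [0, 1] are my own choices for illustration, not part of the method itself.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(n, t, alpha=0.05):
    """Wald confidence interval for a binomial proportion.

    n: number of times the allele was observed (successes)
    t: number of chromosomes sequenced (trials)
    alpha: 1 - confidence level (0.05 gives a 95% interval)
    """
    if t <= 0 or not 0 <= n <= t:
        raise ValueError("need t > 0 and 0 <= n <= t")
    p_hat = n / t  # MLE of p
    # z is the (1 - alpha/2) quantile of the standard normal (about 1.96 for 95%)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / t)
    # Assumption: clip to [0, 1], since the raw Wald bounds can fall outside it
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


low, high = wald_interval(30, 100)
print(f"95% Wald CI: ({low:.4f}, {high:.4f})")
```

Note that when n = 0 or n = t the estimated standard error is zero and the interval collapses to a single point. For rare alleles, one of the alternative intervals from the link above may therefore be a better fit.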
As for feature engineering for the machine learning algorithm: since your parametric distribution depends on only one estimated parameter p (t is given), you can use p̂ directly as a feature that uniquely represents the distribution. You can of course also add the CI or the variance as additional features. Everything depends on what exactly you are going to learn and what your final objective/criterion is.
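As a rough sketch of what such a feature set could look like, the snippet below builds one row per allele containing p̂, the estimated variance p̂(1 - p̂)/t, and the Wald bounds. The function name `allele_features` and the dictionary keys are hypothetical, and whether these extra columns actually help depends on your model.

```python
from math import sqrt
from statistics import NormalDist


def allele_features(n, t, alpha=0.05):
    """Return a dict of frequency-related features for one allele.

    n: observed count of the allele, t: chromosomes sequenced.
    Key names are illustrative only.
    """
    p_hat = n / t                        # point estimate (MLE)
    variance = p_hat * (1 - p_hat) / t   # estimated variance of p_hat
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(variance)      # Wald half-width
    return {
        "p_hat": p_hat,
        "variance": variance,
        "ci_low": max(0.0, p_hat - half_width),   # clipped to [0, 1]
        "ci_high": min(1.0, p_hat + half_width),
        "t": t,  # sample size, itself a measure of how reliable p_hat is
    }


# Same point estimate, very different confidence
print(allele_features(5, 50))
print(allele_features(500, 5000))
```

The two example calls give the same p̂ but a much narrower interval for the larger sample, which is the kind of information the point estimate alone cannot carry.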