Search code examples
pythonscipystatisticsprobabilitystatistical-test

Comparing datasets to nonstandard probability distributions in Python


I have a few large sets of data which I have used to create non-standard probability distributions (using numpy.histogram to bin the data, and scipy.interpolate's interp1d function to interpolate the resulting curves). I have also created a function which can sample from these custom PDFs using the scipy.stats package.

My goal is to see how varying the size of my samples changes the goodness of fit to both the distributions they came from, and the other PDFs as well, and determine how large a sample is necessary to completely determine whether it came from one or other of my custom PDFs.

To do this I've gathered that I need to use some sort of nonparametric statistical analysis, i.e. seeing whether a set of data has been drawn from a provided probability distribution. Doing a bit of research, it seems like the Anderson-Darling test is ideal for this, however its implementation in python (scipy.stats.anderson) seems to only be usable for preset probability distributions such as normal, exponential, etc.

So my question is: given my many nonstandard PDFs (or CDFs if necessary, or the data I used to create them) what is the best way to work out how well a set of sample data fits each model in Python? If it is the Anderson-Darling test, is there some way of defining a custom PDF to test against?

Thanks. Any help is much appreciated.


Solution

  • (1) "Is it from distribution X" is generally a question which can be answered a priori, if at all; a statistical test for it will only tell you "I have a large sample / not a large sample", which may be true but not too useful. If you are trying to classify new data into one distribution or another, my advice is to look at it as a classification problem and use your constructed pdf's to compute p(class | data) = p(data | class) p(class) / p(data) where the key part p(data | class) is your histogram. Maybe you can say more about your problem domain.

    (2) You could apply the Kolmogorov-Smirnov test, but it's really pointless, as mentioned above.