python, machine-learning, probability

How to check if sample has same probability distribution as population in Python?


I have a DataFrame with millions of rows. To create a model, I took a random sample from this dataset using dataset.sample(int(len(dataset)/5)), which returns a random sample of items from an axis of the object. Now I want to verify that the sample is still representative of the population, that is, that each feature (column) in the sample has the same probability distribution as in the whole dataset (population). I have numerical as well as categorical features. How can I check that the features have the same probability distribution in Python?


Solution

  • For continuous variables you can use the two-sample Kolmogorov-Smirnov (KS) test. Its null hypothesis is that both samples are drawn from the same continuous distribution, so a small p-value is evidence that the distributions differ.

    Usage in scipy:

    scipy.stats.ks_2samp(data1, data2, alternative='two-sided', mode='auto')
    

    https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html

    Alternatively, if you already know the reference distribution, you can use the one-sample KS test, which tests your data against a given distribution:

    https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest
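
To make the two calls above concrete, here is a minimal sketch of how they might be applied per column. The synthetic DataFrame, the column names "age" and "income", the helper name compare_numeric_columns and the 0.05 threshold are all assumptions for illustration, not part of the original question.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the real data; the column names are hypothetical.
rng = np.random.default_rng(0)
dataset = pd.DataFrame({
    "age": rng.normal(40, 10, size=50_000),
    "income": rng.lognormal(10, 1, size=50_000),
})

# Same sampling call as in the question (one fifth of the rows).
sample = dataset.sample(int(len(dataset) / 5), random_state=0)


def compare_numeric_columns(population, subset, alpha=0.05):
    """Run a two-sample KS test on every numeric column (hypothetical helper)."""
    rows = []
    for col in population.select_dtypes(include="number").columns:
        result = stats.ks_2samp(population[col].dropna(), subset[col].dropna())
        rows.append({
            "column": col,
            "statistic": result.statistic,  # max distance between the two ECDFs
            "pvalue": result.pvalue,
            "differs": result.pvalue < alpha,  # alpha is an assumed threshold
        })
    return pd.DataFrame(rows)


report = compare_numeric_columns(dataset, sample)
print(report)

# One-sample variant: test a column against a known distribution,
# here a normal with the parameters used to generate "age" above.
one_sample = stats.kstest(dataset["age"], "norm", args=(40, 10))
print(one_sample.statistic, one_sample.pvalue)
```

Two caveats apply. The sample here is a subset of the population, so the two inputs are not independent and the p-value should be read as a rough guide. With millions of rows, even negligible differences can produce very small p-values, so it is worth looking at the KS statistic itself as a measure of how far apart the distributions are.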