
How to calculate a probability distribution from an array with scipy


My actual goal is to calculate the difference between two histograms. For this I would like to use the Kullback–Leibler divergence. In the thread Calculating KL Divergence in Python it was said that SciPy's entropy function will calculate the KL divergence. For this I need a probability distribution for my datasets. I tried to follow the answers and instructions given in these two threads: How do I calculate PDF (probability density function) in Python? and How to effectively compute the pdf of a given dataset. Unfortunately, I always get an error.
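
If I understood that thread correctly, entropy returns the KL divergence when it is given two normalized distributions, for example (toy values purely for illustration):

import numpy as np
from scipy.stats import entropy

#Two made-up, normalized probability vectors (toy values)
p = np.array([0.1, 0.4, 0.5])
q = np.array([0.2, 0.3, 0.5])

#With two arguments, entropy returns the KL divergence D(p || q)
print(entropy(p, q))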

Here you can see my code, in which I subdivide the data into three parts (training, validation and test datasets) and aim to calculate the pairwise difference between the data distributions of these three sets.

import scipy
from scipy.stats import norm
from scipy.stats import rv_histogram
import numpy as np
import pandas as pd



#Reading the data
df = pd.read_csv("C:/Users/User1/Desktop/TestData_Temperatures.csv", sep=';')

#Build training, validation and test data set      
timeslot_x_train_end = int(len(df)* 0.7)
timeslot_x_valid_end = int(len(df)* 0.9)

data_histogram = df[['temperatures']].values
data_train_histogram = data_histogram[:timeslot_x_train_end]
data_valid_histogram = data_histogram[timeslot_x_train_end:timeslot_x_valid_end]
data_test_histogram = data_histogram[timeslot_x_valid_end:]

#Make histogram out of numpy array
histogram_train = rv_histogram(np.histogram(data_train_histogram, bins='auto'))
histogram_valid = rv_histogram(np.histogram(data_valid_histogram, bins='auto'))
histogram_test = rv_histogram(np.histogram(data_test_histogram, bins='auto'))

#Make probability distribution out of the histogram
pdfs_train = norm.cdf(histogram_train, histogram_train.mean(), histogram_train.std())
pdfs_valid = norm.cdf(histogram_valid, histogram_valid.mean(), histogram_valid.std())
pdfs_test = norm.cdf(histogram_test, histogram_valid.mean(), histogram_valid.std())

#Calculate the entropy between the different datasets
entropy_train_valid = scipy.special.rel_entr(pdfs_train, pdfs_valid)   
entropy_train_test = scipy.special.rel_entr(pdfs_train, pdfs_test) 
entropy_valid_test = scipy.special.rel_entr(pdfs_valid, pdfs_test) 

#Calculate the Kullback–Leibler divergence between the different datasets
kl_div_train_valid = np.sum(entropy_train_valid)
kl_div_train_test = np.sum(entropy_train_test)
kl_div_valid_test = np.sum(entropy_valid_test)

#Print the values of the Kullback–Leibler divergence
print(f"Kullback–Leibler divergence between training and validation dataset: {kl_div_train_valid}")
print(f"Kullback–Leibler divergence between training and test dataset: {kl_div_train_test}")
print(f"Kullback–Leibler divergence between validation and test dataset: {kl_div_valid_test}")

In this setup I get the error "TypeError: unsupported operand type(s) for -: 'rv_histogram' and 'float'", thrown by the line pdfs_train = norm.cdf(histogram_train, histogram_train.mean(), histogram_train.std()). Here you can see the test dataset: TestData. Do you have an idea why I get this error and how I can calculate the probability distribution from the arrays (and eventually the Kullback–Leibler divergence)?



Solution

  • A histogram of a sample can approximate the pdf of the distribution, under the assumption that the sample is large enough to capture the distribution of the population. So when you use histogram_train = rv_histogram(np.histogram(data_train_histogram, bins='auto')), it generates a distribution object given by a histogram. It has a .pdf method to evaluate the pdf and a .rvs method to generate values that follow this distribution (a short .rvs sketch follows the code below). So to calculate the Kullback–Leibler divergence between two distributions you can do the following:

    import numpy as np
    import pandas as pd
    import scipy.special
    from scipy.stats import rv_histogram
    
    #Reading the data
    df = pd.read_csv("C:/Users/User1/Desktop/TestData_Temperatures.csv", sep=';')
    
    #Build training and validation datasets
    timeslot_x_train_end = int(len(df) * 0.7)
    timeslot_x_valid_end = int(len(df) * 0.9)
    
    data = df[['temperatures']].values
    data_train = data[:timeslot_x_train_end]
    data_valid = data[timeslot_x_train_end:timeslot_x_valid_end]
    
    #Make distribution objects of the histograms
    histogram_dist_train = rv_histogram(np.histogram(data_train, bins='auto'))
    histogram_dist_valid = rv_histogram(np.histogram(data_valid, bins='auto'))
    
    #Evaluate both pdfs on one common grid: the KL divergence compares the
    #two densities at the same points, so the grids must not differ
    X = np.linspace(min(np.min(data_train), np.min(data_valid)),
                    max(np.max(data_train), np.max(data_valid)), 1000)
    pdf_train = histogram_dist_train.pdf(X)
    pdf_valid = histogram_dist_valid.pdf(X)
    
    #Sum the pointwise relative entropy and scale by the grid spacing to
    #approximate the KL integral. The result is inf wherever the training
    #pdf is positive but the validation pdf is zero, which correctly
    #signals that the supports differ.
    entropy_train_valid = scipy.special.rel_entr(pdf_train, pdf_valid)
    kl_div_train_valid = np.sum(entropy_train_valid) * (X[1] - X[0])
    
    #Print the value of the Kullback–Leibler divergence
    print(f"Kullback–Leibler divergence between training and validation dataset: {kl_div_train_valid}")
    

    On the other hand, if you assume that the data follow a normal distribution, then you can do the following:

    from scipy.stats import norm
    
    #Build training and validation datasets
    timeslot_x_train_end = int(len(df) * 0.7)
    timeslot_x_valid_end = int(len(df) * 0.9)
    
    data = df[['temperatures']].values
    data_train = data[:timeslot_x_train_end]
    data_valid = data[timeslot_x_train_end:timeslot_x_valid_end]
    
    #Make normal distribution objects from the data mean and standard deviation
    norm_dist_train = norm(data_train.mean(), data_train.std())
    norm_dist_valid = norm(data_valid.mean(), data_valid.std())
    
    #Evaluate both pdfs on one common grid, as above
    X = np.linspace(min(np.min(data_train), np.min(data_valid)),
                    max(np.max(data_train), np.max(data_valid)), 1000)
    pdf_train = norm_dist_train.pdf(X)
    pdf_valid = norm_dist_valid.pdf(X)
    
    #Sum the pointwise relative entropy and scale by the grid spacing to
    #approximate the KL integral
    entropy_train_valid = scipy.special.rel_entr(pdf_train, pdf_valid)
    kl_div_train_valid = np.sum(entropy_train_valid) * (X[1] - X[0])
    
    #Print the value of the Kullback–Leibler divergence
    print(f"Kullback–Leibler divergence between training and validation dataset: {kl_div_train_valid}")