My actual goal is to calculate the difference between two histograms. For this I would like to use the Kullback–Leibler divergence. In the thread Calculating KL Divergence in Python it was said that SciPy's entropy function will calculate the KL divergence. For this I need a probability distribution for my datasets. I tried to follow the answers and instructions given in these two threads: How do I calculate PDF (probability density function) in Python? and How to effectively compute the pdf of a given dataset. Unfortunately I always get an error.
Here you can see my code, in which I subdivide the data into three parts (training, validation and test datasets) and aim to calculate the pairwise difference between the data distributions of those three sets.
import scipy
from scipy.stats import norm
from scipy.stats import rv_histogram
import numpy as np
import pandas as pd
#Reading the data
df = pd.read_csv("C:/Users/User1/Desktop/TestData_Temperatures.csv", sep=';')
#Build training, validation and test data set
timeslot_x_train_end = int(len(df)* 0.7)
timeslot_x_valid_end = int(len(df)* 0.9)
data_histogram = df[['temperatures']].values
data_train_histogram = data_histogram[:timeslot_x_train_end]
data_valid_histogram = data_histogram[timeslot_x_train_end:timeslot_x_valid_end]
data_test_histogram = data_histogram[timeslot_x_valid_end:]
#Make histogram out of numpy array
histogram_train = rv_histogram(np.histogram(data_train_histogram, bins='auto'))
histogram_valid = rv_histogram(np.histogram(data_valid_histogram, bins='auto'))
histogram_test = rv_histogram(np.histogram(data_test_histogram, bins='auto'))
#Make probability distribution out of the histogram
pdfs_train = norm.cdf(histogram_train, histogram_train.mean(), histogram_train.std())
pdfs_valid = norm.cdf(histogram_valid, histogram_valid.mean(), histogram_valid.std())
pdfs_test = norm.cdf(histogram_test, histogram_valid.mean(), histogram_valid.std())
#Calculate the entropy between the different datasets
entropy_train_valid = scipy.special.rel_entr(pdfs_train, pdfs_valid)
entropy_train_test = scipy.special.rel_entr(pdfs_train, pdfs_test)
entropy_valid_test = scipy.special.rel_entr(pdfs_valid, pdfs_test)
#Calculate the Kullback–Leibler divergence between the different datasets
kl_div_train_valid = np.sum(entropy_train_valid)
kl_div_train_test = np.sum(entropy_train_test)
kl_div_valid_test = np.sum(entropy_valid_test)
#Print the values of the Kullback–Leibler divergence
print(f"Kullback–Leibler divergence between training and validation dataset: {kl_div_train_valid}")
print(f"Kullback–Leibler divergence between training and test dataset: {kl_div_train_test}")
print(f"Kullback–Leibler divergence between validation and test dataset: {kl_div_valid_test}")
In this setup I get the error "TypeError: unsupported operand type(s) for -: 'rv_histogram' and 'float'", thrown by the line pdfs_train = norm.cdf(histogram_train, histogram_train.mean(), histogram_train.std()). Here you can see the test dataset: TestData. Do you have an idea why I get this error and how I can calculate the probability distribution from the arrays (and eventually the Kullback–Leibler divergence)?
Any comments or hints are appreciated.
A histogram of a sample can approximate the pdf of the underlying distribution, under the assumption that the sample is large enough to capture the distribution of the population. So when you use histogram_train = rv_histogram(np.histogram(data_train_histogram, bins='auto')), it generates a distribution object backed by that histogram. The object has a .pdf method to evaluate the density and a .rvs method to draw values that follow this distribution. Your TypeError comes from passing the rv_histogram object itself to norm.cdf, which expects numeric values, not a distribution object. To calculate the Kullback–Leibler divergence between two distributions you can proceed as follows.
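First, a minimal self-contained sketch of those two methods; the sample below is synthetic and just stands in for the temperature data, so your CSV is not needed here:

```python
import numpy as np
from scipy.stats import rv_histogram

# Synthetic sample standing in for the temperature data
rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=1000)

# Build a frozen distribution object from the histogram
dist = rv_histogram(np.histogram(sample, bins='auto'))

# .pdf evaluates the piecewise-constant density at given points
print(dist.pdf(10.0))
# .rvs draws new values that follow the histogram distribution
draws = dist.rvs(size=5, random_state=rng)
print(draws)
```

Applied to your datasets, the full calculation looks like this: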
import numpy as np
import pandas as pd
import scipy.special
from scipy.stats import rv_histogram

#Reading the data
df = pd.read_csv("C:/Users/User1/Desktop/TestData_Temperatures.csv", sep=';')
#Build training and validation datasets
timeslot_x_train_end = int(len(df) * 0.7)
timeslot_x_valid_end = int(len(df) * 0.9)
data = df[['temperatures']].values
data_train = data[:timeslot_x_train_end]
data_valid = data[timeslot_x_train_end:timeslot_x_valid_end]
#Make distribution objects of the histograms
histogram_dist_train = rv_histogram(np.histogram(data_train, bins='auto'))
histogram_dist_valid = rv_histogram(np.histogram(data_valid, bins='auto'))
#Evaluate both pdfs on the SAME grid; the KL divergence compares the two
#densities point by point, so they must be evaluated at identical points
X = np.linspace(min(data_train.min(), data_valid.min()),
                max(data_train.max(), data_valid.max()), 1000)
rvs_train = histogram_dist_train.pdf(X)
rvs_valid = histogram_dist_valid.pdf(X)
#Approximate the KL integral by a Riemann sum; rel_entr is the integrand
#p*log(p/q), and the result is inf wherever the training density is
#positive but the validation density is zero
entropy_train_valid = scipy.special.rel_entr(rvs_train, rvs_valid)
kl_div_train_valid = np.sum(entropy_train_valid) * (X[1] - X[0])
#Print the value of the Kullback–Leibler divergence
print(f"Kullback–Leibler divergence between training and validation dataset: {kl_div_train_valid}")
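Alternatively, the scipy.stats.entropy function mentioned in the thread you linked can work on the raw bin counts directly: entropy(pk, qk) normalizes both arguments to sum to 1 and returns KL(p || q). A minimal sketch with synthetic samples (the numbers are made up; the key point is that both samples must be binned with the same edges so the counts are comparable):

```python
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(1)
# Two synthetic samples standing in for the train/validation splits
sample_p = rng.normal(10.0, 2.0, size=700)
sample_q = rng.normal(10.5, 2.0, size=200)

# Bin both samples with the SAME edges so the probabilities are comparable
edges = np.histogram_bin_edges(np.concatenate([sample_p, sample_q]), bins=30)
p_counts, _ = np.histogram(sample_p, bins=edges)
q_counts, _ = np.histogram(sample_q, bins=edges)

# entropy(pk, qk) normalizes each argument and returns KL(p || q);
# a tiny constant keeps empty bins from producing an infinite divergence
kl = entropy(p_counts + 1e-12, q_counts + 1e-12)
print(kl)
```

The smoothing constant is a pragmatic choice; without it, any bin that is empty for q but populated for p makes the divergence infinite, which is the mathematically correct but usually unhelpful answer.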
On the other hand, if you assume that the data follow a normal distribution, you can instead do the following:
#Build training and validation datasets
timeslot_x_train_end = int(len(df) * 0.7)
timeslot_x_valid_end = int(len(df) * 0.9)
data = df[['temperatures']].values
data_train = data[:timeslot_x_train_end]
data_valid = data[timeslot_x_train_end:timeslot_x_valid_end]
#Make normal distribution objects from the data mean and standard deviation
norm_dist_train = norm(data_train.mean(), data_train.std())
norm_dist_valid = norm(data_valid.mean(), data_valid.std())
#Evaluate both pdfs on the same grid; the KL divergence compares the two
#densities at identical points
X = np.linspace(min(data_train.min(), data_valid.min()),
                max(data_train.max(), data_valid.max()), 1000)
rvs_train = norm_dist_train.pdf(X)
rvs_valid = norm_dist_valid.pdf(X)
#Approximate the KL integral by a Riemann sum over the grid
entropy_train_valid = scipy.special.rel_entr(rvs_train, rvs_valid)
kl_div_train_valid = np.sum(entropy_train_valid) * (X[1] - X[0])
#Print the value of the Kullback–Leibler divergence
print(f"Kullback–Leibler divergence between training and validation dataset: {kl_div_train_valid}")
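Under that normality assumption there is also a closed form you can check the numeric estimate against: KL(N(mu1, s1^2) || N(mu2, s2^2)) = ln(s2/s1) + (s1^2 + (mu1 - mu2)^2) / (2*s2^2) - 1/2. A sketch with hypothetical moments (replace them with data_train.mean(), data_train.std(), etc.):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import rel_entr

# Hypothetical moments standing in for the train/validation statistics
mu1, sigma1 = 10.0, 2.0
mu2, sigma2 = 11.0, 2.5

# Closed form: KL(N(mu1, s1^2) || N(mu2, s2^2))
kl_exact = (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
            - 0.5)

# Numeric approximation: Riemann sum of rel_entr over a wide common grid
X = np.linspace(mu1 - 10 * sigma1, mu1 + 10 * sigma1, 100_000)
dx = X[1] - X[0]
p = norm(mu1, sigma1).pdf(X)
q = norm(mu2, sigma2).pdf(X)
kl_numeric = np.sum(rel_entr(p, q)) * dx

print(kl_exact, kl_numeric)  # the two should agree closely
```

This is a useful sanity check that the grid is wide and fine enough before you trust the grid-based numbers on real data.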