Tags: r, statistics, correlation, entropy

Is it possible to specify the correlation between two distributions?


For context, say two academic exams were conducted, one in the morning and one in the afternoon. I'm only given the summary statistics (mean, median, skew and kurtosis) for the scores on both exams, so I can't say exactly how many students passed, but I can estimate it by fitting the moments and creating a custom Pearson distribution. For example, I can estimate how many students passed the first and the second exam, along with a standard deviation to quantify my error.

What I would like to do is estimate the number of students who pass the course, defined as having an average score across the morning and afternoon exams of over 60%. If the students' performance on the two tests is completely independent, I suppose this would be easy: I generate scores for both tests as two lists, average them element-wise, count the number of averages over 60%, and repeat, say, 10000 times.
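The independent case described above can be sketched roughly as follows. This is written in plain Python purely for illustration, and everything specific in it is an assumption: the clipped-normal draws stand in for the fitted Pearson distributions, and the class size (200), the means and standard deviations, and the number of repeats are made-up placeholders.

```python
import random
import statistics


def clipped_normal(mean, sd):
    """Return a sampler for a placeholder marginal: normal scores clipped to 0..100.

    Assumption: this stands in for a fitted Pearson distribution.
    """
    def draw(rng):
        return min(100.0, max(0.0, rng.gauss(mean, sd)))
    return draw


def simulate_pass_counts(draw_morning, draw_afternoon, n_students, n_trials,
                         pass_mark=60.0, seed=0):
    """Repeat the experiment n_trials times with independently drawn exam scores.

    Returns one pass count per trial.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_trials):
        morning = [draw_morning(rng) for _ in range(n_students)]
        afternoon = [draw_afternoon(rng) for _ in range(n_students)]
        # A student passes when the average of the two scores exceeds the mark.
        passed = sum((m + a) / 2 > pass_mark for m, a in zip(morning, afternoon))
        counts.append(passed)
    return counts


# Hypothetical parameters, chosen only to make the sketch runnable.
counts = simulate_pass_counts(clipped_normal(62, 12), clipped_normal(58, 15),
                              n_students=200, n_trials=1000)
mean_count = statistics.mean(counts)
sd_count = statistics.stdev(counts)
print(f"estimated passes: {mean_count:.1f} +/- {sd_count:.1f}")
```

The spread of `counts` across trials is what gives the standard deviation of the estimate.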

If the two tests are completely dependent, I suppose I would have to sort both lists, because the student scoring highest on the morning exam should also score highest on the afternoon exam. What I'm missing is how to measure the degree of randomness/interdependence in between (maybe it has something to do with entropy?), where students who score highly on exam 1 also tend to score highly on exam 2, and whether there is a package in R that I can use to specify an arbitrary degree of entropy between two variables.
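A minimal sketch of the two extremes, again in plain Python for illustration only. Sorting both lists pairs the students rank for rank without changing either exam's score distribution, and Pearson's correlation coefficient is used here as one possible way to put a number on the dependence of a given pairing. The clipped-normal scores and the class size of 200 are made-up placeholders, not fitted distributions.

```python
import math
import random


def pass_count(morning, afternoon, pass_mark=60.0):
    """Count students whose average of the paired scores exceeds the pass mark."""
    return sum((m + a) / 2 > pass_mark for m, a in zip(morning, afternoon))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(1)
n_students = 200  # hypothetical class size

# Placeholder scores: normal draws clipped to the 0..100 range.
morning = [min(100.0, max(0.0, rng.gauss(62, 12))) for _ in range(n_students)]
afternoon = [min(100.0, max(0.0, rng.gauss(58, 15))) for _ in range(n_students)]

# Independent pairing: scores are matched in the order they were drawn.
passes_independent = pass_count(morning, afternoon)
r_independent = pearson(morning, afternoon)

# Fully dependent pairing: best morning score goes with best afternoon score.
passes_dependent = pass_count(sorted(morning), sorted(afternoon))
r_dependent = pearson(sorted(morning), sorted(afternoon))

print(f"independent: r = {r_independent:.2f}, passes = {passes_independent}")
print(f"sorted:      r = {r_dependent:.2f}, passes = {passes_dependent}")
```

The same scores appear in both pairings; only who gets which afternoon score changes, and that alone moves the pass count.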


Solution

  • A well-known way to measure relative entropy between two distributions is the KL divergence:

    In mathematical statistics, the Kullback–Leibler divergence (also called relative entropy) is a measure of how one probability distribution is different from a second, reference probability distribution.

    To make the measure symmetric, you can use the Jensen–Shannon divergence instead.

    For the implementation of KL divergence, you can use this package in R.
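For discrete distributions defined over the same set of bins, both divergences are short enough to compute directly. The following is a minimal sketch in plain Python, not the R package mentioned above; the probability vectors `p` and `q` are made-up examples, and the function names are chosen here for illustration.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions over the same bins.

    Terms with p_i == 0 contribute nothing; if q_i == 0 where p_i > 0,
    the divergence is infinite.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL divergence to the mixture M = (P + Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical example distributions over four bins (each sums to 1).
p = [0.1, 0.2, 0.4, 0.3]
q = [0.25, 0.25, 0.25, 0.25]

print("KL(P||Q):", kl_divergence(p, q))
print("KL(Q||P):", kl_divergence(q, p))  # differs from KL(P||Q): KL is not symmetric
print("JS(P,Q): ", js_divergence(p, q))  # same value with the arguments swapped
```

With natural logarithms the Jensen–Shannon divergence is bounded above by ln 2, which makes it easier to compare across pairs of distributions than the unbounded KL divergence.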