I am trying to come up with a very fast algorithm for calculating this very interesting statistic that takes full advantage of the capabilities of a powerful GPU. Ideally I will do this in Matlab using Jacket, but other ideas in CUDA code or OpenCL would also be much appreciated. Basically, I want to crowd-source lots of clever ideas that I can try to put together, and then I will attempt to open-source the result so others can use it.
Despite the power of this dependency coefficient (it is capable of detecting even "one-to-many" dependency relationships), there is almost nothing about it online, except for two sources: SAS statistical software, and the excellent R package Hmisc by Frank Harrell. You can read a description of the algorithm here:
And here is Harrell's code in Fortran (which is surprisingly easy to follow if you understand the calculation already):
http://hmisc.sourcearchive.com/documentation/3.7-0/hoeffd_8f-source.html
(also, page 128 of the pdf documentation for Hmisc has some more details.)
This is a very computationally demanding algorithm-- if you wanted to apply it to a data set that is thousands of rows and a few thousand columns, even with the fast Fortran implementation, you would be waiting many days for the result-- even on a new machine. My hope is that using an Nvidia GTX 580 level card, or better yet, a Tesla, would bring that down to a couple hours. In which case, the combination would be an analytical force to be reckoned with, whether identifying genes or finding correlations in experimental data.
Anyway, I look forward to peoples' responses and hope we can make a fast, GPU based algorithm for Hoeffding's D a reality.
Thanks in advance for any ideas-- and please don't hesitate to give partial or half-baked ideas!
Update: So, Jascha has generously provided a working implementation of Hoeffding's D in Matlab, which fulfilled one of my objectives. The other was to radically enhance the speed by using a GPU, preferably using Jacket. Does anyone see a path or strategy to do this in a clever way on the GPU?
Below is a code example with a simple implementation of the Hoeffding's D measure of dependency in MATLAB. This is NOT GPU-ized, but may prove useful to people who want to calculate this statistic and are not using Fortran, or as a starting point for putting it on the GPU. (The extended header illustrates Hoeffding's D values for several types of joint distributions.)
function [ D ] = hoeffdingsD( x, y )
%Compute's Hoeffding's D measure of dependence between x and y
% inputs x and y are both N x 1 arrays
% output D is a scalar
% The formula for Hoeffding's D is taken from
% http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_corr_sect016.htm
% Below is demonstration code for several types of dependencies.
%
% % this case should be 0 - there are no dependencies
% x = randn(1000,1);
% y = randn(1000,1);
% D = hoeffdingsD( x, y );
% desc = 'x = N( 0, 1 ), y and x independent';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );
%
% % the rest of these cases have dependencies of different types
% x = randn(1000,1);
% y = x;
% D = hoeffdingsD( x, y );
% desc = 'x = N( 0, 1 ), y = x';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );
%
% x = randn(1000,1);
% y = cos(10*x);
% D = hoeffdingsD( x, y );
% desc = 'x = N( 0, 1 ), y = cos(10x)';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );
%
% x = randn(1000,1);
% y = x.^2;
% D = hoeffdingsD( x, y );
% desc = 'x = N( 0, 1 ), y = x^2';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );
%
% x = randn(1000,1);
% y = x.^2 + randn(1000,1);
% D = hoeffdingsD( x, y );
% desc = 'x = N( 0, 1 ), y = x^2 + N( 0, 1 )';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );
%
% x = rand(1000,1);
% y = rand(1000,1);
% z = sign(randn(1000,1));
% x = z.*x; y = z.*y;
% D = hoeffdingsD( x, y );
% desc = 'x = z U( [0, 1) ), y = z U( [0, 1) ), z = U( {-1,1} )';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );
%
% x = rand(1000,1);
% y = rand(1000,1);
% z = sign(randn(1000,1));
% x = z.*x; y = -z.*y;
% D = hoeffdingsD( x, y );
% desc = 'x = z U( [0, 1) ), y = -z U( [0, 1) ), z = U( {-1,1} )';
% desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
% fprintf( '%s\n', desc );
% figure; plot(x,y,'.'); title( desc );
N = size(x,1);
R = tiedrank( x );
S = tiedrank( y );
Q = zeros(N,1);
parfor i = 1:N
Q(i) = 1 + sum( R < R(i) & S < S(i) );
% and deal with cases where one or both values are ties, which contribute less
Q(i) = Q(i) + 1/4 * (sum( R == R(i) & S == S(i) ) - 1); % both indices tie. -1 because we know point i matches
Q(i) = Q(i) + 1/2 * sum( R == R(i) & S < S(i) ); % one index ties.
Q(i) = Q(i) + 1/2 * sum( R < R(i) & S == S(i) ); % one index ties.
end
D1 = sum( (Q-1).*(Q-2) );
D2 = sum( (R-1).*(R-2).*(S-1).*(S-2) );
D3 = sum( (R-2).*(S-2).*(Q-1) );
D = 30*((N-2)*(N-3)*D1 + D2 - 2*(N-2)*D3) / (N*(N-1)*(N-2)*(N-3)*(N-4));
end