Search code examples
matlabstatisticscudagpucorrelation

Ideas for GPU implementation of Hoeffding's "D" (Dependence) coefficient?


I am trying to come up with a very fast algorithm for calculating this very interesting statistic that takes full advantage of the capabilities of a powerful GPU. Ideally I will do this in Matlab using Jacket, but other ideas in CUDA code or OpenCL would also be much appreciated. Basically, I want to crowd-source lots of clever ideas that I can try to put together, and then I will attempt to open-source the result so others can use it.

Despite the power of this dependency coefficient (it is capable of detecting even "one-to-many" dependency relationships), there is almost nothing about it online, except for two sources: SAS statistical software, and the excellent R package Hmisc by Frank Harrell. You can read a description of the algorithm here:

http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_corr_sect016.htm

And here is Harrell's code in Fortran (which is surprisingly easy to follow if you understand the calculation already):

http://hmisc.sourcearchive.com/documentation/3.7-0/hoeffd_8f-source.html

(also, page 128 of the pdf documentation for Hmisc has some more details.)

This is a very computationally demanding algorithm-- if you wanted to apply it to a data set that is thousands of rows and a few thousand columns, even with the fast Fortran implementation, you would be waiting many days for the result-- even on a new machine. My hope is that using an Nvidia GTX 580 level card, or better yet, a Tesla, would bring that down to a couple hours. In which case, the combination would be an analytical force to be reckoned with, whether identifying genes or finding correlations in experimental data.

Anyway, I look forward to peoples' responses and hope we can make a fast, GPU based algorithm for Hoeffding's D a reality.

Thanks in advance for any ideas-- and please don't hesitate to give partial or half-baked ideas!

Update: So, Jascha has generously provided a working implementation of Hoeffding's D in Matlab, which fulfilled one of my objectives. The other was to radically enhance the speed by using a GPU, preferably using Jacket. Does anyone see a path or strategy to do this in a clever way on the GPU?


Solution

  • Below is a code example with a simple implementation of the Hoeffding's D measure of dependency in MATLAB. This is NOT GPU-ized, but may prove useful to people who want to calculate this statistic and are not using Fortran, or as a starting point for putting it on the GPU. (The extended header illustrates Hoeffding's D values for several types of joint distributions.)

    function [ D ] = hoeffdingsD( x, y )
    %Compute's Hoeffding's D measure of dependence between x and y
    %   inputs x and y are both N x 1 arrays
    %   output D is a scalar
    %     The formula for Hoeffding's D is taken from
    %     http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_corr_sect016.htm
    % Below is demonstration code for several types of dependencies.
    %
    % % this case should be 0 - there are no dependencies
    % x = randn(1000,1);
    % y = randn(1000,1);
    % D = hoeffdingsD( x, y );
    % desc = 'x = N( 0, 1 ), y and x independent';
    % desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
    % fprintf( '%s\n', desc );
    % figure; plot(x,y,'.'); title( desc );
    % 
    % % the rest of these cases have dependencies of different types
    % x = randn(1000,1);
    % y = x;
    % D = hoeffdingsD( x, y );
    % desc = 'x = N( 0, 1 ), y = x';
    % desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
    % fprintf( '%s\n', desc );
    % figure; plot(x,y,'.'); title( desc );
    % 
    % x = randn(1000,1);
    % y = cos(10*x);
    % D = hoeffdingsD( x, y );
    % desc = 'x = N( 0, 1 ), y = cos(10x)';
    % desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
    % fprintf( '%s\n', desc );
    % figure; plot(x,y,'.'); title( desc );
    % 
    % x = randn(1000,1);
    % y = x.^2;
    % D = hoeffdingsD( x, y );
    % desc = 'x = N( 0, 1 ), y = x^2';
    % desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
    % fprintf( '%s\n', desc );
    % figure; plot(x,y,'.'); title( desc );
    % 
    % x = randn(1000,1);
    % y = x.^2 + randn(1000,1);
    % D = hoeffdingsD( x, y );
    % desc = 'x = N( 0, 1 ), y = x^2 + N( 0, 1 )';
    % desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
    % fprintf( '%s\n', desc );
    % figure; plot(x,y,'.'); title( desc );
    % 
    % x = rand(1000,1);
    % y = rand(1000,1);
    % z = sign(randn(1000,1));
    % x = z.*x; y = z.*y;
    % D = hoeffdingsD( x, y );
    % desc = 'x = z U( [0, 1) ), y = z U( [0, 1) ), z = U( {-1,1} )';
    % desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
    % fprintf( '%s\n', desc );
    % figure; plot(x,y,'.'); title( desc );
    % 
    % x = rand(1000,1);
    % y = rand(1000,1);
    % z = sign(randn(1000,1));
    % x = z.*x; y = -z.*y;
    % D = hoeffdingsD( x, y );
    % desc = 'x = z U( [0, 1) ), y = -z U( [0, 1) ), z = U( {-1,1} )';
    % desc = sprintf( '%s, Hoeffding''s D = %f', desc, D );
    % fprintf( '%s\n', desc );
    % figure; plot(x,y,'.'); title( desc );
    
    N = size(x,1);
    
    R = tiedrank( x );
    S = tiedrank( y );
    
    Q = zeros(N,1);
    parfor i = 1:N
        Q(i) = 1 + sum( R < R(i) & S < S(i) );
        % and deal with cases where one or both values are ties, which contribute less
        Q(i) = Q(i) + 1/4 * (sum( R == R(i) & S == S(i) ) - 1); % both indices tie.  -1 because we know point i matches
        Q(i) = Q(i) + 1/2 * sum( R == R(i) & S < S(i) ); % one index ties.
        Q(i) = Q(i) + 1/2 * sum( R < R(i) & S == S(i) ); % one index ties.
    end
    
    D1 = sum( (Q-1).*(Q-2) );
    D2 = sum( (R-1).*(R-2).*(S-1).*(S-2) );
    D3 = sum( (R-2).*(S-2).*(Q-1) );
    
    D = 30*((N-2)*(N-3)*D1 + D2 - 2*(N-2)*D3) / (N*(N-1)*(N-2)*(N-3)*(N-4));
    
    end