Search code examples
hpcc-ecl

Iterate using the dataset columns


I have a dataset with more than 30 columns and I would like to calculate the correlation between each of them (and preferably put the results in a correlation matrix). Is there any way to create a loop that runs the CORRELATION() function for all possible pairs of columns?

An example to make it clearer: Given a dataset of columns y, x1, x2 and x3, I would like to create the following table with the correlations between each one:

CorrelationMatrix


Solution

  • Here's an example of how to accomplish that:

    filelayout := RECORD
      UNSIGNED myid;
      REAL y;
      REAL x1;
      REAL x2;
      REAL x3;
    END;
    GFred := DATASET([ {1, 1.1, 2.2, 3.3, 4.4},
                       {2, 5, 6, 7, 8},
                       {3, 9, 10, 11, 12},
                       {4, 13.1, 14.2, 15.3, 16.4}],
                     filelayout);
                                 
    SetVals := [GFred.y,GFred.x1,GFred.x2,GFred.x3];
    OutRecCnt := COUNT(SetVals);
    OneRec := DATASET([{0}],{UNSIGNED1 h});
    filelayout XF(OneRec L, INTEGER C) := TRANSFORM
        SELF.myid := C;
        SELF.y := CORRELATION(GFred,SetVals[C],y);
        SELF.x1 := CORRELATION(GFred,SetVals[C],x1);
        SELF.x2 := CORRELATION(GFred,SetVals[C],x2);
        SELF.x3 := CORRELATION(GFred,SetVals[C],x3);
    END;
    P := NORMALIZE(OneRec,OutRecCnt,XF(LEFT,COUNTER));
    
    OUTPUT(P);