
SAS: Improve efficiency of a cross join


In my project I am combining three separate input sources to generate one score, using this formula:

Integrated score = weight_1 * Score_1 + weight_2 * Score_2 + weight_3 * Score_3

To do this, I have used the following code:

    DATA w_matrix_t;
     /*Create a row count to identify the model weight combination*/
         RETAIN model_combination;
         model_combination = 0;

         DO n_1 = 0 TO 100 BY 1;
               DO n_2 = 0 TO 100 BY 1;
                    IF (100 - n_1 - n_2) ge 0 AND (100 - n_1 - n_2) le 100 THEN DO;
                         n_3 = 100 - n_1 - n_2;
                         model_combination+1;
                         output;
                    END;
               END;
         END;
    RUN;

    DATA w_matrix;
        SET w_matrix_t;
        w_1 = n_1/100;
        w_2 = n_2/100;
        w_3 = n_3/100;
        /*Drop the old variables*/
        DROP n_1 n_2 n_3;
    RUN;

PROC SQL; 
    CREATE TABLE weights_added AS 
    SELECT
          w.model_combination
        , w.w_1
        , w.w_2
        , w.w_3
        , fit.name
        , fit.logsalary
        , (
            w.w_1*fit.crhits    + 
            w.w_2*fit.natbat    + 
            w.w_3*fit.nbb
        )                               AS y_hat_int
    FROM
        work.w_matrix                   AS w
    CROSS JOIN
        sashelp.baseball                AS fit
    ORDER BY
        model_combination;
QUIT;

My question is: is there a more efficient way of making this join? The purpose is to create a large table that contains the entire sashelp.baseball dataset duplicated for every combination of weights.

In my live data, I have three input sources of 46,000 observations each, and that cross join takes 1 hour. I also have three input sources of 465,000 observations each, which I imagine will take far longer.

The reason I do it this way is that I calculate my Somers' D using PROC FREQ with BY-group processing (BY model_combination).
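For scale, the number of weight rows and the resulting cross-join sizes follow directly from the loops above; a quick check (Python here, purely for the counting):

```python
# Count the (n1, n2) pairs in 0..100 with n1 + n2 <= 100 (n3 is determined),
# then size the cross join against each input source.
combos = sum(1 for n1 in range(101) for n2 in range(101) if n1 + n2 <= 100)
print(combos)                    # 5151 weight combinations

rows_small = combos * 46_000     # the 46,000-observation sources
rows_large = combos * 465_000    # the 465,000-observation sources
print(f"{rows_small:,}")         # 236,946,000
print(f"{rows_large:,}")         # 2,395,215,000
```

So the one-hour join is already producing roughly 237 million rows, and the larger sources would produce about 2.4 billion.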


Solution

  • 5000 copies of a 500,000 row table will be a pretty big table with 2.5B rows

    Here is an example of DATA step stacking: one copy of the have data set per row of weights. The step reads SET weights inside an explicit DO WHILE loop (the outer loop) and uses SET have with POINT= and OUTPUT inside an explicit inner loop. The inner loop copies the data while computing the weighted sum.

    data have;
      set sashelp.baseball (obs=200);  * keep it small for demonstration;
    run;
    
    data weights (keep=comboId w1 w2 w3);
      do i = 0 to 100; do j = 0 to 100; if (i+j) <= 100 then do;
        comboId + 1;
        w1 = i / 100;
        w2 = j / 100;
        w3 = (100 - i - j) / 100;
        output; 
      end; end; end;
    run;
    
    
    data want (keep=comboid w1-w3 name logsalary y_hat_int);
      do while (not endOfWeights);
        set weights end = endOfWeights;
        do row = 1 to RowsInHave;
          set have (keep=name logsalary crhits natbat nbb) nobs = RowsInHave point = row;
          y_hat_int = w1 * crhits + w2 * natbat + w3 * nbb;
          output;
        end;
      end;
      stop;
    run;
    
    proc freq data=want noprint;
      by comboId;
      table y_hat_int / out=freqout ;
      format y_hat_int 4.;
    run;
    
    proc contents data=want;
    run;
    

    Off the cuff, a single table containing 5,151 copies of a 200-row extract from baseball is nominally 72.7 MB, so expect 5,151 copies of a 465K-row table to have ~2.4 billion rows and occupy ~170 GB on disk. On a 7200 rpm disk achieving maximum throughput the whole time, you're looking at, at best, about 20 minutes just for the writing, and probably much more.
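    Those back-of-the-envelope figures can be checked by scaling the measured 72.7 MB up to the full-size table (Python here, purely for the arithmetic; the 150 MB/s sustained write rate is an assumed figure for a 7200 rpm drive, not a measurement):

    ```python
    # Scale the measured 72.7 MB for 5,151 copies of a 200-row extract
    # up to 5,151 copies of a 465,000-row table.
    combos = 5151                               # weight combinations (n1 + n2 <= 100)
    bytes_per_row = 72.7e6 / (combos * 200)     # ~70.6 bytes per output row

    rows = combos * 465_000                     # ~2.4 billion rows
    size_gb = rows * bytes_per_row / 1e9        # ~169 GB

    write_rate = 150e6                          # assumed sustained write rate, bytes/s
    minutes = rows * bytes_per_row / write_rate / 60
    print(f"{rows:,} rows, {size_gb:.0f} GB, {minutes:.0f} min to write")
    # 2,395,215,000 rows, 169 GB, 19 min to write
    ```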