Search code examples
sqldata-manipulationcalculated-columnsvariance

SQL: get variance for multiple columns without using PIVOT


I have a datatable (dt) like the following in SQL:

ID        state_id     act  rd_1 rd_2    rd_3   rd_4  rd_5
11        abc,13.3      1    1.   31     17.4   32.4  0.4
11        afd,23.2      4    1.   42.1   1.3    31.9  0.39
11        dfw,154       7    0.   0.3    4.3    8.21  163.3
12        vab,64.5      8    1.   32.3   11     2.1   21.3
12        avf,542       2    0.   2.12   28.2   8.12  57.5
12        vjg,35        4    1.   5.7    8.64   7.46  0.25
13        vaw,424.2     4    1.   64.3   0.435  4.3   35.3
14        bda,243       1    0.   4.4    4.6    2.4   4.2
15        rbe,24.2      3    1.   43     53.5   4.4   8.5

I want to, for each row, calculate the variance of values from rd_1 to rd_5 (they are doubles). ID and state_id uniquely identifies a row. The desired output is the like the following:

ID        state_id     act  rd_1 rd_2    rd_3   rd_4  rd_5.   var_rd
11        abc,13.3      1    1.   31     17.4   32.4  0.4    192.6624
11        afd,23.2      4    1.   42.1   1.3    31.9  0.39   323.3181
11        dfw,154       7    0.   0.3    4.3    8.21  163.3  4109.9855
12        vab,64.5      8    1.   32.3   11     2.1   21.3   141.3463
13        vaw,424.2     4    1.   64.3   0.435  4.3   35.3   636.2333
14        bda,243       1    0.   4.4    4.6    2.4   4.2    3.0496
15        rbe,24.2      3    1.   43     53.5   4.4   8.5    473.2456

I know it is possible to use pivot to flatten the data and then calculate variance on column (rd_value) in the flattened data. But the SQL I use do not support Pivot method. I tried using UNION but it appears that it messes up with user_id.


Solution

  • I would approach this just by applying the formula for variance:

    select t.*,
           ( (rd_1 - rd_avg) * (rd_1 - rd_avg) +
             (rd_2 - rd_avg) * (rd_2 - rd_avg) +
             (rd_3 - rd_avg) * (rd_3 - rd_avg) +
             (rd_4 - rd_avg) * (rd_4 - rd_avg) +
             (rd_5 - rd_avg) * (rd_5 - rd_avg)
           )/4 as variance 
    from (select t.*, 
                 (rd_1 + rd_2 + rd_3 + rd_4 + rd_5) / 5 as rd_avg -- This assumes you always have all 5 columns filled
          from t
         ) t