Tags: sql, sql-server, t-sql, group-by, gini

How to aggregate information from an indefinite number of groups


How can I aggregate information from an indefinite number of groups in T-SQL? For example, we have a table with two columns, Clients and Regions.

Clients Regions
client1 45
client1 45
client1 45
client1 45
client1 43
client1 42
client1 41
client2 45
client2 45
client3 43
client3 43
client3 41
client3 41
client3 41
client3 41

Every client can have any number of regions.

In the example above, client1 has 4 distinct regions, client2 has 1, and client3 has 2.

I want to compute the Gini impurity for each client, i.e. to measure how varied the regions are within each client.

To do this, I want to apply to each client the following formula:

1 - ((% of region1 among all the regions in the client) ^ 2 + 
     (% of region2 among all the regions in the client) ^ 2 + 
   … (% of regionN among all the regions in the client) ^ 2)

But the number of regions is not fixed (it may differ for each client).

This is what should be calculated:

client1 = 1 - ((4 / 7) ^ 2 + (1 / 7) ^ 2 + (1 / 7) ^ 2 + (1 / 7) ^ 2)
client2 = 1 - ((2 / 2) ^ 2)
client3 = 1 - ((2 / 6) ^ 2 + (4 / 6) ^ 2)

This is the desired output:

Clients Impurity
client1 61%
client2 0%
client3 44%

Could you suggest a way to solve this problem?


Solution

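  • I think the formula can be expressed with two levels of GROUP BY: the first computes each region's share within its client, and the second sums the squared shares per client: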

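    WITH cte AS (
        -- One row per (client, region): the region's share of the client's rows.
        -- The CAST avoids integer division; the window SUM gives the client's total.
        SELECT Clients
             , CAST(COUNT(*) AS DECIMAL(10, 0)) / SUM(COUNT(*)) OVER(PARTITION BY Clients) AS tmp
        FROM t
        GROUP BY Clients, Regions
    )
    -- Gini impurity per client, as a percentage: 100 * (1 - sum of squared shares)
    SELECT Clients
         , 100 * (1 - SUM(tmp * tmp)) AS GI
    FROM cte
    GROUP BY Clients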
    

    The result on db<>fiddle seems to match the expected output.
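
To try the approach end to end, here is a minimal, self-contained sketch for SQL Server. It is an illustration only: the temporary table `#t`, its column types, and the names `share` and `ImpurityPct` are assumptions, not part of the original question or answer. It loads the sample rows from the question, applies the same two-level aggregation, and rounds the result to whole percent.

```sql
-- Hypothetical temp table standing in for the table t from the question.
CREATE TABLE #t (Clients VARCHAR(20), Regions INT);

INSERT INTO #t (Clients, Regions) VALUES
    ('client1', 45), ('client1', 45), ('client1', 45), ('client1', 45),
    ('client1', 43), ('client1', 42), ('client1', 41),
    ('client2', 45), ('client2', 45),
    ('client3', 43), ('client3', 43),
    ('client3', 41), ('client3', 41), ('client3', 41), ('client3', 41);

WITH cte AS (
    -- Share of each region within its client; * 1.0 forces decimal division.
    SELECT Clients
         , COUNT(*) * 1.0 / SUM(COUNT(*)) OVER (PARTITION BY Clients) AS share
    FROM #t
    GROUP BY Clients, Regions
)
-- 1 minus the sum of squared shares, as a rounded whole-number percentage.
SELECT Clients
     , CAST(ROUND(100 * (1 - SUM(share * share)), 0) AS INT) AS ImpurityPct
FROM cte
GROUP BY Clients
ORDER BY Clients;

DROP TABLE #t;
```

For the sample data the unrounded values are 30/49 (about 61.2%) for client1, 0 for client2, and 16/36 (about 44.4%) for client3, so the rounded column should line up with the desired output in the question. Because the inner query groups by whatever regions actually occur, no fixed number of regions per client is needed.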