Search code examples
sqlpostgresqlrelational-division

SQL aggregate unique pairs


I have a PostgreSQL table that is mostly a bridge table but it also has some extra stuff.

Essentially it holds the information about players in a game. So we have a unique id for this instance of a player in a game. Then an id that is FK to game table, and an id that is FK to player table. There is also some other irrelevant stuff. Something like this:

Table players_games
| id        | 12564
| player_id | 556
| game_id   | 156184

What I want to do is find how many occurrences there are of a player playing with another. So, if player1 is in the same game as player2, they have played together once. There are 2+ players in a game.

So what I want to do is populate a new table, that holds three values: player_lo, player_hi, times_played.

And either have one row for each pair and the number of times they played, or if it ends up being more efficient, a row for each iteration and have the value set as 1 so these can be added together later, maybe distributed. So you might see something like:

p1, p2, 1
p1, p2, 1

And these get reduced later to:

p1, p2, 2

So I was wondering if there was some clever way to do this with SQL, or if there's SQL that can reduce my programming effort, before starting to write a slightly complex python script to do it.


Solution

  • To do this, you need to do a self join on the player_games table. The first subquery is for the first player, and the second for the second player. The "first" player is the one with the lower player id.

    select pg1.player_id as player1, pg2.player_id as player2, count(*) as num_games
    from (select distinct game_id, player_id
          from  players_games pg
         ) pg1 join
         (select distinct game_id, player_id
          from players_games pg
         ) pg2
         on pg1.game_id = pg2.game_id and
            pg1.player_id < pg2.player_id
    group by pg1.player_id, pg2.player_id
    

    Note that the join condition uses a "<" on the player ids. This is to prevent counting duplicates (so players A,B are not also counted as B,A).

    Also, I added a "distinct" in the inner subqueries just in case a single player might appear more than once for a given game. Perhaps this is not necessary. To be sure, you should have a unique index on the composite key game_id, player_id.