Search code examples
pythonlistgeneratorpermutation

Large dataset and finding permutations matching various criteria


I have a list of football players with length 15000 which consists of dicts (same size all). An element in the list looks like this:

{
    'id': '123456',
    'name': 'Foo Bar',
    'position': 'GK',
    'club':  'Python FC',
    'league': 'Champions League',
    'country': 'Neverland'
}

Given a team which consists of 11 players, where each position is specified and must be filled. A formation may look like this:

formation = [('ST', 2), ('LM', 1), ('RM', 1), ('CM', 2), ('LB', 1), ('RB', 1), ('CB', 2), ('GK', 1)]

I would like to find all possible combinations/permutations of players matching the following criteria:

  • formation
  • players from at least 5 different countries
  • maximum 4 from the same club

What I have tried

Assuming my_list is the list of all players;

Attempt 1

from itertools import *
squads = permutations(my_list,11)
match_countries = [squad for squad in squads if 
    Counter([player['country'] for player in squad]).most_common().__len__() >= 5
    and Counter([player['club'] for player in squad]).most_common().__len__() <= 4
  ]  

But this will take WAY too long, because I have mal-formed squads.

Attempt 2 Still using the same list of players initially, I split it up in per-position lists. So I have a list of players per position, like this:

list_goalkeepers = [player for player in my_list if player['position'] == 'GK']

Then using these lists I make squads. Example a squad with 2 strikers and a goalkeeper would look like this:

squads = product(list_goalkeepers, list_strikers, list_strikers, .....)

This results still in a huge number of squads, but at least they are all valid; when I try to find a match for number of countries, it'll still iterate through all squads to check if there is a match. I perform the country search like this:

match_countries = [squad for squad in collection 
  if Counter([player['country'] for player in squad]).most_common().__len__() >= 5 ]

Is there any way to do this fast(er)? This is tediously slow.


Solution

  • Here's a couple of things which will help, but they won't reduce this to a tractable problem (see below).

    First,

    squads = product(list_goalkeepers, list_strikers, list_strikers, .....)
    

    is not actually correct. product([striker1, striker2], [striker1, striker2]) (to just look a small bit of that product) generates four possibilities:

    [striker1, striker1]
    [striker1, striker2]
    [striker2, striker1]
    [striker2, striker2]
    

    Of those, two are incorrect because the same player is included twice in a squad, and the other two are duplicates because the order of the players in each squad is irrelevant. So there is actually only one legal combination, {striker1, striker2}. To get that, you need itertools.combinations(strikers, 2).

    If the list A has n elements, product(A, A) will produce n² lists, whereas combinations(A, 2) will produce (n²-n)/2 lists, about half the number. Since you have three positions with two players, your product invocation generates a bit more than 8 times too many squads. So getting it right will speed things up quite a bit. But it's not quite a simple as adding some calls to combination. What you need to do is something like this:

    from collections import defaultdict
    from itertools import product, combinations, chain
     
    position_players = defaultdict(list)
    for player in all_players:
        position_players[player['position']].append(player)
    def flatten(list_of_lists):
        return [*chain.from_iterable(list_of_lists)]
    # See below for more general solution
    candidates = [*map(flatten,
                       product(
                           product(*(position_players[pos]
                                     for pos in ('LM', 'RM', 'LB', 'RB', 'GK'))),
                           *(combinations(position_players[pos], 2)
                             for pos in ('ST', 'CM', 'CB'))))]
    

    A more general solution would use formations to construct the final product, but I think the above is already hard enough to read :-). Still, for what it's worth:

    candidates = [*map(flatten,
                       product(
                           *(combinations(position_players[posn], count)
                             for posn, count in formation)))]
    

    Secondly, you seem to be implementing both of the other criteria, the maximum number of players per club and the minimum number of countries, using the same formula involving counter.most_common().__len__(). Leaving aside the question of why you directly call the __len__ dunder instead of just using the more natural len(counter.most_common()), this formulation is either incorrect or inefficient:

    • Counter([player['club'] for player in squad]).most_common().__len__() <= 4 checks whether there are at most four clubs represented in the squad. But the criterion you have is that no club be represented by more than four players, which would be Counter([player['club'] for player in squad]).most_common(1) <= 4.

    • Counter([player['country'] for player in squad]).most_common().__len__() >= 5 does check that there are at least five countries represented. But so does the much simpler (and somewhat faster): len(set(player['country'] for player in squad)) >= 5

    Fixing those will make the list correct, and speed the solution up considerably. But it won't really help.

    As with many combinatorial problems, it's easy to underestimate the number of possible candidates and thus formulate completely impractical solutions which involve generating every possibility.

    As a quick illustration, let's suppose 5001 players are distributed roughly proportionately between positions: 1364 candidates for each of the 5 unitary positions ('LM', 'RM', 'LB', 'RB', 'GK') and 2727 candidates for each of the remaining three dual positions ('ST', 'CM', 'CB'). There are then 1364⁵ * (2727 * 2726 / 2)³ possible squads, leaving aside the country/club criterion which probably don't eliminate the majority of possibilities. That works out to 901,147,384,847,503,556,419,043,700,514,410,951,988,224 possible squads. I think it's safe to say that iterating over all of those is not just "agonizingly slow". It's impossible to do within your lifetime (or, indeed, the predicted lifetime of the planet).

    You are probably better advised to find a way of creating and using a random sample of a tractable size, selected uniformly from the universe of possibilities.