Algorithm - group users based on most similar preferences

I'm developing an application where users are grouped in teams of 'n' based on a set of questions they answered. The set of questions is a basic multiple choice and each user has to answer each question.

These are the hard criteria:

Each group can have up to 'n' users
Each user can only be assigned to 1 team
Each team consist of users that have a high similarity

the datasets I am using are looking as followed (I can change this tho)

{
 1: { 1: 'a', 2: 'b', 3: 'a', 4: 'c' },
 2: { 1: 'b', 2: 'c', 3: 'b', 4: 'd' },
 3: { 1: 'b', 2: 'a', 3: 'c', 4: 'd' },
 ...
}

In my first attempt : I created a function that will, when given an initial user, return a set of users in order of similarity. This worked fine but does not provide solid groups.

In my second attempt : I tried to define a lower and upper precision. I then recursively looped over all users and pushed them on the team where the joined similarity of the members was higher then the upper precision. If not I would adjust the precision in the next iteration. This gave me solid groups but the users in each group are not as accurate as it should/could? be.

I'm now looking into actual algorithms, specifically the Gale-Shapely Algorithm In order to solve my problem. Yet given the fact i'm a developer and not a data scientist the details are lost on me.

Any advice or solutions for my problem are well appreciated.

Solution

This is a very hard problem, but here is a formalization that may help you.

Say you have N users. You can view them as a complete graph of N nodes, where the weight of edge (i, j) is the "similarity" of users i and j (number of common answers for instance). You are then looking for a partition of the vertices into groups of size n that maximizes intra-partition edge weight, i.e. a partition P maximizing

$M(P)=\sum_{i=1}^k \sum_{u,v\in P_i} w\left( u,v \right)$

This can be proved to be the same as minimizing inter-partition weights.

This transformation makes your problem boil down to finding a (k, nu)-balanced graph partitionning. This is a hard problem, but Balanced Graph partitionning, Andreev and Räcke, ACM 2004 provides an approximation algorithm and its detailled complexity analysis.

You can use that algorithm with a loose value for nu, get an approximate answer, and then rebalance users to get exactly n in each group. This will hopefully give close-to-optimal results, but optimality will be very hard to achieve.