Search code examples
c#similarityinformation-retrieval

Find Top X most similar users, based on Y boolean attributes


Let's say I have 5 users, each with 5 boolean attributes, which could look like this:

        | A   B   C   D   E
 --------------------------
 User 0 | 1   1   0   1   0
 User 1 | 0   1   0   1   0
 User 2 | 0   0   1   0   1
 User 3 | 1   1   0   0   0
 User 4 | 0   0   0   1   0

Now what would be the best approach to get a list of the top x users with the most "trues" in common. So in the example above the ranking should look like his:

Top 1: Users 0 (most true attributes)
Top 2: Users 0 and 1 OR Users 0 and 3 (both pairs have 2 attributes in common)
Top 3: Users 0, 1 and 3
Top 4: Users 0, 1, 3 and 4
Top 5: Users 0, 1, 2, 3, 4

I know there are metrics and distance measures to tell how similar two users are, but i want a list of most similar ones. Should i use some kind of clustering algorithm? But which one would consider multiple binary attributes and how could I implement it (preferably in C#)?

Since I haven't taken any classes on data mining, the literature on this topic is kinda overwhelming, so any help is highly appreciated.


Solution

  • User mostTrueUser = Users
            .OrderByDescending(u => (u.A?1:0) + (u.B?1:0) + (u.C?1:0) + (u.D?1:0) + (u.E?1:0))
            .First();
    
    var groups = Users.GroupBy(u => ((u.A && mostTrueUser.A)?1:0)
                                   +((u.B && mostTrueUser.B)?1:0)
                                   +((u.C && mostTrueUser.C)?1:0)
                                   +((u.D && mostTrueUser.D)?1:0)
                                   +((u.E && mostTrueUser.E)?1:0)
                              ,u => u).OrderByDescending(g => g.Key);
    foreach(var group in groups)
    {
        Console.WriteLine("{0}  // following have {0} 'true' in common with {1}",
                          group.Key,
                          mostTrueUser.ID);
        foreach(var g in group)
        {
            Console.WriteLine("  " + g.ID);
        }
    }
    

    This gives me the following:

    3   // following have 3 'true' in common with 0
      0
    2   // following have 2 'true' in common with 0
      1
      3
    1   // following have 1 'true' in common with 0
      4
    0   // following have 0 'true' in common with 0
      2
    

    Explanations

    I used u.A?1:0 so true becomes 1 and false becomes 0.
    I then got the User with most true using OrderByDescending([sum of trues]).
    Then the GroupBy is used to group all Users on the number of true in common with the mostTrueUser.


    Your ranking seems a little bit more complicated, but you can start with this to solve it.


    I wrote a little tweak:

    public class UserRank
    {
        public User UserA{get;set;}
        public User UserB{get;set;}
        public int Compare{
            get{return ((UserA.A && UserB.A)?1:0)
                      +((UserA.B && UserB.B)?1:0)
                      +((UserA.C && UserB.C)?1:0)
                      +((UserA.D && UserB.D)?1:0)
                      +((UserA.E && UserB.E)?1:0);}
        }
    }
    

    and then:

    List<UserRank> userRanks = new List<UserRank>();
    for(int i=0;i<Users.Count;i++)
    {
        for(int j=i;j<Users.Count;j++)
        {
            userRanks.Add(new UserRank
            {
                UserA = Users[i],
                UserB = Users[j]
            });
        }
    }
    
    var groups = userRanks.GroupBy(u => u.Compare, u => u).OrderByDescending(g => g.Key);
    
    foreach(var group in groups)
    {
        Console.WriteLine("{0} in common:",group.Key);
    
        foreach(var u in group)
        {
            Console.WriteLine("  {0}-{1}",u.UserA.ID,u.UserB.ID);
        }
    }
    

    gives me:

    3 in common:
      0-0
    2 in common:
      0-1
      0-3
      1-1
      2-2
      3-3
    1 in common:
      0-4
      1-3
      1-4
      4-4
    0 in common:
      0-2
      1-2
      2-3
      2-4
      3-4
    

    TutorialsPoint CodingGround permalink for testing purpose