Search code examples
mysqldatabaseprofilingsimilarity

Compare similarities between two result sets


I am creating a music website where I would like users to be able to find users who like approximately the same artists as they.

I have a 'like' table that has 2 columns 'id_user', 'id_artist'. Here is an example of how I would like it to work:

User 1 likes:
1, 12
1, 13
1, 14
1, 26
1, 42
1, 44

User 2 likes:
2, 13
2, 14
2, 15
2, 26
2, 42
2, 56

Those 2 users have 4 artists in common. Is there a way, to compare those 2 results sets, to find the most similar people in the database?

My first idea was to concatenate likes in that way: "12,13,14,26,42,44" in a string, and use mysql FULLTEXT scores to compare different strings. That didn't work... don't know why but mysql fulltext only works with text... not with numbers...

Any idea or any clue would be much appreciated.


Solution

  • Something like this:

    SELECT first_user.id_user, second_user.id_user, COUNT(first_user.id_user) AS total_matches
    
    FROM likes AS first_user
    
    JOIN likes AS second_user
    ON second_user.id_artist = first_user.id_artist
    AND second_user.id_user != first_user.id_user
    
    GROUP BY first_user.id_user, second_user.id_user
    
    ORDER BY total_matches DESC
    
    LIMIT 1
    

    Note that this isn't very efficient. One way to work around this is to make a 'cache table' containing the output of this query with the LIMIT 1 portion removed. Add some relevant indexes and do query this cache table. You could set a cron job to update this table periodically.

    Example:

    CREATE TABLE IF NOT EXISTS `likes` (
      `id_user` varchar(50) DEFAULT NULL,
      `id_artist` varchar(50) DEFAULT NULL
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
    
    INSERT INTO `likes` (`id_user`, `id_artist`) VALUES ('8', '39'), ('8', '37'), ('4', '37'), ('8', '24'), ('8', '7'), ('4', '28'), ('8', '28'), ('4', '27'), ('4', '11'), ('8', '49'), ('4', '7'), ('4', '40'), ('4', '29'), ('8', '22'), ('4', '29'), ('8', '11'), ('8', '28'), ('4', '7'), ('4', '31'), ('8', '42'), ('8', '25'), ('4', '25'), ('4', '17'), ('4', '32'), ('4', '46'), ('4', '19'), ('8', '34'), ('3', '32'), ('4', '21')
    
    +---------+---------+---------------+
    | id_user | id_user | total_matches |
    +---------+---------+---------------+
    | 8       | 4       |             7 |
    +---------+---------+---------------+