Search code examples
phpmysqlrelationshiptagging

MySQL / PHP: Find similar / related items by tag / taxonomy


I have a cities table which looks like this.

|id| Name    |
|1 | Paris   |
|2 | London  |
|3 | New York|

I have a tags table which looks like this.

|id| tag            |
|1 | Europe         |
|2 | North America  |   
|3 | River          |

and a cities_tags table:

|id| city_id | tag_id |
|1 | 1       | 1      | 
|2 | 1       | 3      | 
|3 | 2       | 1      |
|4 | 2       | 3      | 
|5 | 3       | 2      |     
|6 | 3       | 3      |

How do I calculate which are the most closely related city? For example. If I were looking at city 1 (Paris), the results should be: London (2), New York (3)

I have found the Jaccard index but I'm unsure as how best to implement this.


Solution

  • You question about How do I calculate which are the most closely related city? For example. If I were looking at city 1 (Paris), the results should be: London (2), New York (3) and based on your provided data set there is only one thing to relate that is the common tags between the cities so the cities which shares the common tags would be the closest one below is the subquery which finds the cities (other than which is provided to find its closest cities) that shares the common tags

    SELECT * FROM `cities`  WHERE id IN (
    SELECT city_id FROM `cities_tags` WHERE tag_id IN (
    SELECT tag_id FROM `cities_tags` WHERE city_id=1) AND city_id !=1 )
    

    Working

    I assume you will input one of the city id or name to find their closest one in my case "Paris" has the id one

     SELECT tag_id FROM `cities_tags` WHERE city_id=1
    

    It will find all the tags id which paris has then

    SELECT city_id FROM `cities_tags` WHERE tag_id IN (
        SELECT tag_id FROM `cities_tags` WHERE city_id=1) AND city_id !=1 )
    

    It will fetch all the cities except paris that has the some same tags that paris also has

    Here is your Fiddle

    While reading about the Jaccard similarity/index found some stuff to understand about the what actualy the terms is lets take this example we have two sets A & B

    Set A={A, B, C, D, E}

    Set B={I, H, G, F, E, D}

    Formula to calculate the jaccard similarity is JS=(A intersect B)/(A union B)

    A intersect B = {D,E}= 2

    A union B ={A, B, C, D, E,I, H, G, F} =9

    JS=2/9 =0.2222222222222222

    Now move towards your scenario

    Paris has the tag_ids 1,3 so we make the set of this and call our Set P ={Europe,River}

    London has the tag_ids 1,3 so we make the set of this and call our Set L ={Europe,River}

    New York has the tag_ids 2,3 so we make the set of this and call our Set NW ={North America,River}

    Calculting the JS Paris with London JSPL = P intersect L / P union L , JSPL = 2/2 = 1

    Calculting the JS Paris with New York JSPNW = P intersect NW / P union NW ,JSPNW = 1/3 = 0.3333333333

    Here is the query so far which calcluates the perfect jaccard index you can see the below fiddle example

    SELECT a.*, 
    ( (CASE WHEN a.`intersect` =0 THEN a.`union` ELSE a.`intersect` END ) /a.`union`) AS jaccard_index 
     FROM (
    SELECT q.* ,(q.sets + q.parisset) AS `union` , 
    (q.sets - q.parisset) AS `intersect`
    FROM (
    SELECT cities.`id`, cities.`name` , GROUP_CONCAT(tag_id SEPARATOR ',') sets ,
    (SELECT  GROUP_CONCAT(tag_id SEPARATOR ',')  FROM `cities_tags` WHERE city_id= 1)AS parisset
    
    FROM `cities_tags` 
    LEFT JOIN `cities` ON (cities_tags.`city_id` = cities.`id`)
    GROUP BY city_id ) q
    ) a ORDER BY jaccard_index DESC 
    

    In above query i have the i have derived the result set to two subselects in order get my custom calculated aliases

    enter image description here

    You can add the filter in above query not to calculate the similarity with itself

    SELECT a.*, 
    ( (CASE WHEN a.`intersect` =0 THEN a.`union` ELSE a.`intersect` END ) /a.`union`) AS jaccard_index 
     FROM (
    SELECT q.* ,(q.sets + q.parisset) AS `union` , 
    (q.sets - q.parisset) AS `intersect`
    FROM (
    SELECT cities.`id`, cities.`name` , GROUP_CONCAT(tag_id SEPARATOR ',') sets ,
    (SELECT  GROUP_CONCAT(tag_id SEPARATOR ',')  FROM `cities_tags` WHERE city_id= 1)AS parisset
    
    FROM `cities_tags` 
    LEFT JOIN `cities` ON (cities_tags.`city_id` = cities.`id`) WHERE  cities.`id` !=1
    GROUP BY city_id ) q
    ) a ORDER BY jaccard_index DESC
    

    So the result shows Paris is closely related to London and then related to New York

    Jaccard Similarity Fiddle