Search code examples
mysqlquery-optimization

Search entity by tag and return all other tags as well in MySQL


In my case an entity is a recipe and a tag an ingredient. I want to search by ingredient and return all recipes containing the ingredient and return all other ingredients as well. In my case ingredients can have synonyms.

First, here are my tables:

CREATE TABLE `ingredient` (
  `id` int NOT NULL AUTO_INCREMENT,
  `name` varchar(45) DEFAULT NULL,
  `synonym_group` int DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `name_UNIQUE` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

CREATE TABLE `recipe` (
  `id` int NOT NULL AUTO_INCREMENT,
  `name` varchar(100) DEFAULT NULL,
  `type` varchar(45) DEFAULT NULL,
  `comment` varchar(200) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

CREATE TABLE `recipe_ingredient` (
  `recipe_id` int NOT NULL,
  `ingredient_id` int NOT NULL,
  `alternative_ingredient_id` int DEFAULT NULL,
  `is_optional` tinyint(1) DEFAULT '0',
  PRIMARY KEY (`recipe_id`,`ingredient_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

Here is the query I've come up with:

SELECT 
  r.name, 
  i_all.name

FROM
  recipe r
  
JOIN (
    SELECT DISTINCT 
        recipe_id 
    FROM 
        ingredient i 
        JOIN recipe_ingredient ri ON i.id = ri.ingredient_id 
        LEFT JOIN ingredient i_syn ON i.synonym_group = i_syn.synonym_group 
    WHERE i.name = 'Asd' OR i_syn.name = 'Asd'
) i ON i.recipe_id = r.id 

JOIN recipe_ingredient ri ON i.recipe_id = ri.recipe_id 
JOIN ingredient i_all ON ri.ingredient_id = i_all.id

I feel like this query is very complicated for what I want to do and there is a much simpler way. Mainly it feels weird that I have to specify the search term twice in the query. Right now, if I only filter by i.name = 'Asd' I won't find any synonyms and if I only filter by i_syn.name = 'Asd' I won't find anything if the search term has no synonyms. Removing the distinct doesn't seem to change anything.

Is there just no better way or am I thinking too complicated?

Example data for ingredients:

id name synonym_group
1 Bell Pepper 1
2 Capsicum 1
3 Egg NULL
4 Sugar NULL

Solution

  • You say that -

    Removing the distinct doesn't seem to change anything.

    which suggests your test data is inadequate. The following illustrates the importance of DISTINCT in your current query. Given your ingredients listed above and the below recipes -

    id name type comment
    1 recipe 1 type comment
    2 recipe 2 type comment
    3 recipe 3 type comment
    4 recipe 4 type comment
    5 recipe 5 type comment
    recipe_id ingredient_id alternative_ingredient_id is_optional
    1 1 NULL 0
    1 3 NULL 0
    2 1 NULL 0
    2 4 NULL 0
    3 2 NULL 0
    3 3 NULL 0
    3 4 NULL 0
    4 2 NULL 0
    4 4 NULL 0
    5 1 NULL 0
    5 3 NULL 0
    5 4 NULL 0

    Your inner query -

    SELECT DISTINCT 
        recipe_id 
    FROM 
        ingredient i 
        JOIN recipe_ingredient ri ON i.id = ri.ingredient_id 
        LEFT JOIN ingredient i_syn ON i.synonym_group = i_syn.synonym_group
    WHERE i.name = 'Bell Pepper' OR i_syn.name = 'Bell Pepper';
    

    returns -

    recipe_id
    1
    2
    3
    4
    5

    but if you drop the DISTINCT it returns -

    recipe_id
    1
    1
    2
    2
    3
    4
    5
    5

    If you drop DISTINCT but add an additional join predicate to stop it LEFT JOINing to itself you get the desired result -

    SELECT
        recipe_id 
    FROM 
        ingredient i 
        JOIN recipe_ingredient ri ON i.id = ri.ingredient_id 
        LEFT JOIN ingredient i_syn ON i.synonym_group = i_syn.synonym_group AND i.id <> i_syn.id
    WHERE i.name = 'Bell Pepper' OR i_syn.name = 'Bell Pepper';
    

    More important than the query is the need to have the inverse index on the junction table -

    ALTER TABLE recipe_ingredient
        ADD INDEX IDX_ingredient_recipe (ingredient_id, recipe_id);
    

    Your current query can be improved by nesting the ingredients list in another derived table and modified to allow for searches based on multiple ingredients with a best match type ranking. This is a bit of a hybrid of your original query and the solution proposed by P.Salmon -

    SELECT r.name, i.name
    FROM (
        SELECT recipe_id, COUNT(DISTINCT ri.ingredient_id) AS num_matched
        FROM recipe_ingredient ri
        JOIN (
            SELECT COALESCE(i2.id, i1.id) AS ingredient_id
            FROM ingredient i1
            LEFT JOIN ingredient i2 ON i1.synonym_group = i2.synonym_group
            WHERE i1.name IN ( 'capsicum', 'onions' )
        ) all_syn ON ri.ingredient_id = all_syn.ingredient_id
        GROUP BY recipe_id
        ORDER BY num_matched DESC
        LIMIT 0, 10
    ) mr
    JOIN recipe r ON mr.recipe_id = r.id
    JOIN recipe_ingredient ri ON r.id = ri.recipe_id
    JOIN ingredient i ON ri.ingredient_id = i.id
    ORDER BY mr.num_matched DESC, r.name ASC;
    

    Here's a db<>fiddle

    This query does not rely on the synonym_group being the id of one of its members.

    Note the GROUP BY... ORDER BY... and LIMIT clauses on the middle layer query. With my test dataset of 39,774 recipes, 6,702 ingredients and 428,249 the server-side execution time is 13-16ms

    Here's a little db<>fiddle to illustrate how the innermost query works to get the ingredients list.