Search code examples
postgresqlmany-to-manyrelational-division

How to find items with multiple tags on them?


Problem:

I have a table named item_tag_assn which maps items with tags (many-to-many association table). I need to find out items which have a set of tags applied to them. For example, if my table has following data:

 item_id | tag_id 
------------------
     205 | 110
     206 | 120
     207 | 130
     205 | 130
     206 | 147
     210 | 110
     205 | 152
     209 | 111
     210 | 177
     205 | 147
     212 | 110
     212 | 135
     205 | 135
     212 | 147
------------------

and If I am searching for

  • items with tags 110, 135 and 147 then I expect items #205 and #212 in the result set.
  • items with tags 110, 130, 135, 147 and 152 then I should get only item #205 because only #205 has all of those tags associated to it.

Environment:

  • PostgreSQL 9.5
  • I am not allowed to add a third column to this table or create new tables altogether.

Progress so far:

I have found a solution like this:

SELECT DISTINCT ita1.item_id 
FROM 
  item_tag_assn AS ita1 
  LEFT JOIN 
    item_tag_assn AS ita2 ON ita1.item_id = ita2.item_id 
  LEFT JOIN 
    item_tag_assn AS ita3 ON ita2.item_id = ita3.item_id 
GROUP BY ita1.item_id 
HAVING 
  sum((ita1.tag_id = 110 and ita2.tag_id = 135 and ita3.tag_id = 147)::integer) >= 1

and it works.


Optimization required

The association table is rather large. Joining it with itself is expensive and slows down, plus it is not very scalable. I think window functions can help but I do not know how to use them.

Is there any better way to solve the problem?


Solution

  • If I understand correctly you need something like this:

    WITH search AS (
        SELECT '{110,130,135,147,152}'::int4[] as search
    ), searched AS (
        SELECT DISTINCT item_id,
               tag_id
          FROM item_tag_assn
          JOIN search ON (tag_id) = ANY(search)
      ORDER BY 1, 2
    ), aggregated AS (
        SELECT item_id,
               array_agg(tag_id) AS agg
          FROM searched
      GROUP BY 1
    )
    SELECT *
      FROM aggregated, search
     WHERE agg = search
    ;
    

    search - is to set searched array (array must be presorted). searched - all rows than have searched tag aggregated - aggregated in array tag_id per item_id

    You can change agg = search to agg @> search and after that you do not need presorting and ORDER BY in searched.

    When add your dataset from question:

    WITH item_tag_assn AS (
          SELECT   205 as item_id, 110 as tag_id
          UNION SELECT     206 , 120
          UNION SELECT     207 , 130
          UNION SELECT     205 , 130
          UNION SELECT     206 , 147
          UNION SELECT     210 , 110
          UNION SELECT     205 , 152
          UNION SELECT     209 , 111
          UNION SELECT     210 , 177
          UNION SELECT     205 , 147
          UNION SELECT     212 , 110
          UNION SELECT     212 , 135
          UNION SELECT     205 , 135
          UNION SELECT     212 , 147
    ),search AS (
        SELECT '{110,130,135,147,152}'::int4[] as search
    ), searched AS (
        SELECT DISTINCT item_id,
               tag_id
          FROM item_tag_assn
          JOIN search ON (tag_id) = ANY(search)
      ORDER BY 1, 2
    ), aggregated AS (
        SELECT item_id,
               array_agg(tag_id) AS agg
          FROM searched
      GROUP BY 1
    )
    SELECT *
      FROM aggregated, search
     WHERE agg = search
    ;
    

    Result:

     item_id |          agg          |        search         
    ---------+-----------------------+-----------------------
         205 | {110,130,135,147,152} | {110,130,135,147,152}
    (1 row)
    

    If change search to '{110,135,147}':

     item_id |      agg      |    search     
    ---------+---------------+---------------
         212 | {110,135,147} | {110,135,147}
         205 | {110,135,147} | {110,135,147}
    (2 rows)
    

    For running on product you need to create index CREATE INDEX ON item_tag_assn (tag_id);

                                                                   QUERY PLAN                                                               
    ----------------------------------------------------------------------------------------------------------------------------------------
     Hash Join  (cost=32.72..35.34 rows=1 width=68) (actual time=0.055..0.059 rows=3 loops=1)
       Hash Cond: (aggregated.agg = search.search)
       CTE search
         ->  Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.001..0.001 rows=1 loops=1)
       CTE searched
         ->  Unique  (cost=27.73..28.55 rows=110 width=8) (actual time=0.029..0.031 rows=3 loops=1)
               ->  Sort  (cost=27.73..28.00 rows=110 width=8) (actual time=0.029..0.029 rows=3 loops=1)
                     Sort Key: x.item_id, x.tag_id
                     Sort Method: quicksort  Memory: 25kB
                     ->  Nested Loop  (cost=10.40..24.00 rows=110 width=8) (actual time=0.013..0.014 rows=3 loops=1)
                           ->  CTE Scan on search search_1  (cost=0.00..0.02 rows=1 width=32) (actual time=0.002..0.002 rows=1 loops=1)
                           ->  Bitmap Heap Scan on x  (cost=10.40..22.88 rows=110 width=8) (actual time=0.009..0.009 rows=3 loops=1)
                                 Recheck Cond: (tag_id = ANY (search_1.search))
                                 Heap Blocks: exact=1
                                 ->  Bitmap Index Scan on i1  (cost=0.00..10.38 rows=110 width=0) (actual time=0.002..0.002 rows=3 loops=1)
                                       Index Cond: (tag_id = ANY (search_1.search))
       CTE aggregated
         ->  HashAggregate  (cost=2.75..4.12 rows=110 width=36) (actual time=0.038..0.039 rows=3 loops=1)
               Group Key: searched.item_id
               ->  CTE Scan on searched  (cost=0.00..2.20 rows=110 width=8) (actual time=0.029..0.031 rows=3 loops=1)
       ->  CTE Scan on aggregated  (cost=0.00..2.20 rows=110 width=36) (actual time=0.040..0.043 rows=3 loops=1)
       ->  Hash  (cost=0.02..0.02 rows=1 width=32) (actual time=0.005..0.005 rows=1 loops=1)
             Buckets: 1024  Batches: 1  Memory Usage: 9kB
             ->  CTE Scan on search  (cost=0.00..0.02 rows=1 width=32) (actual time=0.000..0.001 rows=1 loops=1)
     Planning time: 0.309 ms
     Execution time: 0.115 ms