Search code examples
sqljoinmappingrelational-division

Join between mapping (junction) table with specific cardinality


I have a simple question about the most efficient way to perform a particular join.

Take these three tables, real names have been changed to protect the innocent:

Table: animal

animal_id   name   ...
======================
1           bunny
2           bear
3           cat
4           mouse

Table: tags

tag_id     tag
==================
1          fluffy
2          brown
3          cute
4          small

Mapping Table: animal_tag

animal_id   tag_id
==================
1           1
1           2
1           3
2           2
3           4
4           2

I want to find all animals that are tagged as 'fluffy', 'brown', and 'cute'. That is to say that the animal must be tagged with all three. In reality, the number of required tags can vary, but should be irrelevant for this discussion. This is the query I came up with:

SELECT * FROM animal
JOIN (
      SELECT at.animal_id FROM animal_tag at
      WHERE at.tag_id IN (
                          SELECT tg.tag_id FROM tag tg
                          WHERE tg.tag='fluffy' OR tg.tag='brown' OR tg.tag='cute'
                          )
      GROUP BY at.animal_id HAVING COUNT(at.tag_id)=3
      ) AS jt
ON animal.animal_id=jt.animal_id

On a table with thousands 'animals' and and hundreds of 'tags', this query performs respectably ... 10s of milliseconds. However, when i look at the query plan (Apache Derby is the DB), the optimizer's estimated cost is pretty high (9945.12) and the plan pretty extensive. For a query this "simple" I usually try to get query plans with an estimated cost of single or double digits.

So my question is, is there a better way to perform this query? Seems like a simple query, but I've been stumped coming up with anything better.


Solution

  • First of all, a huge thanks to everyone who jumped in on this. Ultimately the answer is, as referenced by several commenters, relational division.

    While I did take a course in Codd's relational data model many moons ago, the course like many, did not really cover relational division. Unwittingly, my original query is actually an application of Relational Division.

    Referring to a slide 26-27 in this presentation on relational division, my query applies the technique of comparing set cardinalities. I tried some of the other methods mentioned for applying relational division but, at least in my case, the counting method provides the fastest run-time. I encourage anyone interested in this problem to read the aforementioned slide stack, as well as the article referenced on this page by Mikael Eriksson. Again, thanks to everyone.