Search code examples
pythonsqlpeewee

Searching for items in a many-to-many relationship


I'm currently writing an application that allows one to store images, and then tag these images. I'm using Python and the Peewee ORM (http://charlesleifer.com/docs/peewee/), which is very similar to Django's ORM.

My data model looks like this (simplified):

class Image(BaseModel):
    key = CharField()

class Tag(BaseModel):
    tag = CharField()

class TagRelationship(BaseModel):
    relImage = ForeignKeyField(Image)
    relTag   = ForeignKeyField(Tag)

Now, I understand conceptually how to query for all Images that have a given set of tags:

SELECT Image.key
  FROM Image
INNER JOIN TagRelationship
    ON Image.ID = TagRelationship.ImageID
INNER JOIN Tag
    ON TagRelationship.TagID = Tag.ID
 WHERE Tag.tag
       IN ( 'A' , 'B' )     -- list of multiple tags
GROUP BY Image.key
HAVING COUNT(*) = 2         -- where 2 == the number of tags specified, above

However, I also want to be able to do more complex searches. Specifically, I'd like to be able to specify a list of "all tags" - i.e. an image must have all of the specified tags to be returned, along with a list of "any" and a list of "none".

EDIT: I'd like to clarify this a bit. Specifically, the above query is an "all tags"-style query. It returns Images that have all the given tags. I want to be able to specify something like: "Give me all images that have the tags (green, mountain), any one of the tags (background, landscape) but not the tags (digital, drawing)".

Now, ideally, I'd like this to be one SQL query, because pagination then becomes very easy with LIMIT and OFFSET. I've actually got an implementation working whereby I just load everything into Python sets and then use the various intersection operators. What I'm wondering is if there's a method of doing this all at once?

Also, for those interested, I've emailed the author of Peewee about how to represent the above query using Peewee, and he responded with the following solution:

Image.select(['key']).group_by('key').join(TagRelationship).join(Tag).where(tag__in=['tag1', 'tag2']).having('count(*) = 2')

Or, alternatively, a shorter version:

Image.filter(tagrelationship_set__relTag__tag__in=['tag1', 'tag2']).group_by(Image).having('count(*) = 2')

Thanks in advance for your time.


Solution

  • SELECT Image.key
      FROM Image
      JOIN TagRelationship
        ON Image.ID = TagRelationship.ImageID
      JOIN Tag
        ON TagRelationship.TagID = Tag.ID
     GROUP BY Image.key
    HAVING SUM(Tag.tag IN (mandatory tags )) = N  /*the number of mandatory tags*/
       AND SUM(Tag.tag IN (optional tags  )) > 0
       AND SUM(Tag.tag IN (prohibited tags)) = 0
    

    UPDATE

    A more universally accepted version of the above query (converts the boolean results of the IN predicates into integers using CASE expressions):

    SELECT Image.key
      FROM Image
      JOIN TagRelationship
        ON Image.ID = TagRelationship.ImageID
      JOIN Tag
        ON TagRelationship.TagID = Tag.ID
     GROUP BY Image.key
    HAVING SUM(CASE WHEN Tag.tag IN (mandatory tags ) THEN 1 ELSE 0 END) = N  /*the number of mandatory tags*/
       AND SUM(CASE WHEN Tag.tag IN (optional tags  ) THEN 1 ELSE 0 END) > 0
       AND SUM(CASE WHEN Tag.tag IN (prohibited tags) THEN 1 ELSE 0 END) = 0
    

    or with COUNTs instead of SUMs:

    SELECT Image.key
      FROM Image
      JOIN TagRelationship
        ON Image.ID = TagRelationship.ImageID
      JOIN Tag
        ON TagRelationship.TagID = Tag.ID
     GROUP BY Image.key
    HAVING COUNT(CASE WHEN Tag.tag IN (mandatory tags ) THEN 1 END) = N  /*the number of mandatory tags*/
       AND COUNT(CASE WHEN Tag.tag IN (optional tags  ) THEN 1 END) > 0
       AND COUNT(CASE WHEN Tag.tag IN (prohibited tags) THEN 1 END) = 0