Search code examples
sqlpostgresqlrelational-division

Select rows with most similar set of attributes


I have a PostgreSQL 8.3.4 DB to keep information about photo taggings.

First off, my table definitions:

create table photos (
   id integer
 , user_id integer
 , primary key (id, user_id)
); 

create table tags (
   photo_id integer
 , user_id integer
 , tag text
 , primary key (user_id, photo_id, tag)
);

What I'm trying to do + simple example:

I am trying to return all the photos that have at least k other photos with at least j common tags.

I. e., if Photo X has these tags (info field in the tags table):

gold 
clock
family

And photo Y has the next tags:

gold
sun 
family
flower 

X and Y have 2 tags in common. For k = 1 and j = 2 X and Y will be returned.

What I have tried

    SELECT tags1.user_id , users.name, tags1.photo_id
    FROM users, tags tags1, tags tags2
    WHERE ((tags1.info = tags2.info) AND (tags1.photo_id != tags2.photo_id)
    AND (users.id = tags1.user_id))
    GROUP BY tags1.user_id, tags1.photo_id, tags2.user_id, tags2.photo_id, users.name
    HAVING ((count(tags1.info) = <j>) and (count(*) >= <k>))
    ORDER BY user_id asc, photo_id asc

My failed results:

When I tried to run it on those tables:

photos

photo_id       user_id
   0             0
   1             0
   2             0
   20            1
   23            1
   10            3

tags

photo_id     user_id       tag
   0           0            Car
   0           0            Bridge
   0           0            Sky
   20          1            Car
   20          1            Bridge
   10          3            Sky

The result for k = 1 and j = 1:
Expected:

|   user_id    |  User Name   |   photo_id   |
| 0            | Bob          | 0            |
| 1            | Ben          | 20           |
| 3            | Lev          | 10           |

Actual:

|   user_id    |  User Name   |   photo_id   |
| 0            | Bob          | 0            |
| 3            | Lev          | 10           |

For k = 2 and j = 1:
Expected:

|   user_id    |  User Name   |   photo_id   |
| 0            | Bob          | 0            |

Actual: empty result.

For j = 2 and k = 2:
Expected: empty result.

Actual:

|   user_id    |  User Name   |   Photo ID   |
| 0            | Bob          | 0            |
| 1            | Ben          | 20           |

How to solve this properly?


Solution

  • Working with your current design, this uses only basic SQL features and should work for Postgres 8.3, too (untested):

    SELECT *
    FROM   photos p
    WHERE (
       SELECT count(*) >= 1  -- k other photos
       FROM  (
          SELECT 1
          FROM   tags t1
          JOIN   tags t2 USING (tag)
          WHERE  t1.photo_id =  p.id
          AND    t1.user_id  =  p.user_id
          AND   (t2.photo_id <> p.id OR
                 t2.user_id  <> p.user_id)
          GROUP  BY t2.photo_id, t2.user_id
          HAVING count(*) >= 1  -- j common tags
          ) t1
       );
    

    Or:

    SELECT *
    FROM  (
       SELECT id, user_id
       FROM  (
          SELECT t1.photo_id AS id, t1.user_id
          FROM   tags t1
          JOIN   tags t2 USING (tag)
          WHERE (t2.photo_id <> t1.photo_id OR
                 t2.user_id  <> t1.user_id)
          GROUP  BY t1.photo_id, t1.user_id, t2.photo_id, t2.user_id
          HAVING count(*) >= 1    -- j common tags
          ) sub1
       GROUP  BY 1, 2
       HAVING count(*) >= 1  -- k other photos
       ) sub2
    JOIN   photos p USING (id, user_id);
    

    In Postgres 9.3 or later you could use a correlated subquery with a LATERAL join ... The above are probably even faster than my first query:

    SELECT *
    FROM  (
       SELECT photo_id, user_id
       FROM   tags t
       GROUP  BY 1, 2
       HAVING (
          SELECT count(*) >= 1
          FROM  (
             SELECT photo_id, user_id
             FROM   tags
             WHERE  tag = ANY(array_agg(t.tag))
             AND   (photo_id <> t.photo_id OR
                    user_id  <> t.user_id)
             GROUP  BY 1, 2
             HAVING count(*) >= 2
             ) t1
          )
       ) t
    JOIN photos p ON p.id = t.photo_id
                 AND p.user_id = t.user_id;
    

    SQL Fiddle showing both on Postgres 9.3.

    The 1st query just needs the right basic indexes.

    For the 2nd, I would build a materialized view with integer arrays, install the intarray module, a GIN index on the integer array column for better performance ...
    Related:

    Proper design

    It would be much more efficient to have a single column serial PK for photos and only store IDs of tags per photo ...:

    CREATE TABLE  photo (
       photo_id serial PRIMARY KEY
     , user_id  int NOT NULL
    ); 
    
    CREATE TABLE  tag (
       tag_id serial PRIMARY KEY 
     , tag    text UNIQUE NOT NULL
    );
    
    CREATE TABLE  photo_tag (
       photo_id int REFERENCES (photo)
     , tag_id   int REFERENCES (tag)
     , PRIMARY KEY (photo_id, tag_id)
    );
    

    Would make the query much simpler and faster, too.