Select rows with most similar set of attributes

I have a PostgreSQL 8.3.4 DB to keep information about photo taggings.

First off, my table definitions:

create table photos (
   id integer
 , user_id integer
 , primary key (id, user_id)
); 

create table tags (
   photo_id integer
 , user_id integer
 , tag text
 , primary key (user_id, photo_id, tag)
);

What I'm trying to do + simple example:

I am trying to return all the photos that have at least k other photos with at least j common tags.

I. e., if Photo X has these tags (info field in the tags table):

gold 
clock
family

And photo Y has the next tags:

gold
sun 
family
flower

X and Y have 2 tags in common. For k = 1 and j = 2 X and Y will be returned.

What I have tried

    SELECT tags1.user_id , users.name, tags1.photo_id
    FROM users, tags tags1, tags tags2
    WHERE ((tags1.info = tags2.info) AND (tags1.photo_id != tags2.photo_id)
    AND (users.id = tags1.user_id))
    GROUP BY tags1.user_id, tags1.photo_id, tags2.user_id, tags2.photo_id, users.name
    HAVING ((count(tags1.info) = <j>) and (count(*) >= <k>))
    ORDER BY user_id asc, photo_id asc

My failed results:

When I tried to run it on those tables:

photos

photo_id       user_id
   0             0
   1             0
   2             0
   20            1
   23            1
   10            3

tags

photo_id     user_id       tag
   0           0            Car
   0           0            Bridge
   0           0            Sky
   20          1            Car
   20          1            Bridge
   10          3            Sky

The result for k = 1 and j = 1:
Expected:

|   user_id    |  User Name   |   photo_id   |
| 0            | Bob          | 0            |
| 1            | Ben          | 20           |
| 3            | Lev          | 10           |

Actual:

|   user_id    |  User Name   |   photo_id   |
| 0            | Bob          | 0            |
| 3            | Lev          | 10           |

For k = 2 and j = 1:
Expected:

|   user_id    |  User Name   |   photo_id   |
| 0            | Bob          | 0            |

Actual: empty result.

For j = 2 and k = 2:
Expected: empty result.

Actual:

|   user_id    |  User Name   |   Photo ID   |
| 0            | Bob          | 0            |
| 1            | Ben          | 20           |

How to solve this properly?

Solution

Working with your current design, this uses only basic SQL features and should work for Postgres 8.3, too (untested):

SELECT *
FROM   photos p
WHERE (
   SELECT count(*) >= 1  -- k other photos
   FROM  (
      SELECT 1
      FROM   tags t1
      JOIN   tags t2 USING (tag)
      WHERE  t1.photo_id =  p.id
      AND    t1.user_id  =  p.user_id
      AND   (t2.photo_id <> p.id OR
             t2.user_id  <> p.user_id)
      GROUP  BY t2.photo_id, t2.user_id
      HAVING count(*) >= 1  -- j common tags
      ) t1
   );

Or:

SELECT *
FROM  (
   SELECT id, user_id
   FROM  (
      SELECT t1.photo_id AS id, t1.user_id
      FROM   tags t1
      JOIN   tags t2 USING (tag)
      WHERE (t2.photo_id <> t1.photo_id OR
             t2.user_id  <> t1.user_id)
      GROUP  BY t1.photo_id, t1.user_id, t2.photo_id, t2.user_id
      HAVING count(*) >= 1    -- j common tags
      ) sub1
   GROUP  BY 1, 2
   HAVING count(*) >= 1  -- k other photos
   ) sub2
JOIN   photos p USING (id, user_id);

In Postgres 9.3 or later you could use a correlated subquery with a LATERAL join ... The above are probably even faster than my first query:

SELECT *
FROM  (
   SELECT photo_id, user_id
   FROM   tags t
   GROUP  BY 1, 2
   HAVING (
      SELECT count(*) >= 1
      FROM  (
         SELECT photo_id, user_id
         FROM   tags
         WHERE  tag = ANY(array_agg(t.tag))
         AND   (photo_id <> t.photo_id OR
                user_id  <> t.user_id)
         GROUP  BY 1, 2
         HAVING count(*) >= 2
         ) t1
      )
   ) t
JOIN photos p ON p.id = t.photo_id
             AND p.user_id = t.user_id;

SQL Fiddle showing both on Postgres 9.3.

The 1st query just needs the right basic indexes.

For the 2nd, I would build a materialized view with integer arrays, install the intarray module, a GIN index on the integer array column for better performance ...
Related:

Order result by count of common array elements

Proper design

It would be much more efficient to have a single column serial PK for photos and only store IDs of tags per photo ...:

CREATE TABLE  photo (
   photo_id serial PRIMARY KEY
 , user_id  int NOT NULL
); 

CREATE TABLE  tag (
   tag_id serial PRIMARY KEY 
 , tag    text UNIQUE NOT NULL
);

CREATE TABLE  photo_tag (
   photo_id int REFERENCES (photo)
 , tag_id   int REFERENCES (tag)
 , PRIMARY KEY (photo_id, tag_id)
);

Would make the query much simpler and faster, too.

How to implement a many-to-many relationship in PostgreSQL?