I have a PostgreSQL 8.3.4 DB to keep information about photo taggings.
First off, my table definitions:
create table photos (
id integer
, user_id integer
, primary key (id, user_id)
);
create table tags (
photo_id integer
, user_id integer
, tag text
, primary key (user_id, photo_id, tag)
);
What I'm trying to do + simple example:
I am trying to return all the photos that have at least k other photos with at least j common tags.
I. e., if Photo X has these tags (info field in the tags table):
gold
clock
family
And photo Y has the next tags:
gold
sun
family
flower
X and Y have 2 tags in common. For k = 1
and j = 2
X and Y will be returned.
What I have tried
SELECT tags1.user_id , users.name, tags1.photo_id
FROM users, tags tags1, tags tags2
WHERE ((tags1.info = tags2.info) AND (tags1.photo_id != tags2.photo_id)
AND (users.id = tags1.user_id))
GROUP BY tags1.user_id, tags1.photo_id, tags2.user_id, tags2.photo_id, users.name
HAVING ((count(tags1.info) = <j>) and (count(*) >= <k>))
ORDER BY user_id asc, photo_id asc
My failed results:
When I tried to run it on those tables:
photos
photo_id user_id
0 0
1 0
2 0
20 1
23 1
10 3
tags
photo_id user_id tag
0 0 Car
0 0 Bridge
0 0 Sky
20 1 Car
20 1 Bridge
10 3 Sky
The result for k = 1
and j = 1
:
Expected:
| user_id | User Name | photo_id |
| 0 | Bob | 0 |
| 1 | Ben | 20 |
| 3 | Lev | 10 |
Actual:
| user_id | User Name | photo_id |
| 0 | Bob | 0 |
| 3 | Lev | 10 |
For k = 2
and j = 1
:
Expected:
| user_id | User Name | photo_id |
| 0 | Bob | 0 |
Actual: empty result.
For j = 2
and k = 2
:
Expected: empty result.
Actual:
| user_id | User Name | Photo ID |
| 0 | Bob | 0 |
| 1 | Ben | 20 |
How to solve this properly?
Working with your current design, this uses only basic SQL features and should work for Postgres 8.3, too (untested):
SELECT *
FROM photos p
WHERE (
SELECT count(*) >= 1 -- k other photos
FROM (
SELECT 1
FROM tags t1
JOIN tags t2 USING (tag)
WHERE t1.photo_id = p.id
AND t1.user_id = p.user_id
AND (t2.photo_id <> p.id OR
t2.user_id <> p.user_id)
GROUP BY t2.photo_id, t2.user_id
HAVING count(*) >= 1 -- j common tags
) t1
);
Or:
SELECT *
FROM (
SELECT id, user_id
FROM (
SELECT t1.photo_id AS id, t1.user_id
FROM tags t1
JOIN tags t2 USING (tag)
WHERE (t2.photo_id <> t1.photo_id OR
t2.user_id <> t1.user_id)
GROUP BY t1.photo_id, t1.user_id, t2.photo_id, t2.user_id
HAVING count(*) >= 1 -- j common tags
) sub1
GROUP BY 1, 2
HAVING count(*) >= 1 -- k other photos
) sub2
JOIN photos p USING (id, user_id);
In Postgres 9.3 or later you could use a correlated subquery with a LATERAL
join ...
The above are probably even faster than my first query:
SELECT *
FROM (
SELECT photo_id, user_id
FROM tags t
GROUP BY 1, 2
HAVING (
SELECT count(*) >= 1
FROM (
SELECT photo_id, user_id
FROM tags
WHERE tag = ANY(array_agg(t.tag))
AND (photo_id <> t.photo_id OR
user_id <> t.user_id)
GROUP BY 1, 2
HAVING count(*) >= 2
) t1
)
) t
JOIN photos p ON p.id = t.photo_id
AND p.user_id = t.user_id;
SQL Fiddle showing both on Postgres 9.3.
The 1st query just needs the right basic indexes.
For the 2nd, I would build a materialized view with integer arrays, install the intarray module, a GIN index on the integer array column for better performance ...
Related:
It would be much more efficient to have a single column serial PK for photos
and only store IDs of tags per photo ...:
CREATE TABLE photo (
photo_id serial PRIMARY KEY
, user_id int NOT NULL
);
CREATE TABLE tag (
tag_id serial PRIMARY KEY
, tag text UNIQUE NOT NULL
);
CREATE TABLE photo_tag (
photo_id int REFERENCES (photo)
, tag_id int REFERENCES (tag)
, PRIMARY KEY (photo_id, tag_id)
);
Would make the query much simpler and faster, too.