I am trying to build an algorithm that returns a list of images ranked by relevance to a user.
Each user has a list of tags and the number of occurrences of each.
The tags associated with a user look like this (as a map):
photography -> 4
trees -> 3
nature -> 3
snow -> 2
lake -> 2
sky -> 2
In my database, I have a list of images with tags. An example would be:
**Image1**: photography, animals, nature, snow
**Image2**: photography, trees, lake, sky
**Image3**: sky, animals, dark, moon
Using this map of tags, I want to search the database for the images with the highest similarity, giving more weight to tags with more occurrences.
So an image with the tags photography, trees, nature would be given a higher weight than an image with the tags nature, sky, moon.
I tried taking the tag with the most occurrences (e.g. photography) and searching for all images that have it, then searching that result list for the next tag in the map (in this case trees), and then swapping trees for nature since it has the same number of occurrences, thus returning a list of ranked images.
I am doing this in Java and using MySQL to store the user tags and occurrences and the images and tags.
I feel there is a better way to return a ranked list based on this information.
The way I am doing it is by giving each image a score: the sum of the user's occurrence counts for each of the image's tags (0 for a tag the user does not have).
So Image 1 would have a score of photography, animals, nature, snow (4+0+3+2) = 9
Image 2: photography, trees, lake, sky (4+3+2+2) = 11
Image 3: sky, animals, dark, moon (2+0+0+0) = 2
So the ranked list returned would be Image 2, Image 1, Image 3, based on this score.
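For reference, the scoring described above could be sketched in Java roughly as follows. This is only an in-memory illustration, not the actual code: the class name `TagScorer` and the methods `score` and `rank` are made up for the example, the user's tags are assumed to be stored in lowercase, and the data is hard-coded rather than loaded from MySQL.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: scores images against a user's tag counts.
class TagScorer {

    // Sum the user's count for every tag on the image; tags the user lacks count as 0.
    // Assumes the keys of userTags are lowercase.
    static int score(Map<String, Integer> userTags, List<String> imageTags) {
        int total = 0;
        for (String tag : imageTags) {
            total += userTags.getOrDefault(tag.toLowerCase(Locale.ROOT), 0);
        }
        return total;
    }

    // Return the image names ordered by descending score (ties keep insertion order).
    static List<String> rank(Map<String, Integer> userTags,
                             Map<String, List<String>> images) {
        List<String> names = new ArrayList<>(images.keySet());
        names.sort(Comparator
                .comparingInt((String name) -> score(userTags, images.get(name)))
                .reversed());
        return names;
    }

    public static void main(String[] args) {
        Map<String, Integer> userTags = Map.of(
                "photography", 4, "trees", 3, "nature", 3,
                "snow", 2, "lake", 2, "sky", 2);

        Map<String, List<String>> images = new LinkedHashMap<>();
        images.put("Image1", List.of("photography", "animals", "nature", "snow"));
        images.put("Image2", List.of("photography", "trees", "lake", "sky"));
        images.put("Image3", List.of("sky", "animals", "dark", "moon"));

        // Prints: Image2 11, Image1 9, Image3 2 (one per line)
        for (String name : rank(userTags, images)) {
            System.out.println(name + " " + score(userTags, images.get(name)));
        }
    }
}
```

This loads every image into memory, so it is probably only reasonable for small data sets; for anything larger the summing would presumably be better pushed into the database query.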
This is probably exactly why we have NoSQL databases, but it can of course be done in SQL. Create one row for each tag applied (NB: I started with the example set, but gave up trying to represent it all):
create table tagged(who VARCHAR(20), tag VARCHAR(20), what VARCHAR(20));
/* WHAT I tagged */
INSERT INTO TAGGED VALUES('USER','PHOTOGRAPHY','IMAGEX');
INSERT INTO TAGGED VALUES('USER','PHOTOGRAPHY','IMAGEY');
INSERT INTO TAGGED VALUES('USER','PHOTOGRAPHY','IMAGEZ');
INSERT INTO TAGGED VALUES('USER','PHOTOGRAPHY','IMAGEA');
INSERT INTO TAGGED VALUES('USER','TREES','IMAGEB');
INSERT INTO TAGGED VALUES('USER','TREES','IMAGEC');
INSERT INTO TAGGED VALUES('USER','TREES','IMAGED');
INSERT INTO TAGGED VALUES('USER','NATURE','IMAGEE');
INSERT INTO TAGGED VALUES('USER','NATURE','IMAGEF');
INSERT INTO TAGGED VALUES('USER','NATURE','IMAGEG');
/* a subset of interest */
insert into tagged VALUES('JOE','PHOTOGRAPHY','IMAGE1');
insert into tagged VALUES('FRED','ANIMALS','IMAGE1');
insert into tagged VALUES('WILMA','TREES','IMAGE1');
insert into tagged VALUES('WILMA','NATURE','IMAGE1');
Now create a view summarising these (this could be done inline as a subquery, but I find views can aid comprehension at an early stage):
create view popularity as select tag,what,count(*) popularity from tagged group by tag,what;
and now we can select the most popular things based on the above algorithm* like so:
select p.what,sum(p.popularity)
from popularity p,tagged u
where u.who='USER'
and u.tag=p.tag
group by p.what
order by 2 desc;
IMAGE1 10
IMAGEA 4
IMAGEX 4
IMAGEZ 4
IMAGEY 4
IMAGEF 3
IMAGED 3
IMAGEE 3
IMAGEC 3
IMAGEB 3
IMAGEG 3
* This also counts images I've already tagged. I will leave it as an exercise for the reader to exclude these.
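Since the question mentions Java, here is a rough in-memory sketch of the same computation that also skips images the user has already tagged. It is not a translation of the SQL itself: the names `TaggedRanking`, `Row` and `scores` are invented for illustration, and the rows are hard-coded instead of being read from the `tagged` table.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory mirror of the tagged table and the ranking query.
class TaggedRanking {

    // One row of the tagged table: who applied which tag to what.
    static final class Row {
        final String who, tag, what;
        Row(String who, String tag, String what) {
            this.who = who; this.tag = tag; this.what = what;
        }
    }

    // Score every image the user has not tagged yet: each tag on it
    // contributes the number of times the user applied that tag.
    static Map<String, Integer> scores(List<Row> tagged, String user) {
        Map<String, Integer> weight = new HashMap<>(); // tag -> times the user used it
        Set<String> alreadyTagged = new HashSet<>();   // images the user tagged
        for (Row r : tagged) {
            if (r.who.equals(user)) {
                weight.merge(r.tag, 1, Integer::sum);
                alreadyTagged.add(r.what);
            }
        }
        Map<String, Integer> result = new HashMap<>();
        for (Row r : tagged) {
            Integer w = weight.get(r.tag);
            if (w != null && !alreadyTagged.contains(r.what)) {
                result.merge(r.what, w, Integer::sum);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> tagged = new ArrayList<>();
        for (String img : new String[] {"IMAGEX", "IMAGEY", "IMAGEZ", "IMAGEA"}) {
            tagged.add(new Row("USER", "PHOTOGRAPHY", img));
        }
        for (String img : new String[] {"IMAGEB", "IMAGEC", "IMAGED"}) {
            tagged.add(new Row("USER", "TREES", img));
        }
        for (String img : new String[] {"IMAGEE", "IMAGEF", "IMAGEG"}) {
            tagged.add(new Row("USER", "NATURE", img));
        }
        tagged.add(new Row("JOE", "PHOTOGRAPHY", "IMAGE1"));
        tagged.add(new Row("FRED", "ANIMALS", "IMAGE1"));
        tagged.add(new Row("WILMA", "TREES", "IMAGE1"));
        tagged.add(new Row("WILMA", "NATURE", "IMAGE1"));

        // Prints {IMAGE1=10}: only IMAGE1 is left once USER's own images are skipped.
        System.out.println(scores(tagged, "USER"));
    }
}
```

With the sample rows above, IMAGE1 keeps its score of 10 (4 + 3 + 3) and every image USER tagged drops out of the result.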