I have a Django blog, and I am writing a simple similar-text algorithm for it. The code below is what I tested against a copy of my blog's database. (Note: the code was originally in Turkish; I changed the variable names to English for convenience, so some things may look odd.)
# -*- coding:utf-8 -*-
from django.utils.html import strip_tags
import os
import sys
import math
import re
PROJECT_FOLDER = os.path.abspath(os.path.dirname(__file__))
UPPER_FOLDER = os.path.abspath(PROJECT_FOLDER + "/../")
sys.path.append(UPPER_FOLDER)
os.environ["DJANGO_SETTINGS_MODULE"] = "similarity.settings"
from blog.models import Post
def getWords(post_object):
    # Join title, abstract and body, lowercase, strip HTML, split on non-word chars.
    text = post_object.title + " " + post_object.abstract + " " + post_object.post
    text = strip_tags(text.lower())
    regex = re.compile(r"\W+", flags=re.UNICODE)
    return regex.split(text)

def count_things(what_to_count, the_set):
    # Document frequency: in how many posts does the word appear?
    num = 0
    for the_thing in the_set:
        if what_to_count in the_thing[1]:
            num += 1
    return num

a = Post.objects.all()
b = []
for post in a:
    b.append((post.title, getWords(post)))
del a

def adjustWeight(the_list, the_word):
    # Log-scaled term frequency; 0 if the word does not occur.
    numOccr = the_list.count(the_word)
    if numOccr == 0:
        return 0
    else:
        return math.log(numOccr, 1.6)

results = []
uniques = []
for i in range(0, len(b)):
    for a_word in b[i][1]:
        if a_word not in uniques:
            uniques.append(a_word)

for i in range(1, len(b)):
    for j in range(0, i):
        upper_part = 0
        sum1 = 0
        sum2 = 0
        for a_word in uniques:
            adjusted1 = adjustWeight(b[i][1], a_word)
            adjusted2 = adjustWeight(b[j][1], a_word)
            # float() guards against integer division under Python 2
            upper_part += adjusted1 * adjusted2 * math.log(float(len(b)) / count_things(a_word, b))
            sum1 += adjusted1
            sum2 += adjusted2
        lower_part = math.sqrt(sum1 * sum2)
        results.append((b[i][0], b[j][0], upper_part / lower_part))

results = sorted(results, key=lambda x: x[2], reverse=True)
print("\n".join(["%s and %s => %f" % (x, c, v) for x, c, v in results]).encode("utf-8"))
What it does, in a nutshell, is compare all possible pairs of posts and output a similarity report. Now I want to merge this with my blog. However, this code is very expensive, so it needs some optimizing. Here is what I have in mind.
I will have a cron job run a Python file that compares newly added or modified texts with all the other texts and stores the similarity scores in the database for later use.
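A minimal sketch of that cron job's core, independent of Django: compare one new or changed post against the rest and return scored pairs ready to be saved. Here `corpus` stands in for the word lists loaded from the database, the actual table write is left out, and the IDF factor and cosine denominator of the full formula are omitted for brevity. The term weight is a shifted variant of adjustWeight (an assumption on my part, not the original formula), so that words occurring once still get a nonzero weight:

```python
import math

def tf_weight(words, word):
    # Shifted variant of adjustWeight: log(1 + count) in base 1.6, so a
    # word occurring once is not weighted 0 (log(1) would be 0).
    return math.log(1 + words.count(word), 1.6)

def scores_for_post(new_id, corpus):
    """Compare one post against every other post in `corpus`
    (a dict of post_id -> word list); return (other_id, score) pairs,
    highest score first."""
    new_words = corpus[new_id]
    vocab = set(new_words)  # only words of the new post can contribute
    results = []
    for other_id, other_words in corpus.items():
        if other_id == new_id:
            continue
        score = sum(tf_weight(new_words, w) * tf_weight(other_words, w)
                    for w in vocab)
        results.append((other_id, score))
    return sorted(results, key=lambda p: p[1], reverse=True)

corpus = {
    1: ["django", "blog", "search"],
    2: ["django", "search", "index"],
    3: ["cooking", "recipes"],
}
print(scores_for_post(1, corpus))
```

The cron job would then insert or update one similarity row per pair, so page views only read precomputed scores.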
Another thing I have in mind is to create another table and build a simple index in it, like this: "post id", "word", "number of occurrences". Then, instead of reading a post and counting its words every time, I would just read that data from the database, where everything is already computed.
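That "post id / word / occurrences" table maps naturally onto a dict of Counters in memory; in Django it would be a model with those three columns (names hypothetical). A sketch of building and querying such an index with collections.Counter:

```python
from collections import Counter

def build_index(posts):
    """posts: dict of post_id -> word list.
    Returns post_id -> Counter, i.e. the same information as the
    "post id / word / occurrences" rows in the proposed table."""
    return {pid: Counter(words) for pid, words in posts.items()}

def occurrences(index, post_id, word):
    # One dict lookup instead of re-stripping and re-splitting the post text.
    return index[post_id][word]

index = build_index({
    1: ["django", "blog", "blog"],
    2: ["search", "index"],
})
print(occurrences(index, 1, "blog"))  # → 2
```

A Counter returns 0 for absent words, which matches the "word does not occur" branch of adjustWeight, so the counting step drops out of the inner loop entirely.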
I was wondering what you think about this. I wanted to get other people's ideas, since I am not an expert on the issue.
If you want to do text-similarity-based searching, you are better off going with a search server like Sphinx: http://sphinxsearch.com/
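For illustration, a Sphinx setup boils down to pointing a source at the posts table and building an index over it; the credentials, table name, and paths below are placeholders, not values from the question:

```
source blog_posts
{
    type      = mysql
    sql_host  = localhost
    sql_user  = blog
    sql_pass  = secret
    sql_db    = blog
    # id must come first; the remaining columns become the indexed text fields
    sql_query = SELECT id, title, abstract, post FROM blog_post
}

index blog_posts
{
    source = blog_posts
    path   = /var/data/sphinx/blog_posts
}
```

Sphinx then handles the tokenizing, weighting, and ranking that the hand-rolled script does, and "similar posts" becomes a query built from the current post's keywords.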