I have a program that pulls addresses off the internet and checks them against a database. It is useful, but I'm now trying to add a similarity function that compares each address from the internet against the address in my database.
I'm using the script below to check how well cosine similarity compares the addresses:
import string
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
addresses = [
    '705 Sherlock House, 221B Baker Street, London NW1 6XE',
    '75 Sherlock House, 221B Baker Street, London NW1 6XE',
    'Apartment 704 Sherlock House, 221B Baker Street, London NW1 6XE',
    'Apartment 705 Sherlock House, 221B Baker Street, London NW1 6XE',
    '705, 221B Baker Street, London NW1 6XE',
    '75, 221B Baker Street, London NW1 6XE',
    '705 Watson House, 219 Baker Street, London NW1 6XE',
    '32 Baker Street, London NW1 6XE',
    '1060 West Addison, London, W2 6SR',
    '705 Sherlock Hse, Baker Street, London, NW1'
]
def clean_address(text):
    # Drop punctuation character by character, then lowercase the result
    text = ''.join([char for char in text if char not in string.punctuation])
    text = text.lower()
    return text
cleaned = list(map(clean_address, addresses))
vectorizer = CountVectorizer()
transformedVectorizer = vectorizer.fit_transform(cleaned)
vectors = transformedVectorizer.toarray()
csim = cosine_similarity(vectors)
def cosine_sim_vectors(vec1, vec2):
    # cosine_similarity expects 2D input, so reshape each 1D vector into a single row
    vec1 = vec1.reshape(1, -1)
    vec2 = vec2.reshape(1, -1)
    return cosine_similarity(vec1, vec2)[0][0]
# Compare the first address against each of the others
for other, vector in zip(addresses[1:], vectors[1:]):
    similarity = cosine_sim_vectors(vectors[0], vector)
    print("{} is {:.1f}% similar to {}".format(addresses[0], similarity * 100, other))
The output is:
705 Sherlock House, 221B Baker Street, London NW1 6XE is 88.9% similar to 75 Sherlock House, 221B Baker Street, London NW1 6XE
705 Sherlock House, 221B Baker Street, London NW1 6XE is 84.3% similar to Apartment 704 Sherlock House, 221B Baker Street, London NW1 6XE
705 Sherlock House, 221B Baker Street, London NW1 6XE is 94.9% similar to Apartment 705 Sherlock House, 221B Baker Street, London NW1 6XE
705 Sherlock House, 221B Baker Street, London NW1 6XE is 88.2% similar to 705, 221B Baker Street, London NW1 6XE
705 Sherlock House, 221B Baker Street, London NW1 6XE is 75.6% similar to 75, 221B Baker Street, London NW1 6XE
705 Sherlock House, 221B Baker Street, London NW1 6XE is 77.8% similar to 705 Watson House, 219 Baker Street, London NW1 6XE
705 Sherlock House, 221B Baker Street, London NW1 6XE is 68.0% similar to 32 Baker Street, London NW1 6XE
705 Sherlock House, 221B Baker Street, London NW1 6XE is 13.6% similar to 1060 West Addison, London, W2 6SR
705 Sherlock House, 221B Baker Street, London NW1 6XE is 75.6% similar to 705 Sherlock Hse, Baker Street, London, NW1
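These percentages can be reproduced by hand, which may help when judging the results. The working below assumes CountVectorizer's default tokenizer (tokens of two or more alphanumeric characters) and relies on the fact that no token repeats within any of these addresses, so every count vector is binary. The symbols A and B, standing for the token sets of the two addresses, are notation introduced here for illustration.

```latex
\[
\cos(A, B) = \frac{|A \cap B|}{\sqrt{|A|\,|B|}}
\]
\[
\text{75 Sherlock House: } \frac{8}{\sqrt{9 \cdot 9}} \approx 0.889,
\qquad
\text{Apartment 705 Sherlock House: } \frac{9}{\sqrt{9 \cdot 10}} \approx 0.949
\]
```

In the first case the two addresses have 9 tokens each and share 8; in the second they have 9 and 10 tokens and share 9.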
It's done a reasonable job, since I'm probably going to eyeball anything over 60-70%, and I'm impressed it almost caught my deliberate attempts to trick it with 705 Watson House and 705 Sherlock Hse. But I do think the algorithm would improve if it recognised, for example, that 705 is more important to compare than London or (given I could just remove London) 6XE.
I'm also open to using other similarity functions if a more appropriate one exists. I understand that this approach converts each string to a vector of token counts and essentially weights every token equally.
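For reference, one way token weighting could be sketched is to scale each token's count before taking the cosine. Everything here is an assumption for illustration rather than a recommendation: the helper names `tokenize`, `token_weight` and `weighted_cosine` are hypothetical, the tokenizer only approximates CountVectorizer's default, and the weight of 3 for purely numeric tokens is an arbitrary choice.

```python
import math
import re
from collections import Counter

def tokenize(address):
    # Lowercase and keep alphanumeric runs of 2+ characters,
    # approximating CountVectorizer's default token pattern
    return re.findall(r"\b\w\w+\b", address.lower())

def token_weight(token):
    # Hypothetical weighting: purely numeric tokens (unit or house numbers)
    # count three times as much as any other token
    return 3.0 if token.isdigit() else 1.0

def weighted_cosine(addr1, addr2):
    # Scale each token count by its weight, then compute the usual cosine
    v1 = {t: c * token_weight(t) for t, c in Counter(tokenize(addr1)).items()}
    v2 = {t: c * token_weight(t) for t, c in Counter(tokenize(addr2)).items()}
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

target = '705 Sherlock House, 221B Baker Street, London NW1 6XE'
candidates = [
    '75 Sherlock House, 221B Baker Street, London NW1 6XE',
    'Apartment 705 Sherlock House, 221B Baker Street, London NW1 6XE',
]
for other in candidates:
    print("{:.1f}% similar to {}".format(weighted_cosine(target, other) * 100, other))
```

With these made-up weights, the '75 Sherlock House' score drops to about 47.1% and the 'Apartment 705 Sherlock House' score rises to about 97.2%, so a mismatched unit number is penalised much more heavily. Whether that is an improvement depends on the data.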
There was no merit in giving one part of my address string more weight than another; cosine similarity does the job out of the box.
Cosine similarity is a better fit for this purpose than string edit distance. Intuitively, '75 Sherlock House, 221B Baker Street, London NW1 6XE' should not count as more similar to '705 Sherlock House, 221B Baker Street, London NW1 6XE' than 'Apartment 705 Sherlock House, 221B Baker Street, London NW1 6XE' does. Edit distance gets this wrong because the first pair differs by a single character, whereas cosine similarity catches the intuition (88.9% versus 94.9% in the output above).
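The comparison with edit distance can be checked with a small, standard-library-only sketch. The `levenshtein` and `token_cosine` helpers are written here for illustration and are not part of the original script; `token_cosine` only approximates the CountVectorizer pipeline by treating alphanumeric runs of two or more characters as tokens.

```python
import math
import re
from collections import Counter

def levenshtein(a, b):
    # Dynamic-programming edit distance (insert, delete, substitute),
    # keeping only the previous row of the table
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def token_cosine(a, b):
    # Cosine similarity over token counts of the two strings
    c1 = Counter(re.findall(r"\b\w\w+\b", a.lower()))
    c2 = Counter(re.findall(r"\b\w\w+\b", b.lower()))
    dot = sum(c1[t] * c2[t] for t in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

target = '705 Sherlock House, 221B Baker Street, London NW1 6XE'
wrong_unit = '75 Sherlock House, 221B Baker Street, London NW1 6XE'
same_unit = 'Apartment 705 Sherlock House, 221B Baker Street, London NW1 6XE'

for other in (wrong_unit, same_unit):
    print("edit distance {:2d}, cosine {:.1f}%: {}".format(
        levenshtein(target, other), token_cosine(target, other) * 100, other))
```

Edit distance puts the wrong unit number (distance 1) far closer to the target than the same unit with 'Apartment ' prepended (distance 10), while the token-based cosine ranks them the other way round.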