I have two lists of strings, A and B. For each string in A, I'd like to compare it to every string in B and select the most similar match. The comparison function that I'm using is a custom cosine similarity measure that I found on this question. Here is how it works:
import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('punkt')
stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

def cosine_sim(text1, text2):
    # Fits a fresh tf-idf model on just these two strings, then reads the
    # off-diagonal entry of the 2x2 similarity matrix.
    tfidf = vectorizer.fit_transform([text1, text2])
    return ((tfidf * tfidf.T).A)[0, 1]
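For reference, calling it on two short strings gives a single score, with identical strings coming out as 1.0. The strings below are just made up for illustration:
print(cosine_sim("red apple pie", "red apple tart"))  # partial token overlap -> score below 1.0
print(cosine_sim("red apple pie", "red apple pie"))   # identical strings -> 1.0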
My issue is that with somewhat long lists (500-1000 items), the execution starts to take five or ten minutes. Here's an example using some dummy text:
import requests
url = 'https://gist.githubusercontent.com/WalkerHarrison/940c005aa23386a69282f373f6160221/raw/6537d999b9e39d62df3784d2d847d4a6b2602876/sample.txt'
sample = requests.get(url).text
A, B = sample[:int(len(sample)/2)], sample[int(len(sample)/2):]
A, B = list(map(''.join, zip(*[iter(A)]*100))), list(map(''.join, zip(*[iter(B)]*100)))
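A quick check of the resulting chunks (sizes as described below):
print(len(A), len(B))        # roughly 500 chunks in each list
print(len(A[0]), len(B[0]))  # 100 characters per chunk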
Now that I have two lists, each with ~500 strings (of 100 characters each), I compute the similarities and take the top one. This is done by taking a string from A, iterating through B, sorting by cosine_sim score, taking the last element, and then repeating for every element in A:
matches = [(a, list(sorted([[b, cosine_sim(a, b)]
                            for b in B], key=lambda x: x[1]))[-1])
           for a in A]
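As an aside, the sort isn't strictly needed to take the top match; max() does the same selection, though it still calls cosine_sim once per pair:
matches = [(a, max([[b, cosine_sim(a, b)] for b in B], key=lambda x: x[1]))
           for a in A]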
The output is a list of matches where each item contains both strings as well as their calculated similarity score. That matching comprehension took 7 minutes to run, though. I'm wondering if there are inefficiencies in my process that are slowing it down, or if there's just a lot to compute (500*500 = 250,000 comparisons, plus sorting 500 times to pick the best match)?
The biggest issue is probably that you are fitting a new tf-idf model (vocabulary and idf weights) for every pair of documents (document here merely meaning your unit of text - this could be a tweet, a sentence, a scientific paper, or a book). Also, you shouldn't cook up your own similarity measure if one already exists. Finally, sklearn has a pairwise_distances routine that does what you want and is optimized. Putting this all together, here is a sample script:
import requests
import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import pairwise_distances
import numpy as np
url = 'https://gist.githubusercontent.com/WalkerHarrison/940c005aa23386a69282f373f6160221/raw/6537d999b9e39d62df3784d2d847d4a6b2602876/sample.txt'
sample = requests.get(url).text.split('\n\n') # I'm splitting the document by "paragraphs" since it is unclear what you actually want
stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))
vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
doc_vectors = vectorizer.fit_transform(sample)
distances = pairwise_distances(doc_vectors, metric='cosine')  # cosine distance = 1 - cosine similarity
similarities = 1 - distances
np.fill_diagonal(similarities, 0)  # zero out self-similarity so a document can't match itself
row_idx = list(enumerate(similarities.argmax(axis=1)))  # best partner for each document
sorted_pairs = sorted(row_idx, key=lambda x: similarities[x[0], x[1]], reverse=True)
# most similar documents:
i1, i2 = sorted_pairs[0]  # index of most similar pair
print(sample[i1])
print("=="*50)
print(sample[i2])
There were 99 documents in my sample list, and this ran pretty much instantaneously after the download was complete. Here is the output:
Art party taxidermy locavore 3 wolf moon occupy. Tote bag twee tacos listicle, butcher single-origin coffee raclette gentrify raw denim helvetica kale chips shaman williamsburg man braid. Poke normcore lomo health goth waistcoat kogi. Af next level banh mi, deep v locavore asymmetrical snackwave chillwave. Subway tile viral flexitarian pok pok vegan, cardigan health goth venmo artisan. Iceland next level twee adaptogen, dreamcatcher paleo lyft. Selfies shoreditch microdosing vape, knausgaard hot chicken pitchfork typewriter polaroid lyft skateboard ethical distillery. Farm-to-table blue bottle yr artisan wolf try-hard vegan paleo knausgaard deep v salvia ugh offal snackwave. Succulents taxidermy cornhole wayfarers butcher, street art polaroid jean shorts williamsburg la croix tumblr raw denim. Hot chicken health goth taiyaki truffaut pop-up iceland shoreditch fingerstache.
====================================================================================================
Organic microdosing keytar thundercats chambray, cray raclette. Seitan irony raclette chia, cornhole YOLO stumptown. Gluten-free palo santo beard chia. Whatever bushwick stumptown seitan cred quinoa. Small batch selfies portland, cardigan you probably haven't heard of them shabby chic yr four dollar toast flexitarian palo santo beard offal migas. Kinfolk pour-over glossier, hammock poutine pinterest coloring book kitsch adaptogen wayfarers +1 tattooed lomo yuccie vice. Plaid fixie portland, letterpress knausgaard sartorial live-edge. Austin adaptogen YOLO cloud bread wayfarers cliche hammock banjo. Sustainable organic air plant mustache.
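If you need the original A-versus-B setup (the best match in B for each string in A) rather than the single most similar pair within one list, the same idea carries over: fit the vectorizer once, transform both lists, and compute all cross similarities in one matrix. A minimal sketch, assuming A, B, normalize, and the imports are defined as in the question, and using sklearn's built-in cosine_similarity:
from sklearn.metrics.pairwise import cosine_similarity

# Fit one vocabulary/idf on all the text, then project both lists into it.
vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
vectorizer.fit(A + B)
A_vec = vectorizer.transform(A)   # shape: (len(A), n_features)
B_vec = vectorizer.transform(B)   # shape: (len(B), n_features)

# One len(A) x len(B) matrix of cosine similarities, instead of one tf-idf fit per pair.
sims = cosine_similarity(A_vec, B_vec)
best = sims.argmax(axis=1)        # index of the best match in B for each string in A

matches = [(a, B[j], sims[i, j]) for i, (a, j) in enumerate(zip(A, best))]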