Tags: python, optimization, twitter, nlp

Reducing runtime for cosine similarity calculations between 2 lists in Python


I'm assembling a Twitter hashtag dictionary in Python. The keys are the hashtags themselves, and each value is the collection of tweets containing that hashtag, appended end-to-end. I also have a separate list of all the hashtagless tweets, and I'm adding each one to a dictionary entry according to cosine similarity. Everything works, but it is VERY slow (a few hours for 4000 tweets); the nested for loops give me O(N^2) runtime. Does anyone have ideas on how I could improve my runtime? Any suggestions will be greatly appreciated!

    taglessVects = normalize(vectorizer.transform(needTags))
    dictVects = normalize(vectorizer.transform(newDict))

    #newDict contains: newDict[hashtag]: "tweets that used that hashtag"
    #needTags is a list of all the tweets that didn't use a hashtag
    for dVect, entry in zip(dictVects, newDict):
        for taglessVect, tweet in zip(taglessVects, needTags):
            if cosine_similarity(taglessVect, dVect) > .9:
                newDict[entry] = newDict[entry] + ' ' + tweet


    return newDict
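
For context, a minimal sketch of how the inputs above might be built (the question does not show this part, so the TfidfVectorizer and the sample data here are assumptions; newDict.values() is used on the assumption that the concatenated tweet text, rather than the hashtag keys, is what gets vectorized):

    # imports needed by the snippet above
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.preprocessing import normalize

    # newDict maps each hashtag to all tweets that used it, appended end-to-end
    newDict = {
        '#python': 'loving list comprehensions today generators are great too',
        '#nlp': 'tokenizing tweets is messy emoji everywhere',
    }
    # needTags is the list of tweets that have no hashtag yet
    needTags = ['list comprehensions are great', 'these emoji break my tokenizer']

    # fit the vocabulary on all of the text so both matrices share the same columns
    vectorizer = TfidfVectorizer()
    vectorizer.fit(list(newDict.values()) + needTags)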


Solution

  • You have created a brute-force nearest-neighbor search that uses cosine distance as the metric. The scikit-learn documentation on nearest neighbors covers this topic well.

    scikit-learn implements more optimized versions, which should be much faster. You can use them, but you will need to change how your dictionary is used: you want some way to map each returned neighbor index back to the corresponding tweet or hashtag entry (one way to do that is sketched after the snippet below).

    from sklearn.neighbors import NearestNeighbors

    # With a cosine metric, sklearn falls back to a vectorized brute-force
    # search, which is still far faster than the Python-level nested loops.
    neigh = NearestNeighbors(n_neighbors=1, metric='cosine')
    neigh.fit(dictVects)

    nearest = neigh.kneighbors(taglessVects, return_distance=False)
    for x, idx in zip(taglessVects, nearest):
        z = idx[0]
        # dictVects[z] is the nearest hashtag-entry vector to the tagless vector x
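
    Assuming dictVects was built by iterating newDict in its natural order, here is a minimal sketch (not part of the original answer) of mapping the returned indices back to hashtag keys, with the question's 0.9 similarity cutoff expressed as a cosine distance below 0.1:

    hashtags = list(newDict)  # row i of dictVects corresponds to hashtags[i]

    distances, indices = neigh.kneighbors(taglessVects)
    for tweet, dist, idx in zip(needTags, distances, indices):
        # cosine similarity > 0.9 is the same as cosine distance < 0.1
        if dist[0] < 0.1:
            tag = hashtags[idx[0]]
            newDict[tag] = newDict[tag] + ' ' + tweet

    If a tweet should be appended to every entry above the threshold, as the original nested loop does, neigh.radius_neighbors(taglessVects, radius=0.1) returns all neighbors within that cosine distance rather than only the closest one.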