Tags: python-2.7, nltk, tf-idf, sklearn-pandas

Counting matrix pairs using a threshold


I have a folder with hundreds of txt files I need to analyse for similarity. Below is an example of a script I use to run similarity analysis. In the end I get a similarity matrix (a numpy array) that I can plot, etc.

I would like to see how many pairs there are with cos_similarity > 0.5 (or any other threshold I decide to use), excluding the cos_similarity == 1 entries where a file is compared with itself, of course.

Secondly, I need a list of these pairs, identified by file name.

So the output for the example below would look like:

1

and

["doc1", "doc4"]

I would really appreciate your help, as I feel a bit lost and don't know which direction to go.

This is an example of my script to get the matrix:

doc1 = "Amazon's promise of next-day deliveries could be investigated amid customer complaints that it is failing to meet that pledge."
doc2 = "The BBC has been inundated with comments from Amazon Prime customers. Most reported problems with deliveries."
doc3 = "An Amazon spokesman told the BBC the ASA had confirmed to it there was no investigation at this time."
doc4 = "Amazon's promise of next-day deliveries could be investigated amid customer complaints..."
documents = [doc1, doc2, doc3, doc4]

# In my real script I iterate through a folder (path) with txt files like this:
#import glob
#def read_text(path):
#    documents = []
#    for filename in glob.iglob(path+'*.txt'):
#        with open(filename, 'r') as _file:  # 'with' closes the file automatically
#            documents.append(_file.read())
#    return documents
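
Since the pairs ultimately need to be reported by file name, one variant worth considering is returning the names alongside the texts. This is just a sketch, with read_texts_with_names as a hypothetical helper using the same glob pattern as above:

import glob
import os

def read_texts_with_names(path):
    documents, names = [], []
    for filename in glob.iglob(path + '*.txt'):
        with open(filename, 'r') as f:  # closes the file even if read() fails
            documents.append(f.read())
        names.append(os.path.basename(filename))  # keep 'doc1.txt', drop the path
    return documents, names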

import nltk, string, numpy
nltk.download('punkt') # first-time use only
# stemming helpers (not actually used below; LemNormalize is what the vectorizers use)
stemmer = nltk.stem.porter.PorterStemmer()
def StemTokens(tokens):
    return [stemmer.stem(token) for token in tokens]
# maps every punctuation character to None, so translate() strips it
# (note: dict-based translate requires unicode strings on Python 2.7)
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def StemNormalize(text):
    return StemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

nltk.download('wordnet') # first-time use only
lemmer = nltk.stem.WordNetLemmatizer()
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]
# remove_punct_dict was already defined above, no need to redefine it
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
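
As a quick illustration (assuming the code above has run), LemNormalize lowercases, strips punctuation, tokenizes, and lemmatizes, producing roughly:

LemNormalize("The cats were running to their deliveries.")
# -> ['the', 'cat', 'were', 'running', 'to', 'their', 'delivery']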

from sklearn.feature_extraction.text import CountVectorizer
LemVectorizer = CountVectorizer(tokenizer=LemNormalize, stop_words='english')
tf_matrix = LemVectorizer.fit_transform(documents).toarray()  # raw term counts

from sklearn.feature_extraction.text import TfidfTransformer
tfidfTran = TfidfTransformer(norm="l2")
tfidfTran.fit(tf_matrix)
tfidf_matrix = tfidfTran.transform(tf_matrix)
# rows are L2-normalized, so this dot product is the cosine-similarity matrix
cos_similarity_matrix = (tfidf_matrix * tfidf_matrix.T).toarray()
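
The dot product only works as a cosine similarity here because norm="l2" makes every row unit-length. As a sanity check (not part of the original script), scikit-learn's built-in cosine_similarity produces the same matrix directly:

from sklearn.metrics.pairwise import cosine_similarity
# cosine_similarity normalizes internally, so it matches the dot-product
# result above; `check` is just a throwaway name for the comparison
check = cosine_similarity(tfidf_matrix)
numpy.allclose(check, cos_similarity_matrix)  # True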

from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
def cos_similarity(textlist):
    tfidf = TfidfVec.fit_transform(textlist)  # counts + tf-idf in one step
    return (tfidf * tfidf.T).toarray()
cos_similarity(documents)

Out:

array([[ 1.        ,  0.1459739 ,  0.03613371,  0.76357693],
       [ 0.1459739 ,  1.        ,  0.11459266,  0.19117117],
       [ 0.03613371,  0.11459266,  1.        ,  0.04732164],
       [ 0.76357693,  0.19117117,  0.04732164,  1.        ]])

Solution

  • As I understood your question, you want to create a function that reads the output numpy array and a certain value (threshold), and returns two things:

    • how many document pairs have a similarity greater than or equal to the given threshold;
    • the names of the documents in those pairs.

    So, I've made the following function, which takes three arguments:

    • the output numpy array from the cos_similarity() function,
    • a list of document names,
    • a certain number (the threshold).

    And here it is:

    def get_docs(arr, docs_names, threshold):
        output_tuples = []
        for row in range(len(arr)):
            # look only at the columns to the right of the diagonal (the upper
            # triangle), so each pair is counted once and the 1.0 diagonal
            # entries are skipped
            lst = [row+1+idx for idx, num in
                      enumerate(arr[row, row+1:]) if num >= threshold]
            for item in lst:
                output_tuples.append( (docs_names[row], docs_names[item]) )

        return len(output_tuples), output_tuples
    

    Let's see it in action:

    >>> docs_names = ["doc1", "doc2", "doc3", "doc4"]
    >>> arr = cos_similarity(documents)
    >>> arr
    array([[ 1.        ,  0.1459739 ,  0.03613371,  0.76357693],
           [ 0.1459739 ,  1.        ,  0.11459266,  0.19117117],
           [ 0.03613371,  0.11459266,  1.        ,  0.04732164],
           [ 0.76357693,  0.19117117,  0.04732164,  1.        ]])
    >>> threshold = 0.5   
    >>> get_docs(arr, docs_names, threshold)
    (1, [('doc1', 'doc4')])
    >>> get_docs(arr, docs_names, 1)
    (0, [])
    >>> get_docs(arr, docs_names, 0.13)
    (3, [('doc1', 'doc2'), ('doc1', 'doc4'), ('doc2', 'doc4')])
    

    Let's see how this function works:

    • First, I iterate over every row of the numpy array.
    • Second, I iterate over every item in the row whose index is bigger than the row's index; in other words, over the upper triangle of the matrix. That's because each pair of documents appears twice in the full array: the two values arr[0][1] and arr[1][0] are the same. Notice also that the diagonal items aren't included, since we know for sure they are 1, as every document is perfectly similar to itself :).
    • Finally, we collect the items whose values are greater than or equal to the given threshold and return their indices. These indices are used later to look up the document names.
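
    For hundreds of files the Python loop above is perfectly fine, but the same upper-triangle idea can also be vectorized with numpy. This is only a sketch of an alternative, not part of the original answer, and get_docs_vectorized is my own name:

    import numpy as np

    def get_docs_vectorized(arr, docs_names, threshold):
        # row/column indices of the strict upper triangle (k=1 skips the diagonal)
        rows, cols = np.triu_indices(len(arr), k=1)
        mask = arr[rows, cols] >= threshold       # boolean mask of qualifying pairs
        pairs = [(docs_names[r], docs_names[c])
                 for r, c in zip(rows[mask], cols[mask])]
        return len(pairs), pairs

    Because np.triu_indices walks the upper triangle in the same row-major order as the loop version, it returns exactly the same (count, pairs) tuple as get_docs.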