Search code examples
pythondocumentn-gramtf-idfvsm

Simple implementation of N-Gram, tf-idf and Cosine similarity in Python


I need to compare documents stored in a DB and come up with a similarity score between 0 and 1.

The method I need to use has to be very simple. Implementing a vanilla version of n-grams (where it possible to define how many grams to use), along with a simple implementation of tf-idf and Cosine similarity.

Is there any program that can do this? Or should I start writing this from scratch?


Solution

  • Check out NLTK package: http://www.nltk.org it has everything what you need

    For the cosine_similarity:

    
    def cosine_distance(u, v):
        """
        Returns the cosine of the angle between vectors v and u. This is equal to
        u.v / |u||v|.
        """
        return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v))) 
    

    For ngrams:

    
    def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
        """
        A utility that produces a sequence of ngrams from a sequence of items.
        For example:
    
        >>> ngrams([1,2,3,4,5], 3)
        [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
    
        Use ingram for an iterator version of this function.  Set pad_left
        or pad_right to true in order to get additional ngrams:
    
        >>> ngrams([1,2,3,4,5], 2, pad_right=True)
        [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]
    
        @param sequence: the source data to be converted into ngrams
        @type sequence: C{sequence} or C{iterator}
        @param n: the degree of the ngrams
        @type n: C{int}
        @param pad_left: whether the ngrams should be left-padded
        @type pad_left: C{boolean}
        @param pad_right: whether the ngrams should be right-padded
        @type pad_right: C{boolean}
        @param pad_symbol: the symbol to use for padding (default is None)
        @type pad_symbol: C{any}
        @return: The ngrams
        @rtype: C{list} of C{tuple}s
        """
    
        if pad_left:
            sequence = chain((pad_symbol,) * (n-1), sequence)
        if pad_right:
            sequence = chain(sequence, (pad_symbol,) * (n-1))
        sequence = list(sequence)
    
        count = max(0, len(sequence) - n + 1)
        return [tuple(sequence[i:i+n]) for i in range(count)] 
    

    for tf-idf you will have to compute distribution first, I am using Lucene to do that, but you may very well do something similar with NLTK, use FreqDist:

    http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html#frequency_distribution_index_term

    if you like pylucene, this will tell you how to comute tf.idf

        # reader = lucene.IndexReader(FSDirectory.open(index_loc))
        docs = reader.numDocs()
        for i in xrange(docs):
            tfv = reader.getTermFreqVector(i, fieldname)
            if tfv:
                rec = {}
                terms = tfv.getTerms()
                frequencies = tfv.getTermFrequencies()
                for (t,f,x) in zip(terms,frequencies,xrange(maxtokensperdoc)):
                        df= searcher.docFreq(Term(fieldname, t)) # number of docs with the given term
                            tmap.setdefault(t, len(tmap))
                            rec[t] = sim.tf(f) * sim.idf(df, max_doc)  #compute TF.IDF
                # and normalize the values using cosine normalization
                if cosine_normalization:
                    denom = sum([x**2 for x in rec.values()])**0.5
                    for k,v in rec.items():
                        rec[k] = v / denom