Tags: python, text, nlp, similarity, cosine-similarity

Cosine similarity of a new text document with an existing list of documents


I have a dataframe of 1000 text documents with corresponding keywords. I want to extract keywords for a new document by finding the most similar document in the list and taking its keywords.


Solution

  • First load your CSV into a dataframe df, then use the functions below to compute cosine similarity:

    import math
    import re
    from collections import Counter

    def get_cosine(vec1, vec2):
        # Dot product over the words that appear in both vectors
        intersection = set(vec1.keys()) & set(vec2.keys())
        numerator = sum([vec1[x] * vec2[x] for x in intersection])

        # Product of the two vector norms
        sum1 = sum([vec1[x]**2 for x in vec1.keys()])
        sum2 = sum([vec2[x]**2 for x in vec2.keys()])
        denominator = math.sqrt(sum1) * math.sqrt(sum2)

        # An empty document has zero norm; treat its similarity as 0
        if not denominator:
            return 0.0
        else:
            return float(numerator) / denominator

    def text_to_vector(text):
        # Bag-of-words vector: each word mapped to its count
        word = re.compile(r'\w+')
        words = word.findall(text)
        return Counter(words)

    def get_result(content_a, content_b):
        vector1 = text_to_vector(content_a)
        vector2 = text_to_vector(content_b)

        cosine_result = get_cosine(vector1, vector2)
        return cosine_result
    

    Then iterate over df and call the functions as below:

    my_doc = "new document should go in here"
    similarity = []
    for ind in df.index:
        # 'text' is a placeholder; use the name of your document column
        prev_doc = df.loc[ind, 'text']
        cos = get_result(my_doc, prev_doc)
        similarity.append(cos)
    # Computed once, after the loop has scored every document
    max_ind = similarity.index(max(similarity))
    

    max_ind is the position of the most similar document, so its row is df.iloc[max_ind]; the keywords stored in that row are the ones to use for the new document.
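
As a rough end-to-end sketch of the steps above, the following uses only the standard library so it runs without a CSV. The sample `records`, the `most_similar` helper, and the field names `text` and `keywords` are hypothetical stand-ins for your own data; with pandas, `df.to_dict("records")` yields a list in the same shape.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def text_to_vector(text):
    """Bag-of-words vector: each word mapped to its count."""
    return Counter(WORD.findall(text))


def get_cosine(vec1, vec2):
    """Cosine similarity between two word-count vectors."""
    numerator = sum(vec1[w] * vec2[w] for w in set(vec1) & set(vec2))
    denominator = math.sqrt(sum(c * c for c in vec1.values())) * math.sqrt(
        sum(c * c for c in vec2.values())
    )
    return numerator / denominator if denominator else 0.0


def most_similar(new_doc, records):
    """Return (position, score) of the record whose 'text' best matches new_doc."""
    new_vec = text_to_vector(new_doc)  # vectorize the new document only once
    scores = [get_cosine(new_vec, text_to_vector(r["text"])) for r in records]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical sample rows standing in for the dataframe
records = [
    {"text": "the cat sat on the mat", "keywords": ["cat", "mat"]},
    {"text": "stock markets fell sharply today", "keywords": ["stock", "markets"]},
    {"text": "python makes text processing easy", "keywords": ["python", "text"]},
]

best, score = most_similar("a cat on a mat", records)
new_keywords = records[best]["keywords"]  # borrow the closest document's keywords
print(best, round(score, 3), new_keywords)
```

Matching here is case-sensitive and based on raw word counts, as in the functions above; lowercasing the text or removing stop words before counting is a common refinement if the scores look noisy.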