Tags: python, vector, tf-idf, sentence-similarity, latent-semantic-analysis

Semantic Similarity between Sentences in a Text


I have used material from here and a previous forum page to write a program that automatically calculates the semantic similarity between consecutive sentences across a whole text.

The first part of the code is copied and pasted from the first link. I removed everything after line 245 and added the following in its place:

with open("File_Name", "r") as sentence_file:
    while x and y:
        x = sentence_file.readline()
        y = sentence_file.readline()
        # boolean set to False or True
        similarity(x, y, True)
        x = y
        y = sentence_file.readline()

My text file is formatted like this:

Red alcoholic drink. Fresh orange juice. An English dictionary. The Yellow Wallpaper.

In the end I want to display every pair of consecutive sentences with its similarity next to it, like this:

["Red alcoholic drink.", "Fresh orange juice.", 0.611],

["Fresh orange juice.", "An English dictionary.", 0.0]

["An English dictionary.", "The Yellow Wallpaper.",  0.5]

if norm(vec_1) > 0 and norm(vec_2) > 0:
    return np.dot(vec_1, vec_2.T) / (np.linalg.norm(vec_1) * np.linalg.norm(vec_2))
else:
    ???Move On???

Solution

  • This should work. There are a few things to note in the comments. Basically, you can loop through the lines in the file and store the results as you go. One way to do this is to set up an "infinite loop" and check the most recently read line to see whether we've hit the end (readline() returns an empty string at the end of a file).

    # You'll probably need the file extension (.txt or whatever) in open as well
    with open("File_Name.txt", "r") as sentence_file:
        # Initialize a list to hold the results
        results = []

        # Read the first line; every later line is paired with the one before it
        x = sentence_file.readline().rstrip('\n')

        # Loop until we hit the end of the file
        while True:
            # Read the next line
            y = sentence_file.readline()

            # readline() returns '' at the end of the file, so we're done
            if not y:
                # Break out of the infinite loop
                break

            # The .rstrip('\n') removes the newline character from the line
            y = y.rstrip('\n')

            try:
                # Calculate your similarity value
                similarity_value = similarity(x, y, True)

                # Add the two lines and similarity value to the results list
                results.append([x, y, similarity_value])
            except Exception:
                print("Error when parsing lines:\n{}\n{}\n".format(x, y))

            # Carry the second line forward so the pairs are consecutive
            x = y

    # Loop through the pairs in the results list and print them
    for pair in results:
        print(pair)
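
The loop above assumes one sentence per line. The sample in the question looks like it puts several sentences on a single line, and if that is the case the text has to be split into sentences first. Here is a minimal sketch of that step; `split_sentences` and `consecutive_pairs` are hypothetical helper names, and the regex is a naive splitter that will mis-split abbreviations such as "e.g.".

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    # Drop empty strings (for example when the input is empty)
    return [part for part in parts if part]

def consecutive_pairs(sentences):
    """Pair each sentence with the one that follows it."""
    return list(zip(sentences, sentences[1:]))

# Sample text taken from the question
text = ("Red alcoholic drink. Fresh orange juice. "
        "An English dictionary. The Yellow Wallpaper.")

pairs = consecutive_pairs(split_sentences(text))
for first, second in pairs:
    print([first, second])
```

Each pair could then be passed to similarity() in place of the two lines read from the file.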
    

    Edit: Regarding the errors you're getting from similarity(): if you simply want to ignore the line pairs that cause them (without looking at the source in depth I really have no idea what's going on), you can wrap the call to similarity() in a try/except block.
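
If the errors turn out to come from dividing by a zero norm (an assumption on my part; the norm check in the question hints at it), another option is to guard the division and skip those pairs explicitly. The sketch below uses plain Python lists so it runs without NumPy; `safe_cosine` is a hypothetical name, and the same check would apply to the np.dot / np.linalg.norm version.

```python
import math

def safe_cosine(vec_1, vec_2):
    """Cosine similarity of two equal-length sequences.

    Returns None when either vector has zero norm, so the caller can skip it.
    """
    norm_1 = math.sqrt(sum(a * a for a in vec_1))
    norm_2 = math.sqrt(sum(b * b for b in vec_2))
    if norm_1 == 0 or norm_2 == 0:
        return None
    dot = sum(a * b for a, b in zip(vec_1, vec_2))
    return dot / (norm_1 * norm_2)

# Made-up vectors for illustration; the last one has zero norm
vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]

scores = []
for v1, v2 in zip(vectors, vectors[1:]):
    score = safe_cosine(v1, v2)
    if score is None:
        # "Move on": nothing to compare for this pair
        continue
    scores.append(score)

print(scores)
```

Note that a norm can never be negative, so the only case that needs handling is a norm of exactly zero.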