Tags: python, nlp, spacy, cosine-similarity

Python compute cosine similarity on two directories of files


I have two directories of files. One contains human-transcribed files and the other contains IBM Watson transcribed files. Both directories have the same number of files, and both were transcribed from the same telephony recordings.

I want to compute cosine similarity between each pair of matching files using spaCy's .similarity and print or store the result along with the compared file names. I have attempted using a function as well as for loops, but I cannot find a way to iterate over both directories at once, compare the two files at a matching index, and print the result.

Here's my current code:

# iterate through files in both directories
for human_file, api_file in os.listdir(human_directory), os.listdir(api_directory):
    # set the documents to be compared and parse them through the small spacy nlp model
    human_model = nlp_small(open(human_file).read())
    api_model = nlp_small(open(api_file).read())
    
    # print similarity score with the names of the compared files
    print("Similarity using small model:", human_file, api_file, human_model.similarity(api_model))

I've gotten this to work when iterating through just one directory, and verified the expected output by printing the file names, but it doesn't work when using both directories. I've also tried something like this:

# define directories
human_directory = os.listdir("./00_data/Human Transcripts")
api_directory = os.listdir("./00_data/Watson Scripts")

# function for cosine similarity of files in two directories using small model
def nlp_small(human_directory, api_directory):
    for i in (0, (len(human_directory) - 1)):
        print(human_directory[i], api_directory[i])

nlp_small(human_directory, api_directory)

Which returns:

human_10.txt watson_10.csv
human_9.txt watson_9.csv

But that's only two of the files, not all 17 of them.

Any pointers on iterating through a matching index on both directories would be much appreciated.

EDIT: Thanks to @kevinjiang, here's the working code block:

# set the directories containing transcripts
# (join the path components separately so the backslashes in the folder
# names can't be misread as escape sequences)
human_directory = os.path.join(os.getcwd(), "00_data", "Human Transcripts")
api_directory = os.path.join(os.getcwd(), "00_data", "Watson Scripts")

# iterate through files in both directories in parallel
for human_file, api_file in zip(os.listdir(human_directory), os.listdir(api_directory)):
    # parse each pair of documents with the small spaCy model
    human_model = nlp_small(open(os.path.join(human_directory, human_file)).read())
    api_model = nlp_small(open(os.path.join(api_directory, api_file)).read())

    # print similarity score with the names of the compared files
    print("Similarity using small model:", human_file, api_file, human_model.similarity(api_model))

And here's most of the output (I still need to fix a stray UTF-16 character in one of the files, which halts the loop):

nlp_small = spacy.load('en_core_web_sm')
Similarity using small model: human_10.txt watson_10.csv 0.9274665883462793
Similarity using small model: human_11.txt watson_11.csv 0.9348740684005554
Similarity using small model: human_12.txt watson_12.csv 0.9362025469343344
Similarity using small model: human_13.txt watson_13.csv 0.9557355330988958
Similarity using small model: human_14.txt watson_14.csv 0.9088701120190216
Similarity using small model: human_15.txt watson_15.csv 0.9479464053189846
Similarity using small model: human_16.txt watson_16.csv 0.9599724037676819
Similarity using small model: human_17.txt watson_17.csv 0.9367605599306302
Similarity using small model: human_18.txt watson_18.csv 0.8760760037870665
Similarity using small model: human_2.txt watson_2.csv 0.9184563762823503
Similarity using small model: human_3.txt watson_3.csv 0.9287452822270265
Similarity using small model: human_4.txt watson_4.csv 0.9415664367046419
Similarity using small model: human_5.txt watson_5.csv 0.9158895909429551
Similarity using small model: human_6.txt watson_6.csv 0.935313240861153

After I've fixed the character encoding bug, I'll wrap this in a function so that I can call the large or small model on two directories for the remaining APIs I have to test.
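For anyone heading the same way, a sketch of that wrapper might look like the following. The compare_directories name is my own, and the errors="ignore" workaround for the bad character is an assumption I haven't tested against the actual files; passing the loaded pipeline in as a parameter is just one way to switch between the small and large models.

```python
import os

def compare_directories(nlp, human_directory, api_directory, label="model"):
    """Yield (human_file, api_file, similarity) for each matched pair of files.

    `nlp` is any spaCy-style pipeline: calling it on a string returns a Doc
    with a .similarity() method. Hypothetical helper, not from the post.
    """
    # sort both listings so files pair up by name rather than OS listing order
    pairs = zip(sorted(os.listdir(human_directory)), sorted(os.listdir(api_directory)))
    for human_file, api_file in pairs:
        # errors="ignore" drops undecodable bytes instead of raising,
        # which sidesteps the stray character that halts the loop
        with open(os.path.join(human_directory, human_file), errors="ignore") as f:
            human_doc = nlp(f.read())
        with open(os.path.join(api_directory, api_file), errors="ignore") as f:
            api_doc = nlp(f.read())
        score = human_doc.similarity(api_doc)
        print(f"Similarity using {label}:", human_file, api_file, score)
        yield human_file, api_file, score
```

It could then be called with either pipeline, e.g. `list(compare_directories(spacy.load("en_core_web_lg"), human_directory, api_directory, label="large model"))`.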


Solution

  • Two minor errors are preventing you from looping through. In your second example, the for loop iterates over the two-element tuple (0, len(human_directory) - 1), so it only visits the first and last indices; that's why only two files print. Instead, use for i in range(len(human_directory)):, which visits every index.

    In the first example, the comma in os.listdir(human_directory), os.listdir(api_directory) builds a tuple of two whole lists, so unpacking it into human_file, api_file raises a "too many values to unpack" ValueError. To loop through two iterables concurrently, use zip(), so it should look like

    for human_file, api_file in zip(os.listdir(human_directory), os.listdir(api_directory)):
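    A minimal illustration of both fixes, using plain lists in place of the directory listings:

    ```python
    human_files = ["human_1.txt", "human_2.txt", "human_3.txt"]
    api_files = ["watson_1.csv", "watson_2.csv", "watson_3.csv"]

    # bug: (0, len(...) - 1) is a two-element tuple, so only two indices are visited
    print([i for i in (0, len(human_files) - 1)])   # [0, 2]

    # fix: range() visits every index
    print([i for i in range(len(human_files))])     # [0, 1, 2]

    # zip() pairs the two lists element by element, no index needed
    for human_file, api_file in zip(human_files, api_files):
        print(human_file, api_file)
    ```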