Search code examples
pythoncsvdictionarycomparedefaultdict

How can I compare 2 CSV files, check if the values of the second column match and count the number of occurrences for each value when they match?


I would like to iterate through 2 CSV files, check when the values in both files match and count how many occurrences for each value take place when they match. The output should be a dictionary.

So I have two CSV files that are aligned. Each have 2 columns: "WORD" and "POS" (Part of speech tag). Click to see example of file 1 Click to see example of file 2

In some cases, each word has been labelled in the same way for both files, but in many other cases it hasn't. I would like to count the amount of times its been labelled the same way for both files.

For example, if file1 has WORD "human" and POS "PERS", and file2 also has WORD "human" and POS "PERS", I would like the output to be: {PERS: 2} This means PERS matched twice in both files. I would like this for each tag: {TAG1: n times it appears and matches both, TAG2: times it appears and matches both, etc }

I was only able to figure out how to read one CSV file and count the number of times each POS tag is used using this code:

import csv 
from collections import defaultdict


def count_NER_tags(filename): 
    """
    Obtains the counts of each tag for the determined csv file  

    """

    dict_NER_counts = defaultdict(int) 

    with open(filename, "r") as csvfile:
        read_csv = csv.reader(csvfile, delimiter="\t")
        next(read_csv) #skip the header 
        for row in read_csv:
             dict_NER_counts[row[2]] += 1

        return dict_NER_counts


output: 
{'O': 42123, 'ORG': 2092, 'LOC': 2094, 'MISC': 1268, 'PERS': 3145}

I don't know how to implement the "if POS in file1 == POS in file2" after reading both CSV files and then add to dictionary with their counts as exemplified in the code above.


Solution

  • I find it a little strange that when the same WORD has the same POS in both files you're calling it two matches instead of one — seems to me like that's would be only one match.

    Whatever...I think the following will do what you want (if I've understood what you want to do correctly).

    import csv
    from collections import defaultdict
    
    def count_tag_matches(filename1, filename2):
        """
        Counts the number of tags that had the same value in both CSV files.
        """
        dict_counts = defaultdict(int)
    
        with open(filename1, "r", newline='') as csvfile1, \
             open(filename2, "r", newline='') as csvfile2:
    
            reader1 = csv.DictReader(csvfile1, delimiter="\t")
            reader2 = csv.DictReader(csvfile2, delimiter="\t")
    
            for row1, row2 in zip(reader1, reader2):
                 if row1['POS'] == row2['POS']:
                     dict_counts[row1['POS']] += 2
    
        return dict(dict_counts)  # Return a regular dictionary.
    
    counts = count_tag_matches('cmp_file1.csv', 'cmp_file2.csv')
    print(counts)
    

    Output from processing the sample files:

    {'A': 2, 'O': 2, 'PERS': 2}