
Using Python, how do I compare many strings (in one file) for equality?


A .txt file contains more than 10k records, with exactly one checksum per line, and some checksums occur more than once. The goal is to write code that finds (1) the count of each duplicated checksum and (2) the line numbers on which each duplicated checksum occurs.

The result should look like: "4d2da647[..]": Total Counts 42 ; In Lines {5,21,432,3424, 11679, [...]} .. .

I do not have much coding experience yet, and I am not asking anyone to do all the work for me. But searching the internet, I did not find similar cases and do not know where to start.

I started with:

with open("file.txt", "r") as obj:
    lines_list = obj.readlines()  # call the method; without () this is only a reference to it

# compare lines on equality

# print out total count of duplicates and the lines where they occur

I would very much appreciate any guidance. Thank you.


Solution

  • What you are looking to do is basically create a Python dictionary (key-value pairs), with your checksums as the keys and their counts as the values.

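    checksum_dict = {}

    for line in lines_list:
        # strip the trailing newline so identical checksums always compare equal
        checksum = line.strip()
        if checksum in checksum_dict:
            checksum_dict[checksum] += 1
        else:
            checksum_dict[checksum] = 1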
    
    

    Now you have the count of all checksums in this dict and you can easily output the information you need from here.

    From your output example, you also need to store the line numbers, so instead of keeping a simple count you can store a list for every checksum and append each line number to that list.
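
The list-per-checksum idea could be sketched roughly as follows. This is only one possible way to do it, not a definitive implementation: the function name find_duplicates and the sample_lines data are made up for illustration, line numbers are assumed to start at 1, and blank lines are assumed to be ignorable.

```python
from collections import defaultdict


def find_duplicates(lines):
    """Map each checksum that occurs more than once to its 1-based line numbers."""
    occurrences = defaultdict(list)
    for line_number, line in enumerate(lines, start=1):
        checksum = line.strip()  # drop the trailing newline and surrounding spaces
        if checksum:  # assumption: blank lines are not records, so skip them
            occurrences[checksum].append(line_number)
    # keep only the checksums that appear on more than one line
    return {c: nums for c, nums in occurrences.items() if len(nums) > 1}


# Hypothetical sample data standing in for the lines of file.txt
sample_lines = ["aaa\n", "bbb\n", "aaa\n", "ccc\n", "bbb\n", "aaa\n"]
duplicates = find_duplicates(sample_lines)

for checksum, line_numbers in duplicates.items():
    # the total count is simply the length of the list of line numbers
    print(f'"{checksum}": Total Counts {len(line_numbers)} ; In Lines {line_numbers}')
```

With a real file, the open file object can be passed in place of sample_lines, since iterating over a file yields its lines one at a time. The count does not need to be stored separately because it equals the length of each list.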