
Comparing two .txt files: difflib tells me that a line is unique ('-') when in fact it is present in both files


I need help with the difflib module.

I'm using difflib (https://docs.python.org/3/library/difflib.html) to compare two txt files fetched from URLs, line by line, to find duplicated and missing lines. difflib flags with a '-' each line that is unique to one of the files. However, when I run the code, some lines are flagged with '-' even though they are present in both files (a flagged line should be present in only one of them, not both).

These are the two txt files I compare: https://sumo.media/ads_1.txt --- https://sumo.media/ads_2.txt

Does anyone know why this happens? I show a screenshot at the end with the output from difflib. Look at the line 'appnexus.com, 8610, DIRECT, f5ab79cb980f11d1' (which has a '-' at the beginning, telling me that it is unique to https://sumo.media/ads_1.txt). This is not true, because if I open both txt URLs I can see this line in both files.

What is strange is that it works if I analyze fewer lines, but it does not work with a lot of lines. I need to analyze a large number of lines, so I need a solution. Any ideas? Any alternative, maybe?

I also attach the code I run. I fetch both txt URLs with requests and assign each response to a variable. Then I call splitlines(), which returns a list with one string per line. That gives me two lists, one for each txt. Finally I compare these two lists to see which lines are duplicated or missing:

import difflib
import requests

adstxt_1 = requests.get('http://www.sumo.media/ads_1.txt').text
adstxt_2 = requests.get('http://www.sumo.media/ads_2.txt').text

a = adstxt_1.splitlines()    # one string per line of ads_1.txt
b = adstxt_2.splitlines()    # one string per line of ads_2.txt

# Differ prefixes every output line with '- ', '+ ', '  ' or '? '
differ = difflib.Differ()
diffs = list(differ.compare(a, b))
for c in diffs:
    print(c)

What the code tells me (this line, for example, starts with '-', which should mean it is unique to ads_1.txt): python output

... but I see this same line in both .txt files: /ads_1.txt --- /ads_2.txt

Appreciate any help!


Solution

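  • A diff doesn't check whether a line is unique across the whole file. It aligns the two files in order, so a line that is present in both files but at a different position can be reported as removed ('-') in one place and added ('+') in another. If you want to keep using a diff, you should sort the lines first.

    But if you only want to check whether lines exist in both files or are unique to one file, then it is better to convert the lists to set() and compare the sets.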


    Minimal working code

    a = ['A', 'B', 'C']
    b = ['A', 'C', 'D']
    
    print('a:', a)
    print('b:', b)
    
    set_a = set(a)
    set_b = set(b)
    
    print('--- duplicated ---')
    
    duplicated = set_a & set_b
    
    for item in sorted(duplicated):
        print(item)
        
    print('--- unique a ---')
    
    unique_a = set_a - set_b
    
    for item in sorted(unique_a):
        print(item)
    
    print('--- unique b ---')
    
    unique_b = set_b - set_a
    
    for item in sorted(unique_b):
        print(item)
    

    Result

    a: ['A', 'B', 'C']
    b: ['A', 'C', 'D']
    --- duplicated ---
    A
    C
    --- unique a ---
    B
    --- unique b ---
    D
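
To connect this back to the question, here is a small sketch that puts both behaviours side by side: difflib reporting a moved line as '-' and '+', and the set comparison treating the same line as shared. Only the appnexus.com entry comes from the question; the other two entries, the helper name compare_as_sets and the whitespace stripping are assumptions made for illustration, not part of the original code.

```python
import difflib


def compare_as_sets(a_lines, b_lines):
    """Return (common, only_a, only_b) as sorted lists, ignoring line order.

    Stripping whitespace and skipping empty lines is an assumption made here;
    drop it if exact, character-for-character matches are required.
    """
    set_a = {line.strip() for line in a_lines if line.strip()}
    set_b = {line.strip() for line in b_lines if line.strip()}
    return sorted(set_a & set_b), sorted(set_a - set_b), sorted(set_b - set_a)


# Same three lines in both lists, but the first line of `a` is last in `b`.
# The example.com / example.org entries are hypothetical sample data.
a = [
    'appnexus.com, 8610, DIRECT, f5ab79cb980f11d1',
    'example.com, 1, DIRECT',
    'example.org, 2, RESELLER',
]
b = [
    'example.com, 1, DIRECT',
    'example.org, 2, RESELLER',
    'appnexus.com, 8610, DIRECT, f5ab79cb980f11d1',
]

# Order-sensitive comparison: the moved line shows up once with '- '
# and once with '+ ', even though it exists in both lists.
diffs = list(difflib.Differ().compare(a, b))
for d in diffs:
    print(d)

# Lines flagged '- ' whose text is also flagged '+ ' were moved, not removed.
moved = [d[2:] for d in diffs if d.startswith('- ') and '+ ' + d[2:] in diffs]
print('moved:', moved)

# Order-insensitive comparison: every line is reported as common.
common, only_a, only_b = compare_as_sets(a, b)
print('common:', common)
print('only in a:', only_a)
print('only in b:', only_b)
```

With the real files, a and b would be the two lists produced by splitlines() in the question. Note that sets also collapse repeated lines within one file, so this approach answers "is the line present?" rather than "how many times?".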