I need help with the difflib module.
I'm using difflib (https://docs.python.org/3/library/difflib.html) to compare two txt files fetched from URLs, line by line, and find duplicated and missing lines. difflib flags with a '-' each line that is unique to one of those files. However, when I run the code in Python, some lines are flagged with '-' even though they are present in both files (they shouldn't be flagged, since the flag should mean the line is present in only one of the files, not both).
These are the two txt files I compare: https://sumo.media/ads_1.txt --- https://sumo.media/ads_2.txt
Does anyone know why this happens? There is a screenshot at the end with the output from difflib. Look at the line 'appnexus.com, 8610, DIRECT, f5ab79cb980f11d1' (which has a '-' at the beginning, telling me that it's unique to https://sumo.media/ads_1.txt). This is not true, because if I open both txt URLs, I can see this line in both files.
What is strange is that it works if I analyze fewer lines, but not with a lot of lines. I need to analyze a large number of lines, so I need a solution. Any ideas? Any alternatives, maybe?
I also attach the code I run. I fetch both txt URLs with requests and assign each one to a variable. Then I call splitlines() on each, which returns a list with one string per line. That gives me two lists, one for each txt file. Finally, I compare these two lists to see which lines are duplicated or missing:
import difflib
import requests

# Download both files as text
adstxt_1 = requests.get('http://www.sumo.media/ads_1.txt').text
adstxt_2 = requests.get('http://www.sumo.media/ads_2.txt').text
a = adstxt_1.splitlines()  # split line by line
b = adstxt_2.splitlines()  # split line by line
differ = difflib.Differ()
diffs = list(differ.compare(a, b))
for c in diffs:
    print(c)
What the code tells me (this line, for example, starts with '-', which should mean it is unique to ads_1.txt): python output
... but I see this same line in both .txt files: /ads_1.txt --- /ads_2.txt
Appreciate any help!
diff doesn't check whether a line is unique in the whole file, only whether it appears in the same place (in the same order) in the other file, so you should sort the lines first.
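To illustrate the point about position, here is a small sketch with made-up lists (not the real ads.txt data). 'C' is present in both lists, but because it sits in a different place, Differ reports it once with '-' and once with '+'. After sorting both lists, the same line is reported as unchanged.

```python
import difflib

# Made-up sample data: both lists hold the same lines, in a different order
a = ['A', 'B', 'C']
b = ['C', 'A', 'B']

differ = difflib.Differ()

# Comparing as-is: 'C' exists in both lists but is flagged '-' and '+'
unsorted_diff = list(differ.compare(a, b))
print(unsorted_diff)  # ['+ C', '  A', '  B', '- C']

# Comparing sorted copies: every line lines up, so nothing is flagged
sorted_diff = list(differ.compare(sorted(a), sorted(b)))
print(sorted_diff)    # ['  A', '  B', '  C']
```

Note that sorting changes the line order in the output, so this only helps when the order of the lines in the files doesn't matter.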
But if you want to check whether lines exist in both files or are unique to one file, it is better to convert them to set() and compare the sets.
Minimal working code
a = ['A', 'B', 'C']
b = ['A', 'C', 'D']

print('a:', a)
print('b:', b)

# Sets ignore order and repeated lines
set_a = set(a)
set_b = set(b)

print('--- duplicated ---')
duplicated = set_a & set_b  # lines present in both
for item in sorted(duplicated):
    print(item)

print('--- unique a ---')
unique_a = set_a - set_b  # lines only in a
for item in sorted(unique_a):
    print(item)

print('--- unique b ---')
unique_b = set_b - set_a  # lines only in b
for item in sorted(unique_b):
    print(item)
Result
a: ['A', 'B', 'C']
b: ['A', 'C', 'D']
--- duplicated ---
A
C
--- unique a ---
B
--- unique b ---
D
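The same idea could be applied to the text of the two downloaded files. The sketch below is only one possible way to do it: compare_lines is a hypothetical helper name, and stripping whitespace and skipping blank lines is my assumption (lines that differ only by trailing spaces would otherwise count as different). The sample strings are made up, apart from the appnexus line quoted in the question.

```python
def compare_lines(text_a, text_b):
    """Hypothetical helper: return (in both, only in a, only in b) as sets."""
    # Assumption: surrounding whitespace and blank lines are not significant
    lines_a = {line.strip() for line in text_a.splitlines() if line.strip()}
    lines_b = {line.strip() for line in text_b.splitlines() if line.strip()}
    return lines_a & lines_b, lines_a - lines_b, lines_b - lines_a


# Made-up sample texts; with real data these would be the .text of each response
text_1 = "appnexus.com, 8610, DIRECT, f5ab79cb980f11d1\nexample.com, 1, DIRECT\n"
text_2 = "example.org, 2, RESELLER\n\nappnexus.com, 8610, DIRECT, f5ab79cb980f11d1  \n"

common, only_1, only_2 = compare_lines(text_1, text_2)

print('--- duplicated ---')
for item in sorted(common):
    print(item)

print('--- unique 1 ---')
for item in sorted(only_1):
    print(item)

print('--- unique 2 ---')
for item in sorted(only_2):
    print(item)
```

Because sets drop repeated lines, this does not show whether a line occurs more than once inside a single file.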