I am reading *Python Object-Oriented Programming* by Steven Lott and Dusty Phillips.
I am at Chapter 10, which is all about the Iterator design pattern. At the end of the chapter there is an exercise to write a generator function/expression to find the common lines between two files.
Here are my naive approaches:
import os

old = os.path.normpath('E:/testing/old.txt')
new = os.path.normpath('E:/testing/new.txt')
res = []

def func(source):
    for line in source.readlines():
        yield line
Approach 1:

with open(old, "r") as f:
    for line in func(f):
        with open(new, "r") as _f:
            for _line in func(_f):
                if line == _line:
                    res.append(line)
Approach 2:

with open(old, "r") as f:
    for line in func(f):
        with open(new, "r") as _f:
            res.extend(filter(lambda a: line == a, func(_f)))
Context:

Now, consider a slightly different problem: suppose we have two lists of strings with n and m elements instead of files. We can find the common strings in O(n) time with O(m) space, by putting one list into a set and scanning the other.

My end goal is to optimize my code (if possible) in terms of time and space complexity, and to make it more Pythonic.
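For reference, the set-based idea mentioned above can be sketched like this (the two lists are hypothetical stand-ins for the files' lines):

```python
# Hypothetical lists standing in for the two files' lines.
old_lines = ["a", "b", "c", "b"]
new_lines = ["b", "c", "d"]

# Build a set from one list: O(m) time and space.
old_set = set(old_lines)

# Scan the other list once: O(n) time, with O(1) membership tests.
common = [line for line in new_lines if line in old_set]
print(common)  # ['b', 'c']
```

This keeps the order of the scanned list, unlike a plain set intersection.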
P.S. - Can someone point me to the source code of the Linux diff command and of git-diff? I tried finding both but was unable to.
I don't believe this problem is best solved using generators. My approach would be:
old = 'E:/testing/old.txt'
new = 'E:/testing/new.txt'

with open(old) as f:
    old_lines = set(f)  # A set of lines

with open(new) as f:
    matches = [line for line in f if line in old_lines]

print(matches)
If you want to compare lines stripped of the trailing newline character (since the last line may not have one), you could use the following code, which also demonstrates creating two sets and taking their intersection, so you do not have to loop through the lines explicitly. But by computing a set intersection you lose control over the order in which matches are printed; it is also less memory efficient and will not run any faster (perhaps even a bit slower) than the first solution. Finally, if either file contains duplicated lines, you will get a different, smaller set of matches.
old = 'E:/testing/old.txt'
new = 'E:/testing/new.txt'

with open(old) as f:
    old_lines = set(line.strip('\n') for line in f)  # A set of lines

with open(new) as f:
    matches = set(line.strip('\n') for line in f) & old_lines

print(matches)
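If you want newline-insensitive comparison but also the ordering of the first solution, the two ideas combine naturally; here is a sketch using in-memory `io.StringIO` objects as hypothetical stand-ins for the two files:

```python
import io

# Hypothetical in-memory "files" standing in for old.txt and new.txt.
old_file = io.StringIO("alpha\nbeta\ngamma\n")
new_file = io.StringIO("beta\ndelta\ngamma")  # last line has no newline

# Set of stripped lines from one file, then an ordered scan of the other.
old_lines = {line.strip('\n') for line in old_file}
matches = [line.strip('\n') for line in new_file
           if line.strip('\n') in old_lines]
print(matches)  # ['beta', 'gamma']
```

This preserves the order of matches as they appear in the second file while still ignoring trailing newlines.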