Search code examples
pythoniteratorgenerator

Generator function to find common lines between two files


I am reading Python Object-oriented programming by Steven and Dusty.

I am at Chapter 10, and it's all about Iterator design pattern. At the end of chapter there is an exercise to write generator function/expression to find common lines between two files.

Here are my naive approaches:

old = os.path.normpath('E:/testing/old.txt')
new = os.path.normpath('E:/testing/new.txt')
res = []

def func(source):
    for line in source.readlines():
        yield line

Approach 1:

with open(old, "r") as f:
    for line in func(f):
        with open(new, "r") as _f:
            for _line in func(_f):
                if line == _line:
                    res.append(line)

Approach 2:

with open(old, "r") as f:
    for line in func(f):
        with open(new, "r") as _f:
            res.extend(filter(lambda a: line == a, func(_f)))

Context:

Now, let's think about totally different problem, let's assume we have list of strings instead of files with n and m elements, we can find common strings in O(n) time complexity with O(m) space.

Big questions

  1. Is there any better pythonic approach to re-factor above codes?
  2. I know generators are all about saving space and state, but following the above context, is there a better way in terms of time complexity to compare two files and find common lines.

My end goal is to optimize my code (if it is possible) in terms of time and space complexity and to make it more pythonic.

P.S. - Can someone point me to the source code of linux diff command and git-diff source code.

I tried finding source code of linux diff and git-diff but was unable to.


Solution

  • I don't believe this problem is best solved using generators. My approach would be:

    old = 'E:/testing/old.txt'
    new = 'E:/testing/new.txt'
    
    with open(old) as f:
        old_lines = set(f) # A set of lines
    
    with open(new) as f:
        matches = [line for line in f if line in old_lines]
    
    print(matches)
    

    If you want to compare lines stripped of the trailing newline character (since the last line may not have a trailing newline), then you could use the following code, which also demonstrates creating two sets and doing an "intersection" operation on them. In that way you do not have to explicitly loop through lines. But by creating two sets and computing their intersection you lose control over the order in which matches will be printed. It will also be less memory efficient and not run any faster (perhaps even a bit more slowly) compared with the first solution. Finally, if any of the files contain duplicated lines, you will compute a different, smaller set of changes.

    old = 'E:/testing/old.txt'
    new = 'E:/testing/new.txt'
    
    with open(old) as f:
        old_lines = set(line.strip('\n') for line in f) # A set of lines
    
    with open(new) as f:
        matches = set(line.strip('\n') for line in f) & old_lines
    
    print(matches)