python algorithm optimization

How can I optimize a search over two sets of files to merge them?


I've got a bunch of .txt files with alphabetically-sorted file names:

aaa.txt
aab.txt
aac.txt
.
.
.
zzz.txt

And another set of .txt files stored in a different location:

ant.txt
bat.txt
cat.txt
lion.txt
...

I want to take the text of each .txt file in the second group and append it to the file with the same name in the first group. (A file in the second group might not exist in the first group.)
For example, I want to take the contents of ant.txt from the second group and append that to ant.txt in the first group.

How can I do this efficiently?
The obvious way is:

for source_name in second_group:
    for target_name in first_group:
        if source_name == target_name:
            pass  # append the source file's text to the target file

But this seems really inefficient. A human trying to find cat.txt in the first group wouldn't start searching from aaa.txt; they'd jump straight to the files that start with ca.

I imagine one way to optimize this is to remove cat.txt from the search once it's been updated, possibly by storing the updated file in a third directory, and deleting cat.txt from the first group(?)

If it matters, I'm using Python.


Solution

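  • You don't need to search the first group at all: build the expected path from the file name and check whether it exists. Use os.listdir to list the second group and os.path.isfile to test each path.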

    Then use open(filename, 'r') to read the content of a file and open(filename, 'a') to append to the end of a file.

    import os
    import sys
    
    def append_second_group_to_first_group(target_folder_path, source_folder_path):
        """Append each source file to the same-named file in the target folder."""
        for filename in os.listdir(source_folder_path):
            source_file_path = os.path.join(source_folder_path, filename)
            if os.path.isfile(source_file_path):
                # Build the target path directly instead of searching for it.
                target_file_path = os.path.join(target_folder_path, filename)
                # Skip source files that have no counterpart in the target folder.
                if os.path.isfile(target_file_path):
                    with open(target_file_path, 'a') as out_f:
                        with open(source_file_path, 'r') as in_f:
                            out_f.writelines(in_f)
    
    if __name__ == '__main__':
        # Usage: python3 appendfiles.py TARGET_FOLDER SOURCE_FOLDER
        target, source = sys.argv[1:3]
        append_second_group_to_first_group(target, source)
    

    Testing with two folders 1/ and 2/ containing a few text files:

    Before

    $ for f in 1/*; do echo "$f"; cat "$f"; echo; done
    1/aaa.txt
    aaa
    
    1/ant.txt
    ant
    
    1/bbb.txt
    bbb
    
    1/cat.txt
    cat
    
    $ for f in 2/*; do echo "$f"; cat "$f"; echo; done
    2/ant.txt
    Ants are so cool
    
    2/bat.txt
    Bats fly under the radar
    
    2/cat.txt
    Cats are sleeping in my armchair
    

    After

    $ python3 appendfiles.py 1 2
    $ for f in 1/*; do echo "$f"; cat "$f"; echo; done
    1/aaa.txt
    aaa
    
    1/ant.txt
    ant
    Ants are so cool
    
    1/bbb.txt
    bbb
    
    1/cat.txt
    cat
    Cats are sleeping in my armchair
    

    Note that the content of bat.txt was not copied, because there is no bat.txt file in folder 1/. Your question doesn't make clear whether you want bat.txt to be copied in this situation. If you do want bat.txt to be copied, remove the line if os.path.isfile(target_file_path): from the code above and dedent the with block beneath it. Opening a file in 'a' mode creates it if it does not exist yet.
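
Both behaviours can also be put behind a flag. The following is only a sketch under stated assumptions: the function name append_matching_files and the create_missing parameter are made up for illustration and are not part of the code above. It lists the first group once with os.scandir and keeps the names in a set, so each lookup takes constant time on average and no inner loop over the first group is needed. It also swaps writelines for shutil.copyfileobj.

```python
import os
import shutil


def append_matching_files(target_dir, source_dir, create_missing=False):
    """Append each file in source_dir to the same-named file in target_dir.

    Hypothetical variant for illustration. When create_missing is True,
    source files with no counterpart are created in target_dir.
    Returns the sorted list of file names that were written to.
    """
    # List the first group once; set membership is O(1) on average.
    with os.scandir(target_dir) as entries:
        target_names = {entry.name for entry in entries if entry.is_file()}

    # Only regular files in the second group are considered.
    with os.scandir(source_dir) as entries:
        source_names = sorted(entry.name for entry in entries if entry.is_file())

    written = []
    for name in source_names:
        if name not in target_names and not create_missing:
            continue  # no counterpart in the first group
        source_path = os.path.join(source_dir, name)
        target_path = os.path.join(target_dir, name)
        # Mode 'a' appends, and creates the file if it does not exist yet.
        with open(source_path, 'r') as in_f, open(target_path, 'a') as out_f:
            shutil.copyfileobj(in_f, out_f)
        written.append(name)
    return written
```

Starting from the Before state shown above, append_matching_files('1', '2') would write to ant.txt and cat.txt only, while append_matching_files('1', '2', create_missing=True) would also create 1/bat.txt. Whether the set is faster in practice than one os.path.isfile call per source file depends on the sizes of the two folders, so treat it as an option to measure rather than a guaranteed win.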