Tags: python, copy, checksum

Copy a file with checksum


I created a function that copies a file from dir A to B and compares both checksums before removing A.

Now that I've reinvented the wheel, I want to know how I could have done this better, instead of implementing a new safe_copy() with shutil and hashlib.

  • Are there already libraries that do this in Python?
  • Are there already Windows built-ins?
  • Is anything built into Anaconda?

Info:

  • I can't install 3rd party code, I am working on an offline server.
  • Performance is not an issue
  • Paths to files I must copy are given in a pandas DataFrame(origin, destination)

This question is not about performance per se (though that's always a good point to bring up); it's about code reuse in general.


Solution

  • For any starting programmer, it definitely makes sense to dive deeply into something that interests you - if, in your case, that's file management, that's fine of course. Just keep in mind that Python is not at all an optimal language for something that ultimately relies heavily on performance. A language like C++ or Rust might make more sense to learn if that's your passion.

    If you do want to continue developing this in Python regardless, you should definitely read through the standard modules os, shutil, pathlib and hashlib. The program you described could be as simple as:

    from pathlib import Path
    from shutil import copyfile
    from hashlib import md5
    from os import remove
    
    
    def file_md5(fname):
        chunk_size = 16384  # arbitrary
        md5_hash = md5()
        with open(fname, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                md5_hash.update(chunk)
        return md5_hash.hexdigest()
    
    
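    # Raw strings keep '\t', '\a' and '\b' from being read as escape sequences
    a = r'C:\temp\a.txt'
    b = r'C:\temp\b.txt'
    if Path(b).is_file():
        # Never overwrite an existing destination
        print('that file already exists!')
        raise SystemExit(1)
    else:
        copyfile(a, b)
    
    # Only delete the original once both checksums match
    if file_md5(a) != file_md5(b):
        print('something is not the same')
    else:
        remove(a)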
    

    (Obviously, don't just run this script if you have an actual C:\temp\a.txt file: it will be deleted after a successful copy.)
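The same copy, verify, delete flow can also be wrapped in a reusable function. The following is only a minimal sketch using the standard library: the names `file_digest` and `safe_move`, the SHA-256 default, and the choice to delete the bad copy on a mismatch are assumptions made for illustration, not something taken from the original script.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large files are never loaded whole."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def safe_move(src, dst, algorithm="sha256"):
    """Copy src to dst, verify the checksums match, then delete src.

    Raises FileExistsError if dst already exists, and OSError if the
    checksums differ (the bad copy is removed, src is left untouched).
    """
    src, dst = Path(src), Path(dst)
    if dst.exists():
        raise FileExistsError(f"destination already exists: {dst}")
    shutil.copy2(src, dst)  # copy2 also tries to preserve timestamps
    if file_digest(src, algorithm) != file_digest(dst, algorithm):
        dst.unlink()  # drop the bad copy, keep the original
        raise OSError(f"checksum mismatch after copying {src} to {dst}")
    src.unlink()  # copy verified, so the original can go
    return dst


if __name__ == "__main__":
    # Demo in a throwaway directory so no real files are touched
    with tempfile.TemporaryDirectory() as tmp:
        a = Path(tmp) / "a.txt"
        b = Path(tmp) / "b.txt"
        a.write_text("hello")
        safe_move(a, b)
        print(a.exists(), b.read_text())  # prints: False hello
```

For the DataFrame mentioned in the question, and assuming it has exactly the two columns origin and destination in that order, one option would be to loop over `df.itertuples(index=False, name=None)` and call `safe_move(origin, destination)` for each pair, catching `OSError` to collect the failures.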

    There exist thousands of file management utilities that have been developed for decades and are highly optimised, for speed or for very specific functions. In almost any real-world project, it would make much more sense to combine or package several of those and script them together using a batch language (or perhaps Python) than to rewrite them from the ground up.

    Rewriting can make sense to learn more about how they work internally, but you'll likely find yourself abandoning the work once you understand them. Another reason to rewrite could be because you have a clever idea on how to do it better, but that's where other languages are almost guaranteed to outperform Python.

    Follow-up to comment: there's no single utility in Windows that does a 'safe-copy' in one go, that I'm aware of. I think that's mainly because you can pretty much rely on utilities like robocopy (standard Windows) to fail if there's a problem and to rest assured your copy is good if it completes without error.

    However, I can appreciate wanting to be extra sure, so it would be fairly simple to string something like robocopy together with cmdlets like Get-FileHash from PowerShell. PowerShell is a standard part of Windows as well, and writing a .ps1 script is not much more complicated than writing a batch file. A simple "copy this file, get and compare the file hashes, and remove the appropriate file based on the outcome" PowerShell script would be only a few lines, no installation required.