Search code examples
pythonlinuxshelldirectory-structurehash-function

How can I create a hash of a directory in Linux in Shell or Python?


What's the easiest way to get a hash-function of a directory in Linux (preferably using shell scripting or Python)?

What I'm trying to do is find duplicate subtrees within a large tree of directories.

fdupes and meld etc. tend to want the two trees to be largely isomorphic, ie. given

A
└─ B

and

A
└─ C
   └─ B

they won't alert me if B is the same in both trees because in the second tree its under C.

So I'm guessing I need to write my own script to recurse down both trees and find hashes of all subtrees and then compare them.


Solution

  • Hash a directory's structure using filenames only

    List all filepaths in the dir (recursively), sort them (in case find messes up), hash it all with sha1sum and print the hash:

    find /my/dir -mindepth 1 -type f -print0 | sort -z | sha1sum
    

    You can put that in a script, like:

    #!/bin/bash
    # hashtree-names.sh - hash a dir's structure by filenames
    # (files with same names are considered identical)
    # Usage: hashtree-names.sh <dirname>
    DIR=$1
    find $DIR -mindepth 1 -type f -print0 | sort -z | sha1sum
    

    And execute it on every dir under a large tree like so:

    find /my/tree -mindepth 1 -type d -exec hashtree-names.sh {} \; | sort
    

    Which will produce output similar to:

    3cd8fea391f3055d9de3d6e05a422b6e97ce4204 *-
    8cd93d83e9baeea479785fe0cc03c8b58aa293a3 *-
    8cd93d83e9baeea479785fe0cc03c8b58aa293a3 *-
    fe7dd981bb0d978608ba648eb3d38bb41f6cd956 *-
    afc483808be60fbd48e716a7b916b5deaa9c78b5 *-
    a518cfa27e7e9afbab2ba2209c80dbab0631736b *-
    251f3cfc11eeccdfaf28142dadc5aa3aa4e2aec1 *-
    251f3cfc11eeccdfaf28142dadc5aa3aa4e2aec1 *-
    4a689e7c27733498c4ac5730f172c844cb6b21d1 *-
    600a61b8c1a973aa6322ab4a7d57f7c07174e0ec *-
    a401f27520252ae334625ca1b452396f0287f42d *-
    e0b2d5f825f062d40f0f2490673888b5eb6c66fd *-
    85a533625c5a38892d392f2ae9e7974e3eceaf6a *-
    

    Hash a directory's structure, complete with file contents

    See Vatine's and David Schmitt's answers to Linux: compute a single hash for a given folder & contents?.

    EDIT 2017-01-27

    • Code improvements: Added -mindepth 1 to find