What's the easiest way to get a hash-function of a directory in Linux (preferably using shell scripting or Python)?
What I'm trying to do is find duplicate subtrees within a large tree of directories.
fdupes, meld, etc. tend to want the two trees to be largely isomorphic, i.e. given
A
└─ B
and
A
└─ C
   └─ B
they won't alert me that B is the same in both trees, because in the second tree it's under C.
So I'm guessing I need to write my own script to recurse down both trees and find hashes of all subtrees and then compare them.
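List all file paths in the dir (recursively), relative to the dir itself so that identical subtrees in different places produce the same listing, sort them (in case find returns them in a different order), hash the whole listing with sha1sum and print the hash:
find /my/dir -mindepth 1 -type f -printf '%P\0' | sort -z | sha1sum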
You can put that in a script, like:
#!/bin/bash
# hashtree-names.sh - hash a dir's structure by filenames
# (files with same names are considered identical)
# Usage: hashtree-names.sh <dirname>
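DIR=$1
# %P prints each path relative to $DIR (GNU find), so the hash does not depend on where the dir lives
find "$DIR" -mindepth 1 -type f -printf '%P\0' | sort -z | sha1sum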
And execute it on every dir under a large tree like so:
find /my/tree -mindepth 1 -type d -exec hashtree-names.sh {} \; | sort
Which will produce output similar to:
3cd8fea391f3055d9de3d6e05a422b6e97ce4204 *-
8cd93d83e9baeea479785fe0cc03c8b58aa293a3 *-
8cd93d83e9baeea479785fe0cc03c8b58aa293a3 *-
fe7dd981bb0d978608ba648eb3d38bb41f6cd956 *-
afc483808be60fbd48e716a7b916b5deaa9c78b5 *-
a518cfa27e7e9afbab2ba2209c80dbab0631736b *-
251f3cfc11eeccdfaf28142dadc5aa3aa4e2aec1 *-
251f3cfc11eeccdfaf28142dadc5aa3aa4e2aec1 *-
4a689e7c27733498c4ac5730f172c844cb6b21d1 *-
600a61b8c1a973aa6322ab4a7d57f7c07174e0ec *-
a401f27520252ae334625ca1b452396f0287f42d *-
e0b2d5f825f062d40f0f2490673888b5eb6c66fd *-
85a533625c5a38892d392f2ae9e7974e3eceaf6a *-
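The hashes alone don't say which directory they belong to, so it may help to print the directory name next to each hash and keep only the groups that repeat. The following is a rough sketch of that idea, not a polished tool: the function names hash_names and list_duplicate_dirs are made up for illustration, and it assumes GNU find, sort, uniq and sha1sum, plus directory names without newlines.

```shell
#!/bin/bash
# Hypothetical helpers (names are illustrative, not from any existing tool).

# Print the SHA-1 of the sorted, NUL-separated relative file paths under "$1".
hash_names() {
    (cd "$1" && find . -mindepth 1 -type f -print0 | LC_ALL=C sort -z | sha1sum | cut -d' ' -f1)
}

# Print "<hash>  <dir>" for every directory under "$1", keeping only
# directories whose hash occurs more than once; groups are separated
# by a blank line.
list_duplicate_dirs() {
    find "$1" -mindepth 1 -type d -print0 |
    while IFS= read -r -d '' dir; do
        printf '%s  %s\n' "$(hash_names "$dir")" "$dir"
    done | sort | uniq -w40 --all-repeated=separate
}

# Run only when a root directory is given on the command line.
if [ "$#" -ge 1 ]; then
    list_duplicate_dirs "$1"
fi
```

uniq -w40 compares only the first 40 characters of each line, i.e. the hex SHA-1. Note that all directories containing no files hash to the same value (the SHA-1 of empty input), so they will show up as one group of "duplicates".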
See Vatine's and David Schmitt's answers to Linux: compute a single hash for a given folder & contents?.
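If matching file names are not enough and the file contents should count as well, one possible variation is to hash each file and then hash the resulting list. This is only a sketch under assumptions: hash_contents is a made-up name, it relies on GNU xargs (-r) and sha1sum, and it ignores permissions, timestamps, symlink targets and empty directories.

```shell
#!/bin/bash
# Hypothetical helper (name is illustrative): print one SHA-1 that covers the
# relative path and the contents of every regular file under "$1".
hash_contents() {
    (
        cd "$1" &&
        find . -type f -print0 |
        LC_ALL=C sort -z |          # fixed order, independent of locale
        xargs -0 -r sha1sum |       # one "<hash>  ./relative/path" line per file
        sha1sum |                   # hash the whole list
        cut -d' ' -f1
    )
}

# Run only when a directory is given on the command line.
if [ "$#" -ge 1 ]; then
    hash_contents "$1"
fi
```

Because the paths are taken relative to the directory, two copies of the same subtree give the same hash wherever they sit in the larger tree. Reading every file is of course much slower than hashing names only.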
-mindepth 1
to find