Tags: bash, md5, uuid, xargs, gnu-parallel

How can I copy badly named files from a deeply nested directory into a flat sub-directory, renaming them by content address, in a Bash one-liner?


How do I copy files scattered throughout a deeply nested sub-directory into another sub-directory that is not nested at all (i.e., is flat)? To make it harder, I have these constraints and wrinkles:

  1. Though the source files have the same extension (*.xlsx), they have spaces in the filenames.
  2. The source sub-directory and all its contents are read-only.
  3. Because of potential name collisions and the lousy filenames, and because I have a herd of these files whose original names are useless to me, I want to content-address them somehow.
  4. The scripting environment is Bash.
  5. Because of other constraints, it's important to do this in one line.
  6. Extra points for simplicity, because the more esoteric it is, the less likely my colleagues are to grok it.

I've tried cp, find, xargs, parallel, uuidgen, md5sum, Bash for loops, and various combinations thereof with limited success. The best I've been able to achieve is generating a random UUID for each file. That's OK, I guess, but it's not exactly the "content-addressing" I'd like, because I'd like to de-dupe the files based on their content.

For reference, that looks like this, where source and dest are the source and destination sub-directories.

find source/* -type f -exec sh -c 'for f; do cp "$f" 'dest'/"$(uuidgen)"; done' Renamer {} +

Though UUIDs are nice, I don't have my heart set on them and am open to other ideas, modulo the constraints above.

Thanks!


Solution

  • Use md5sum to compute the MD5 hash of each file's content, and use that hash as the destination filename:

    find source -type f -exec sh -c 'for f; do cp "$f" dest/"$(md5sum "$f" | sed -e "s/[[:space:]].*//")"; done' _ {} +
    

    This uses sed to strip the filename from md5sum's output, rather than the usual md5sum <file> | awk '{print $1}', so that I don't have to think about escaping quotes inside the single-quoted sh -c script.

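    Files with identical content get the same name, so duplicates collapse into a single copy in dest. Of course, you might have hash collisions with MD5 (accidental ones are extremely unlikely, but MD5 is not collision-resistant against deliberately crafted files), and you can easily switch the hashing to use sha256sum or whatever you like.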
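
As a rough sketch of that swap, here is a self-contained demo that builds a throwaway tree and runs a sha256sum variant of the one-liner. Everything outside the find line is made up for illustration: the temp directory, the sample filenames, and their contents. The variant also makes a few choices that are assumptions rather than part of the answer above: it limits the search to *.xlsx, keeps the .xlsx suffix on the copies, feeds each file to sha256sum on stdin and takes the first field with cut, and uses cp -f so that a duplicate can replace a copy that inherited read-only permissions from the source. It assumes sha256sum (GNU coreutils) is installed.

```shell
#!/usr/bin/env bash
# Throwaway demo tree; all directory and file names here are hypothetical.
demo=$(mktemp -d)
mkdir -p "$demo/source/a/b c" "$demo/dest"
printf 'one\n' > "$demo/source/a/first file.xlsx"
printf 'one\n' > "$demo/source/a/b c/copy of first.xlsx"   # same content as above
printf 'two\n' > "$demo/source/a/b c/second file.xlsx"
chmod -R a-w "$demo/source"                                 # mimic the read-only source

cd "$demo" || exit 1

# The one-liner: copy every .xlsx under source/ to dest/<sha256 of content>.xlsx.
# Reading the file on stdin makes sha256sum print "<hash>  -", so cut keeps only the hash.
find source -type f -name '*.xlsx' -exec sh -c 'for f; do cp -f "$f" dest/"$(sha256sum < "$f" | cut -d" " -f1)".xlsx; done' _ {} +

# List the content-addressed copies.
ls dest
```

Because two of the three sample files have identical content, dest should end up with two files rather than three, which is the de-duplication the question asks for.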