
Format xargs output to grep


I have a script that I'm trying to optimize with xargs. The current version uses find with -exec to call the command:

find -type f -iname "*.mp4" -print0 -printf '\n' -exec getfattr -d --absolute-names {} \;

after which I can pipe to grep with something like:

grep -z -P user\.md5\=\"$input_search_hash\"

to filter the results while keeping the whole output with -z.

I need the whole output returned from getfattr to be "preserved", per file, because I need the filename for which there is a matching extended attribute, which is then passed to sed to extract it. There are also cases where I have multiple grep commands in sequence if I need to search for files with multiple matches in the extended attributes. The problem is that the output of:

find -type f -iname "*.mp4" -print0 | xargs -0 getfattr -d --absolute-names

is not formatted in such a way that grep will filter in this way. This does work with the -exec method. Can I pass an additional option to xargs or pipe in some additional command that will format the output to make grep properly replicate the behaviour of -exec? I'm guessing I need some sort of line-break before feeding to grep like what -printf '\n' does in the -exec method. I would just use getfattr to "search" the extended attributes instead of needing to grep the output at all, but it has no way to do this by supplying an xattr name and value.

Example

The input comes from the find command, which is a list of video files in an arbitrary directory structure. The output of each getfattr command, for each file, looks like this:

# file: /path/to/file/test.mp4
user.md5="0e29a7f555af518872771689e28d998d"
user.quality="10"
user.sha256="d49ba58e3b30f4ef8c81d19ce960edcf6552977bb8adb79b5b9a677ba9a54b2b"
user.size="1645645"

If I attempt to grep the output of find using the + method, say for a value of "10" on the quality, I will get results like this:

# file: /path/to/file/test.mp4
user.md5="8cf97b888e6fdbed27b02233cd6779f5"
user.quality="12"
user.sha256="613d16b2a0270e2e5f81cfd58b1eacf710a65b82ce2dab49a1e415275440f429"
user.size="1645645"

# file: /path/to/file/test1.mp4
user.md5="3c5a39f1ceefce1e124bcd6786a99155"
user.quality="10"
user.sha256="0d7128a7642d24ea879bbfb3de812b7939b618d8af639f07d5104c954c8049c3"
user.size="5674567"

# file: /path/to/file/test2.mp4
user.md5="0e29a7f555af518872771689e28d998d"
user.quality="6"
user.sha256="d49ba58e3b30f4ef8c81d19ce960edcf6552977bb8adb79b5b9a677ba9a54b2b"
user.size="15645"

All files that find locates are returned and the string to be searched from grep, in this example user.quality="10", is highlighted, but the other files test.mp4 and test2.mp4 still have the output printed post-grep. In other words, find may locate 1000 mp4 files of which maybe 20 have a user.quality="10" entry, but even applying grep to search for that string still returns 1000 filenames (after sed).

This does not happen when using \;. The only thing I would get out from grep would be:

# file: /path/to/file/test1.mp4
user.md5="3c5a39f1ceefce1e124bcd6786a99155"
user.quality="10"
user.sha256="0d7128a7642d24ea879bbfb3de812b7939b618d8af639f07d5104c954c8049c3"
user.size="5674567"

This is the expected behaviour.


Solution

  • xargs vs find -exec

    To me it seems like you want to use xargs instead of find -exec {} \; to speed things up.

    Yes, xargs is faster than find -exec {} \;, not because it does the same work more efficiently, but because it does different work!

    • find -exec {} \; calls getfattr once for each file (getfattr file1, then getfattr file2, and so on).
    • xargs crams as many files into one call as possible (getfattr file1 file2 file3 ...).
      The same behavior (and even more speedup) can be achieved with find -exec {} + -- no need to use xargs for that.
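To see the difference in the number of calls, here is a small sketch that uses echo as a stand-in for getfattr (an assumption, so that it runs without any extended attributes set up). The temporary directory and the three file names are made up for illustration.

```shell
#!/bin/sh
# Count how often find starts the command in each mode.
# 'echo' stands in for getfattr; each echo call prints exactly one line.
tmpdir=$(mktemp -d)
touch "$tmpdir/a.mp4" "$tmpdir/b.mp4" "$tmpdir/c.mp4"

# \; starts one process per file, so there is one output line per file.
calls_semicolon=$(find "$tmpdir" -type f -iname '*.mp4' -exec echo {} \; | wc -l | tr -d ' ')

# + passes all paths to a single process, so echo prints them on one line.
calls_plus=$(find "$tmpdir" -type f -iname '*.mp4' -exec echo {} + | wc -l | tr -d ' ')

echo "calls with \\; : $calls_semicolon"
echo "calls with +  : $calls_plus"
rm -r "$tmpdir"
```

With only three short paths everything fits into one batch; with very many files, find (like xargs) may split the list into several calls.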

    With xargs and find -exec {} + you lose control over the output format. There is only one call of getfattr so that program decides what to print between file1, file2 and so on. getfattr has no option to customize its output format.

    No problem! You can ...

    Parse getfattr's output

    ... pretty easily.
    For starters, we assume that all path names are pretty normal. Spaces, *, and ? are ok though. For really unusual path names containing backslashes and linebreaks see the last section.

    If you output only the relevant attribute using -n user.md5 instead of -d, then you know that the output (if any) for each file is always of the form

    # file: path in a single line
    user.md5=encoded value of the attribute
    

    Files without the attribute user.md5 are not printed at all. They cause a warning on stderr which can be suppressed by 2> /dev/null.

    Now, grep for matching attributes. Use grep -B1 to print the line above each match (i.e. the path) too. Then use sed -n or grep -o to extract the filenames.

    find -type f -iname '*.mp4' -exec getfattr -n user.md5 --absolute-names {} + 2> /dev/null |
    grep -B1 -Fx "user.md5=\"$input_search_hash\"" |
    sed -n 's/^# file: //p'
    

    The above command prints the paths of all mp4 files having the attribute user.md5 with value $input_search_hash.
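The question also mentions matching several attributes at once. One possible sketch for that case, not tested against real getfattr output here, is to keep -d and let awk read the dump in paragraph mode (RS set to the empty string), so that each file's block is one record. It assumes that getfattr separates the blocks with an empty line, as in the example output in the question. The helper name match_xattrs and the shortened sample data are made up for illustration.

```shell
#!/bin/sh
# match_xattrs (hypothetical helper): read `getfattr -d` output on stdin and
# print the path of every file whose block contains ALL of the given lines.
match_xattrs() {
  awk -v RS= -F '\n' '
    BEGIN {
      # Keep the wanted lines, then hide them from awk so it reads stdin.
      for (i = 1; i < ARGC; i++) want[i] = ARGV[i]
      n = ARGC - 1
      ARGC = 1
    }
    {
      for (i = 1; i <= n; i++) {
        found = 0
        for (f = 2; f <= NF; f++) if ($f == want[i]) { found = 1; break }
        if (!found) next          # one wanted line is missing: skip this file
      }
      sub(/^# file: /, "", $1)    # first line of the block holds the path
      print $1
    }
  ' "$@"
}

# Sample data shaped like the output shown in the question (shortened).
sample_output='# file: /path/to/file/test.mp4
user.md5="8cf97b888e6fdbed27b02233cd6779f5"
user.quality="12"

# file: /path/to/file/test1.mp4
user.md5="3c5a39f1ceefce1e124bcd6786a99155"
user.quality="10"

# file: /path/to/file/test2.mp4
user.md5="0e29a7f555af518872771689e28d998d"
user.quality="6"
'

result=$(printf '%s' "$sample_output" | match_xattrs 'user.quality="10"')
echo "$result"
```

In the real pipeline the sample would be replaced by find -type f -iname '*.mp4' -exec getfattr -d --absolute-names {} + 2> /dev/null, piped into match_xattrs with one argument per required attribute line.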

    Handling Unusual Filenames

    At least my version (getfattr 2.4.48 by Andreas Gruenbacher) on Debian 10 always prints the file name in a single line. Linebreaks are encoded using \012 and backslashes are encoded using \134. Therefore, safe processing of those files is possible.

    Above command works, but prints only the encoded file names. To get the actual filenames you have to extend the sed command or add another command to interpret octal escape sequences. For me, getfattr only escapes \n, \r and \\, thus sed 's:\\012:\n:g;s:\\015:\r:g;s:\\134:\\:g' should be sufficient for printing. For further processing, you may want to use tr \\n \\0 | sed -z ... instead, such that filenames are separated by null bytes.
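As a sketch of that decoding step, the two commands can be wrapped in a small function. The name decode_getfattr_names and the two sample names are made up for illustration. It assumes GNU sed (for -z and for \n and \r in the replacement) and that only the three escapes mentioned above occur.

```shell
#!/bin/sh
# decode_getfattr_names (hypothetical helper): read encoded names, one per
# line, and write the real names separated by NUL bytes.
decode_getfattr_names() {
  # First make the records NUL-separated, then undo the octal escapes.
  # \134 (backslash) is decoded last so that a decoded backslash cannot
  # combine with following digits into a new escape.
  tr '\n' '\0' | sed -z 's:\\012:\n:g;s:\\015:\r:g;s:\\134:\\:g'
}

# Two encoded sample names: one with a linebreak, one with a backslash.
decoded=$(printf '%s\n' 'a\012b.mp4' 'c\134d.mp4' |
  decode_getfattr_names |
  xargs -0 printf '<%s>')
printf '%s\n' "$decoded"
```

The NUL-separated output can be consumed by any tool that accepts such input, for example xargs -0 as shown here.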

    To test which characters are escaped for you, create a filename containing all allowed bytes and let getfattr print its name:

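    # Build a name from the bytes 1 to 255, without "/" (not allowed in a file name).
    f=$(printf $(printf '\\%o' $(seq 1 255)) | tr -d /)
    touch "$f"
    setfattr -n user.md5 -v 123 "$f"
    # The "# file:" line of the output shows which bytes getfattr escapes.
    getfattr -n user.md5  "$f"
    rm "$f"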