I have a script that I'm trying to optimize with xargs. The current version uses find with -exec to call the command:
find -type f -iname "*.mp4" -print0 -printf '\n' -exec getfattr -d --absolute-names {} \;
after which I can pipe to grep with something like:
grep -z -P user\.md5\=\"$input_search_hash\"
to filter the results while keeping the whole output with -z.
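Put together, the working version looks roughly like this (I've left out the sed step that extracts the filename afterwards):
find -type f -iname "*.mp4" -print0 -printf '\n' -exec getfattr -d --absolute-names {} \; |
    grep -z -P "user\.md5=\"$input_search_hash\""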
I need the whole output returned from getfattr to be "preserved" per file, because I need the filename for which there is a matching extended attribute, which is then passed to sed to extract it. There are also cases where I chain multiple grep commands if I need to search for files that match on several extended attributes. The problem is that the output of:
find -type f -iname "*.mp4" -print0 | xargs -0 getfattr -d --absolute-names
is not formatted in such a way that grep will filter it like this. It does work with the -exec method. Can I pass an additional option to xargs, or pipe through some additional command, to format the output so that grep properly replicates the behaviour of -exec? I'm guessing I need some sort of line break before feeding to grep, like what -printf '\n' does in the -exec method. I would just use getfattr to "search" the extended attributes instead of needing to grep the output at all, but it has no way to do this by supplying an xattr name and value.
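For reference, the whole xargs-based pipeline I'm attempting is roughly (using the same grep as above):
find -type f -iname "*.mp4" -print0 |
    xargs -0 getfattr -d --absolute-names |
    grep -z -P "user\.md5=\"$input_search_hash\""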
The input comes from the find command, which is a list of video files in an arbitrary directory structure. The output of getfattr for each file looks like this:
# file: /path/to/file/test.mp4
user.md5="0e29a7f555af518872771689e28d998d"
user.quality="10"
user.sha256="d49ba58e3b30f4ef8c81d19ce960edcf6552977bb8adb79b5b9a677ba9a54b2b"
user.size="1645645"
If I attempt to grep the output when using the + method, say for a value of "10" on the quality, I get results like this:
# file: /path/to/file/test.mp4
user.md5="8cf97b888e6fdbed27b02233cd6779f5"
user.quality="12"
user.sha256="613d16b2a0270e2e5f81cfd58b1eacf710a65b82ce2dab49a1e415275440f429"
user.size="1645645"
# file: /path/to/file/test1.mp4
user.md5="3c5a39f1ceefce1e124bcd6786a99155"
user.quality="10"
user.sha256="0d7128a7642d24ea879bbfb3de812b7939b618d8af639f07d5104c954c8049c3"
user.size="5674567"
# file: /path/to/file/test2.mp4
user.md5="0e29a7f555af518872771689e28d998d"
user.quality="6"
user.sha256="d49ba58e3b30f4ef8c81d19ce960edcf6552977bb8adb79b5b9a677ba9a54b2b"
user.size="15645"
All files that find locates are returned and the string searched for with grep, in this example user.quality="10", is highlighted, but the other files, test.mp4 and test2.mp4, still have their output printed after grep. In other words, find may locate 1000 mp4 files of which maybe 20 have a user.quality="10" entry, but applying grep to search for that string still returns all 1000 filenames (after sed).
This does not happen when using \;. The only thing I get out of grep is:
# file: /path/to/file/test1.mp4
user.md5="3c5a39f1ceefce1e124bcd6786a99155"
user.quality="10"
user.sha256="0d7128a7642d24ea879bbfb3de812b7939b618d8af639f07d5104c954c8049c3"
user.size="5674567"
This is the expected behaviour.
xargs vs find -exec

To me it seems like you want to use xargs instead of find -exec {} \; to speed things up. Yes, xargs is faster than find -exec {} \;, not because it does the same work more efficiently, but because it does different work!
find -exec {} \; calls the command once for each file (getfattr file1, then getfattr file2, and so on). xargs crams as many files into one call as possible (getfattr file1 file2 file3 ...). find -exec {} + does the same batching -- no need to use xargs for that.

With xargs and find -exec {} + you lose control over the output format. There is only one call of getfattr, so that program decides what to print between file1, file2 and so on, and getfattr has no option to customize its output format.
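For illustration, here are the three invocation styles side by side (built from the commands in the question; they differ only in how often getfattr is started):

# one getfattr call per file -- find lets you print your own separators (e.g. -printf '\n')
find -type f -iname '*.mp4' -exec getfattr -d --absolute-names {} \;

# one getfattr call for as many files as fit on the command line
find -type f -iname '*.mp4' -exec getfattr -d --absolute-names {} +

# effectively the same batching, done by xargs
find -type f -iname '*.mp4' -print0 | xargs -0 getfattr -d --absolute-names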
No problem! You can still filter getfattr's output pretty easily.
For starters, we assume that all path names are pretty normal. Spaces, *, and ? are fine, though. For really unusual path names containing backslashes and linebreaks, see the last section.
If you output only the relevant attribute using -n user.md5 instead of -d, then you know that the output (if any) for each file is always of the form

# file: path in a single line
user.md5=encoded value of the attribute

Files without the attribute user.md5 are not printed at all. They cause a warning on stderr, which can be suppressed with 2> /dev/null.
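For example (hypothetical paths; the exact wording of the stderr warning may differ between getfattr versions):

$ getfattr -n user.md5 --absolute-names /path/to/file/test.mp4
# file: /path/to/file/test.mp4
user.md5="0e29a7f555af518872771689e28d998d"

$ getfattr -n user.md5 --absolute-names /path/to/file/other.mp4
/path/to/file/other.mp4: user.md5: No such attribute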
Now, grep for matching attributes. Use grep -B1 to print the line above each match (i.e. the path) too. Then use sed -n or grep -o to extract the filenames.
find -type f -iname '*.mp4' -exec getfattr -n user.md5 --absolute-names {} + 2> /dev/null |
grep -B1 -Fx "user.md5=\"$input_search_hash\"" |
sed -n 's/^# file: //p'
The above command prints the paths of all mp4 files that have the attribute user.md5 with value $input_search_hash.
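The question also mentions chaining several greps to match on more than one attribute. With the -n approach, one possible way to do that (just a sketch, reusing the user.quality="10" criterion from the question and assuming the "pretty normal" path names from above) is to run a second getfattr check on each candidate:

find -type f -iname '*.mp4' -exec getfattr -n user.md5 --absolute-names {} + 2> /dev/null |
grep -B1 -Fx "user.md5=\"$input_search_hash\"" |
sed -n 's/^# file: //p' |
while IFS= read -r f; do
    # keep only the candidates whose user.quality attribute is exactly "10"
    getfattr -n user.quality --absolute-names "$f" 2> /dev/null |
    grep -qFx 'user.quality="10"' && printf '%s\n' "$f"
done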
At least my version (getfattr 2.4.48 by Andreas Gruenbacher) on Debian 10 always prints the file name on a single line: linebreaks are encoded as \012 and backslashes as \134. Therefore, safe processing of those files is possible.
The above command works, but prints only the encoded file names. To get the actual file names you have to extend the sed command or add another command to interpret the octal escape sequences. For me, getfattr only escapes \n, \r and \\, thus sed 's:\\012:\n:g;s:\\015:\r:g;s:\\134:\\:g' should be sufficient for printing. For further processing, you may want to use tr \\n \\0 | sed -z ... instead, so that the file names are separated by null bytes, as sketched below.
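A sketch of the complete pipeline with decoded, NUL-separated file names might then look like this (the final xargs stage is only a placeholder for whatever NUL-aware processing comes next):

find -type f -iname '*.mp4' -exec getfattr -n user.md5 --absolute-names {} + 2> /dev/null |
grep -B1 -Fx "user.md5=\"$input_search_hash\"" |
sed -n 's/^# file: //p' |
tr '\n' '\0' |                                      # one NUL-terminated record per (encoded) file name
sed -z 's:\\012:\n:g;s:\\015:\r:g;s:\\134:\\:g' |   # decode getfattr's octal escapes inside each record
xargs -0 -r printf '%s\n'                           # placeholder for a NUL-aware consumer (-r is GNU xargs)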
To test which characters are escaped for you, create a file name containing all allowed bytes and let getfattr print its name:
# build a file name containing every byte from 1 to 255, minus "/" (NUL cannot appear in file names anyway)
f=$(printf $(printf '\\%o' $(seq 1 255)) | tr -d /)
touch "$f"
# attach a test attribute, then see how getfattr encodes the unusual name
setfattr -n user.md5 -v 123 "$f"
getfattr -n user.md5 "$f"
rm "$f"