On a 64-bit CentOS Linux server, I am running a GNU find command on several folders, each containing a similar subfolder structure. The structure is:
/my/group/folder/project_123/project_123-12345678/*/*file_pattern_at_this_level*
/my/group/folder/project_234/project_234-23456789/*/*file_pattern_at_this_level*
The /*/ component indicates that each project folder contains a number of subfolders with varying names.
I have tried adding the final asterisk to the paths and then limiting the find command with -mindepth N and -maxdepth N:
find $folder1 $folder2 $folder3 -mindepth 1 -maxdepth 1 -name "*file_pattern*"
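For reference, a minimal sketch of what the $folderN variables are assumed to hold here (the exact values are my guess based on the layout above; the trailing /* is expanded by the shell, so the matching files sit exactly one level below each argument passed to find):

# The unquoted expansion of $folder1 and $folder2 globs to the
# varying-name subfolders, hence -mindepth 1 -maxdepth 1.
folder1=/my/group/folder/project_123/project_123-12345678/*
folder2=/my/group/folder/project_234/project_234-23456789/*
find $folder1 $folder2 -mindepth 1 -maxdepth 1 -name "*file_pattern*"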
However, the tests run on a server node that also hosts other jobs, so it is difficult to get a fair performance comparison. On top of that, filesystem caching after the first command skews the results: whichever variant runs first appears slow, and the equivalent variant run afterwards appears faster.
This is a multicore node, so what else could I try to make this type of command faster?
"Actually commands like find and grep are almost always IO-bound: the disk is the bottleneck, not the CPU. In such cases, if you run several instances in parallel, they will compete for I/O bandwidth and cache, and so they will be slower." - https://unix.stackexchange.com/a/111409
Don't worry about finding the files; worry about what you need to do with them. That part is what you can parallelize, with parallel or xargs, as in the sketch below.
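For example, a minimal sketch of that idea (the grep call stands in for whatever per-file work you actually need, and the job count of 8 is an arbitrary choice):

# One find walks the directories; xargs fans the per-file work out
# across 8 parallel workers, passing up to 100 files per invocation.
find $folder1 $folder2 $folder3 -mindepth 1 -maxdepth 1 \
    -name "*file_pattern*" -print0 |
    xargs -0 -P 8 -n 100 grep -l "some_text"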
If you still want to pursue this, you can use parallel together with find by passing it a list of directories. parallel will then spawn several find processes (the -j option sets how many jobs run simultaneously) to work through the queue. In that scenario you will want to redirect standard output to a file so you can review the results later, or not, depending on your use case.
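A rough sketch of that combination, assuming the layout above (so the files sit two levels below each project_NNN-NNNNNNNN directory); the glob, the job count of 4, and the output file name are placeholders:

# One find process per project directory, at most 4 running at a time;
# all matches are collected into found_files.txt for later review.
parallel -j 4 'find {} -mindepth 2 -maxdepth 2 -name "*file_pattern*"' \
    ::: /my/group/folder/project_*/project_*-* > found_files.txt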