Sorry if the title isn't very fitting, I have actually multiple question, and I wasn't able to separate them.
I wrote a simplified version of the command line "find" program running on macOS that simply looks for every instance of a file name in a specified path and prints the corresponding file paths if any, and date of last modification. It's a single thread program. I used readdir and lstat to get the necessary file info.
Then I compared the performance of my program, let's call it "myfind"(printing as I said both file path and last modified time), to "find"(used with "find path -name filename", to print only the resulting file paths). "myfind" took 1m 40s and "find" just 40s. Both times were consistent during consecutive tries.
Unhappy about the result, I made some "optimizations" that didn't do anything at all (like removing recursion and using a simple stack holding some variables instead).
This is the code I wrote, error checking removed (executing with or without error checking doesn't change execution time):
struct recstruct{
DIR *st_dir;
char pathname[512];
};
//usage: ./execname path filename
int main(int argc, char *argv[]){
chdir(argv[1]);
myfind(argv[2]);
return 0;
}
int myfind(char *name){
DIR *curr_dir = opendir(".");
int stk_i = 0;
struct recstruct stk[128];
struct dirent *next_entry;
struct stat info;
char *entry_name;
char path[512];
errno = 0;
//ignores "." and ".."
readdir(curr_dir);
readdir(curr_dir);
//gets next directory entry until it has fully explored the path
while((next_entry = readdir(curr_dir)) != NULL || stk_i != 0){
//goes back to previous directory if there is any and it's done with the current one
if(next_entry == NULL){
closedir(curr_dir);
stk_i--;
chdir(stk[stk_i].pathname);
curr_dir = stk[stk_i].st_dir;
continue;
}
//gets next directory entry and checks if its name is the one we are looking for
entry_name = next_entry->d_name;
lstat(entry_name, &info);
if(strncmp(name, entry_name, 127) == 0){
realpath(entry_name, path);
printf("path:%s\nlast modified:%s", path, ctime(&(info.st_mtime)));
}
//if next_entry is a dir, save current environment and explore the new dir
if(S_ISDIR(info.st_mode)){
getcwd(stk[stk_i].pathname, 512);
stk[stk_i].st_dir = curr_dir;
stk_i++;
chdir(entry_name);
curr_dir = opendir(".");
readdir(curr_dir);
readdir(curr_dir);
}
}
closedir(curr_dir);
return 0;
}
Then I removed from "myfind" the part where I printed the last modified date by removing all lstat calls. The execution time dropped to 30 seconds, and dropped to about 3-6 seconds on consecutive calls (program calls from terminal, not function calls. The improvement stayed there even after closing and reopening the terminal).
This behavior also improved execution time on consecutive calls that looked for files close in the directory tree, which seems a likely use scenario for this type of program, so I would like to be able to understand it more.
So what I'm actually asking is:
How does this improvement on execution time on consecutive calls work exactly? Why isn't it there in neither "find", nor "myfind" when using lstat?
Is there any other thing I could change in my code to make it more optimized, while making it also print the last modified date? I'm ok with making it platform dependent, too.
I'm operating on an SSD, and only looking for 1 filename every execution
Here is an implementation using fts_
as suggested by @EricPostpischil. It assumes user supplied a directory, if you want to verify that do the first fts_read()
before the loop and check fts_info
. It also assumes that the filename is a basename (doesn't contain any slashes). You could check that, too.
Look over the options to fts_open()
to ensure this is the behavior you want.
It's only checking that the filename matches a regular file; find -name
will also match against directory names. It's easy to address that just before the for
-loop.
#include <fts.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
int myfind(char *const path, const char *filename) {
FTS *ftsp = fts_open(
(char *const []) { path, NULL},
FTS_LOGICAL | FTS_NOCHDIR | FTS_NOSTAT,
NULL
);
if(!ftsp) {
perror("");
goto err;
}
for(FTSENT *ftsent = fts_read(ftsp); ftsent; ftsent = fts_read(ftsp)) {
if(ftsent->fts_info & FTS_D)
for(ftsent = fts_children(ftsp, FTS_NAMEONLY); ftsent; ftsent = ftsent->fts_link)
if(!strcmp(filename, ftsent->fts_name))
printf("%s%s\n", ftsent->fts_path, ftsent->fts_name);
}
if(errno) {
perror("");
goto err;
}
fts_close(ftsp);
return EXIT_SUCCESS;
err:
if(ftsp)
fts_close(ftsp);
return EXIT_FAILURE;
}
int main(int argc, char *argv[]){
if(argc != 3) {
printf("usage: path filename\n");
return 0;
}
return myfind(argv[1], argv[2]);
}
and here are some example runs (with a warm file cache to be comparable):
$ time find . -name non-existing_file
real 0m5.296s
user 0m0.865s
$ time ./myfind . non-existing_file
real 0m4.259s
user 0m0.942s
$ time find . -name AnsiballZ_file.py
./.ansible/tmp/ansible-tmp-1670391304.3514712-709042-260172808017632/AnsiballZ_file.py
./.ansible/tmp/ansible-tmp-1658890524.6326995-109370-280029288970992/AnsiballZ_file.py
./.ansible/tmp/ansible-tmp-1567991519.7013745-230514371888007/AnsiballZ_file.py
./.ansible/tmp/ansible-tmp-1658888271.8413315-108112-177019459387621/AnsiballZ_file.py
real 0m5.445s
user 0m0.847s
sys 0m0.961s
$ time ./myfind . AnsiballZ_file.py
./.ansible/tmp/ansible-tmp-1670391304.3514712-709042-260172808017632/AnsiballZ_file.py
./.ansible/tmp/ansible-tmp-1658890524.6326995-109370-280029288970992/AnsiballZ_file.py
./.ansible/tmp/ansible-tmp-1567991519.7013745-230514371888007/AnsiballZ_file.py
./.ansible/tmp/ansible-tmp-1658888271.8413315-108112-177019459387621/AnsiballZ_file.py
real 0m7.481s
user 0m0.882s
sys 0m2.713s
I see ~3.9 to ~7.6 real time runs on my system so should be comparable to find.