Search code examples
cfilesystems

What is the most efficient way to look for a file by name and print its attributes?


Sorry if the title isn't very fitting, I have actually multiple question, and I wasn't able to separate them.

I wrote a simplified version of the command line "find" program running on macOS that simply looks for every instance of a file name in a specified path and prints the corresponding file paths if any, and date of last modification. It's a single thread program. I used readdir and lstat to get the necessary file info.

Then I compared the performance of my program, let's call it "myfind"(printing as I said both file path and last modified time), to "find"(used with "find path -name filename", to print only the resulting file paths). "myfind" took 1m 40s and "find" just 40s. Both times were consistent during consecutive tries.

Unhappy about the result, I made some "optimizations" that didn't do anything at all (like removing recursion and using a simple stack holding some variables instead).

This is the code I wrote, error checking removed (executing with or without error checking doesn't change execution time):

struct recstruct{
    DIR *st_dir;
    char pathname[512];
};
//usage: ./execname path filename
int main(int argc, char *argv[]){
    chdir(argv[1]);
    myfind(argv[2]);
    return 0;
}
int myfind(char *name){
    DIR *curr_dir = opendir(".");
    int stk_i = 0;
    struct recstruct stk[128];
    struct dirent *next_entry;
    struct stat info;
    char *entry_name;
    char path[512];
    errno = 0;
    //ignores "." and ".."
    readdir(curr_dir);
    readdir(curr_dir);


    //gets next directory entry until it has fully explored the path
    while((next_entry = readdir(curr_dir)) != NULL || stk_i != 0){
        //goes back to previous directory if there is any and it's done with the current one
        if(next_entry == NULL){
            closedir(curr_dir);
            stk_i--;
            chdir(stk[stk_i].pathname);
            curr_dir = stk[stk_i].st_dir;
            continue;
        }
        //gets next directory entry and checks if its name is the one we are looking for
        entry_name = next_entry->d_name;
        lstat(entry_name, &info);
        if(strncmp(name, entry_name, 127) == 0){
            realpath(entry_name, path);
            printf("path:%s\nlast modified:%s", path, ctime(&(info.st_mtime)));
        }
        //if next_entry is a dir, save current environment and explore the new dir
        if(S_ISDIR(info.st_mode)){ 
            getcwd(stk[stk_i].pathname, 512);
            stk[stk_i].st_dir = curr_dir;
            stk_i++;
            chdir(entry_name);
            curr_dir = opendir(".");
            readdir(curr_dir);
            readdir(curr_dir);
        }
    }
    closedir(curr_dir);
    return 0;
}

Then I removed from "myfind" the part where I printed the last modified date by removing all lstat calls. The execution time dropped to 30 seconds, and dropped to about 3-6 seconds on consecutive calls (program calls from terminal, not function calls. The improvement stayed there even after closing and reopening the terminal).

This behavior also improved execution time on consecutive calls that looked for files close in the directory tree, which seems a likely use scenario for this type of program, so I would like to be able to understand it more.

So what I'm actually asking is:

  1. How does this improvement on execution time on consecutive calls work exactly? Why isn't it there in neither "find", nor "myfind" when using lstat?

  2. Is there any other thing I could change in my code to make it more optimized, while making it also print the last modified date? I'm ok with making it platform dependent, too.

I'm operating on an SSD, and only looking for 1 filename every execution


Solution

  • Here is an implementation using fts_ as suggested by @EricPostpischil. It assumes user supplied a directory, if you want to verify that do the first fts_read() before the loop and check fts_info. It also assumes that the filename is a basename (doesn't contain any slashes). You could check that, too.

    Look over the options to fts_open() to ensure this is the behavior you want.

    It's only checking that the filename matches a regular file; find -name will also match against directory names. It's easy to address that just before the for-loop.

    #include <fts.h>
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    
    int myfind(char *const path, const char *filename) {
        FTS *ftsp = fts_open(
            (char *const []) { path, NULL},
            FTS_LOGICAL | FTS_NOCHDIR | FTS_NOSTAT,
            NULL
        );
        if(!ftsp) {
            perror("");
            goto err;
        }
        for(FTSENT *ftsent = fts_read(ftsp); ftsent; ftsent = fts_read(ftsp)) {
            if(ftsent->fts_info & FTS_D)
                for(ftsent = fts_children(ftsp, FTS_NAMEONLY); ftsent; ftsent = ftsent->fts_link)
                    if(!strcmp(filename, ftsent->fts_name))
                        printf("%s%s\n", ftsent->fts_path, ftsent->fts_name);
        }
        if(errno) {
            perror("");
            goto err;
        }
        fts_close(ftsp);
        return EXIT_SUCCESS;
    err:
        if(ftsp)
            fts_close(ftsp);
        return EXIT_FAILURE;
    }
    
    int main(int argc, char *argv[]){
        if(argc != 3) {
            printf("usage: path filename\n");
            return 0;
        }
        return myfind(argv[1], argv[2]);
    }
    

    and here are some example runs (with a warm file cache to be comparable):

    $ time find . -name non-existing_file
    real    0m5.296s
    user    0m0.865s
    
    $ time ./myfind . non-existing_file
    real    0m4.259s
    user    0m0.942s
    
    $ time find  . -name AnsiballZ_file.py
    ./.ansible/tmp/ansible-tmp-1670391304.3514712-709042-260172808017632/AnsiballZ_file.py
    ./.ansible/tmp/ansible-tmp-1658890524.6326995-109370-280029288970992/AnsiballZ_file.py
    ./.ansible/tmp/ansible-tmp-1567991519.7013745-230514371888007/AnsiballZ_file.py
    ./.ansible/tmp/ansible-tmp-1658888271.8413315-108112-177019459387621/AnsiballZ_file.py
    
    real    0m5.445s
    user    0m0.847s
    sys     0m0.961s
    
    $ time ./myfind . AnsiballZ_file.py
    ./.ansible/tmp/ansible-tmp-1670391304.3514712-709042-260172808017632/AnsiballZ_file.py
    ./.ansible/tmp/ansible-tmp-1658890524.6326995-109370-280029288970992/AnsiballZ_file.py
    ./.ansible/tmp/ansible-tmp-1567991519.7013745-230514371888007/AnsiballZ_file.py
    ./.ansible/tmp/ansible-tmp-1658888271.8413315-108112-177019459387621/AnsiballZ_file.py
    
    real    0m7.481s
    user    0m0.882s
    sys     0m2.713s
    

    I see ~3.9 to ~7.6 real time runs on my system so should be comparable to find.