Tags: algorithm, depth-first-search, breadth-first-search

Should I use breadth-first or depth-first search when scanning a filesystem for a predetermined number of errors?


I have a large filesystem that I need to traverse for errors. Each file knows whether or not it contains an error, so I simply need to visit each node and check whether there is an error there. Also, each directory knows the total number of errors that exist within it, so a search can be terminated once the given number of errors has been found, and a directory need not be traversed if it contains no errors.

My question is whether the better solution would be a depth-first or a breadth-first search. The height of the tree is unknown, which I know usually favors BFS, but given that we know whether a directory contains errors before traversing it, I am not sure whether that advantage still applies.

NOTE: This is NOT a homework assignment. It is a requirement for a script that my boss has requested that I write.

EDIT 1: Time efficiency is far more important than space efficiency, as the script will primarily be run overnight, and therefore can essentially use all of the system memory, if necessary.

EDIT 2: Though it seems the popular answer to my problem is BFS, I am having trouble understanding why it would not be a DFS problem. Since (A) all errors eventually need to be reached and (B) we know whether a directory contains errors, BFS's protection against rabbit holes does not really apply. With that in mind, the only real difference seems to be the space used, which would make DFS better. Can anyone give a good argument as to why this is not the case?
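
To make the setup concrete, here is roughly the pruned, early-terminating depth-first traversal I have in mind, in Python. The `error_count` and `has_error` helpers are placeholders, since I haven't said how the error information is actually exposed:

```python
import os

# Placeholder hooks: however the filesystem actually reports its errors.
def error_count(dir_path):
    """Total number of errors anywhere under dir_path (fill in)."""
    raise NotImplementedError

def has_error(file_path):
    """Whether this particular file contains an error (fill in)."""
    raise NotImplementedError

def find_errors_dfs(root, target):
    """Depth-first traversal that skips error-free directories and
    stops as soon as `target` errors have been found."""
    found = []
    stack = [root]
    while stack and len(found) < target:
        current = stack.pop()
        for entry in os.scandir(current):
            if len(found) >= target:
                break
            if entry.is_dir(follow_symlinks=False):
                # Prune: only descend into directories that report errors.
                if error_count(entry.path) > 0:
                    stack.append(entry.path)
            elif entry.is_file(follow_symlinks=False) and has_error(entry.path):
                found.append(entry.path)
    return found
```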


Solution

  • It depends on a few things.

    • Could your directories contain links, and would you traverse them? If so, is it possible for the links to form a loop? In that case BFS makes more sense if you want to skip cycle-checking: a DFS without cycle detection can descend into a loop forever and never reach the other branches, while BFS still reaches every node at some finite depth. Otherwise, it makes no difference. (If you do need to follow links, see the cycle-guard sketch below.)

    • How are the errors distributed? Could it be that one directory contains most of the errors while the others are almost error-free? In that case BFS is more likely to finish sooner, because it works through all directories little by little. With DFS you could spend a long time in one huge directory tree whose single error sits in the very bottom leaves, only to find that the next directory held all the errors you needed right at level 1. If the errors are distributed more evenly, again it doesn't matter which you use.

    • How big is your structure? If you have a tree with branching factor n (n subdirectories per directory) and depth d, BFS can take O(n^d) memory, since its queue can hold an entire level of the tree, while DFS can be written to use only O(d) memory (or O(d*n) in a simpler implementation), which in really huge directory trees can make a difference. (See the frontier sketch below.)

    My general feeling, reading your question, is BFS, but it is still up to you to decide based on the properties of your problem.
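
    If links can form loops and you decide you do need to follow them, a small visited set keyed on (device, inode) makes either traversal order safe. This is only a sketch and is independent of however your error information is exposed:

    ```python
    import os

    def iter_dirs_safely(root):
        """Yield every directory reachable from `root`, following symlinks,
        guarding against symlink loops by remembering (device, inode) pairs."""
        seen = set()
        stack = [root]
        while stack:
            path = stack.pop()
            st = os.stat(path)              # follows symlinks
            key = (st.st_dev, st.st_ino)
            if key in seen:                 # already visited: part of a loop
                continue
            seen.add(key)
            yield path
            for entry in os.scandir(path):
                if entry.is_dir(follow_symlinks=True):
                    stack.append(entry.path)
    ```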
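
    As for memory, the only structural difference between the two searches is whether the frontier is popped as a queue or as a stack; the BFS frontier can grow to roughly the width of a whole level, while the DFS frontier stays around depth times branching factor. Here is a sketch, reusing the hypothetical `error_count`/`has_error` hooks from the question:

    ```python
    from collections import deque
    import os

    def pruned_search(root, target, error_count, has_error, breadth_first=True):
        """Find up to `target` erroneous files, skipping directories that
        report no errors.  BFS and DFS differ only in how the frontier is
        popped: FIFO (queue) for BFS, LIFO (stack) for DFS."""
        frontier = deque([root])
        found = []
        while frontier and len(found) < target:
            current = frontier.popleft() if breadth_first else frontier.pop()
            for entry in os.scandir(current):
                if len(found) >= target:
                    break
                if entry.is_dir(follow_symlinks=False):
                    if error_count(entry.path) > 0:   # prune error-free directories
                        frontier.append(entry.path)
                elif entry.is_file(follow_symlinks=False) and has_error(entry.path):
                    found.append(entry.path)
        return found
    ```

    Since time matters far more than space for you, timing both variants overnight on a representative subtree would settle the question empirically.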