I am writing a C program that calculates the total size of the files in a given directory. I know that each file points to an inode, so I plan to use stat
to find the inode number and the file size. Since I want to avoid counting a file twice when multiple hard links and/or symlinks refer to the same inode, I want to store the inodes in an array. The problem is that, to check whether the inode for a given file is new, I would have to iterate through the inode array each time, giving a runtime of roughly O(n^2)
. I want to avoid overly complex structures such as red-black trees. Is there a faster, more clever way to implement this? I know there are system tools that do this, and I want to know how they implement something like this.
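To make the question concrete, here is a minimal sketch of the approach described above: stat() each path, record its st_ino, and linearly scan the already-seen inodes before adding the file's size. The fixed array size and the function names are hypothetical, and for simplicity this ignores the fact that inode numbers are only unique per filesystem (strictly, the (st_dev, st_ino) pair identifies a file).

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

#define MAX_FILES 1024          /* arbitrary cap for this sketch */

static ino_t seen[MAX_FILES];
static size_t nseen = 0;

/* Linear scan of the seen-inode array: this is the O(n) step that,
 * repeated for every file, gives the overall O(n^2) runtime. */
static int already_seen(ino_t ino)
{
    for (size_t i = 0; i < nseen; i++)
        if (seen[i] == ino)
            return 1;
    return 0;
}

/* Adds path's size to total unless its inode was counted before. */
long long add_file(const char *path, long long total)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return total;           /* skip unreadable entries */
    if (!already_seen(st.st_ino) && nseen < MAX_FILES) {
        seen[nseen++] = st.st_ino;
        total += (long long)st.st_size;
    }
    return total;
}
```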
Even plain binary search trees are a good choice, since under random data they stay relatively balanced. They are also a very simple structure to implement.
In general, the structure of choice is the hash table, with constant average search time. The challenge is finding a good hash function for your data. Hash tables are not difficult to implement, and there are plenty of good libraries that provide them.
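As a sketch of how little code this needs: a fixed-size open-addressing hash set over inode numbers, with linear probing. The table size and the trivial low-bits hash are assumptions for illustration; a real program would size the table to the expected file count or grow it dynamically.

```c
#include <sys/types.h>

#define TABLE_SIZE 4096         /* power of two, so masking works as modulo */

static ino_t table[TABLE_SIZE];
static unsigned char used[TABLE_SIZE];

/* Inserts ino into the set. Returns 1 if it was newly added,
 * 0 if it was already present. Average O(1) while the table is sparse. */
int inode_insert(ino_t ino)
{
    size_t h = (size_t)ino & (TABLE_SIZE - 1);   /* trivial hash: low bits */
    while (used[h]) {
        if (table[h] == ino)
            return 0;                            /* duplicate inode */
        h = (h + 1) & (TABLE_SIZE - 1);          /* linear probe */
    }
    used[h] = 1;
    table[h] = ino;
    return 1;
}
```

With this, the duplicate check per file drops from an O(n) scan to an expected O(1) lookup: add the file's size to the total only when inode_insert() returns 1.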
But if you are willing to wait until all inodes are stored in the array, you can sort the array and then traverse it in order to find duplicates.
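The sort-then-scan idea above can be sketched with qsort: after sorting, duplicate inodes are adjacent, so one pass suffices, for O(n log n) overall. The function name is made up for the example.

```c
#include <stdlib.h>
#include <sys/types.h>

/* qsort comparator for ino_t values (avoids overflow from subtraction) */
static int cmp_ino(const void *a, const void *b)
{
    ino_t x = *(const ino_t *)a, y = *(const ino_t *)b;
    return (x > y) - (x < y);
}

/* Sorts the array in place, then counts distinct inodes by skipping
 * runs of equal neighbors. O(n log n) total. */
size_t count_unique(ino_t *inodes, size_t n)
{
    if (n == 0)
        return 0;
    qsort(inodes, n, sizeof(ino_t), cmp_ino);
    size_t uniq = 1;
    for (size_t i = 1; i < n; i++)
        if (inodes[i] != inodes[i - 1])
            uniq++;
    return uniq;
}
```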
EDIT:
Inodes contain a reference count, which is the number of hard links (exposed by stat as st_nlink). So you only need to check for duplicates among inodes whose reference count is greater than 1.
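A minimal sketch of that filter, with a hypothetical helper name: files with st_nlink == 1 can be added to the total immediately, and only the rest need the inode bookkeeping at all.

```c
#include <sys/stat.h>

/* Returns 1 if path might appear again under another name (more than
 * one hard link), i.e. its inode must be tracked for deduplication.
 * Returns 0 for ordinary single-link files and on stat() failure. */
int needs_dedup(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return 0;
    return st.st_nlink > 1;
}
```

Since most files in a typical tree have exactly one link, this keeps the deduplication structure (array, hash table, or sorted list) small regardless of which one you pick.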