I need to produce statistics for files that are stored on a Linux network share, and I would like to be able to run a shell script or program locally on the network share to produce data points with the following attributes:
path (or relativepath) | filename | filesize | datecreated | datechanged | dateaccessed
There are roughly 1–2 million files (8TB) and I want to explore the dataset to get a grasp of the organization and balance of the file types (as determined by a combination of file name and path) in relation to the total number of files and total amount of storage.
Questions:
What is an efficient way to traverse the file system and get this data?
What kind of database would you recommend to explore this kind of data with statistics at different levels in the hierarchy?
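For the traversal question, GNU find's `-printf` can emit all of these fields in a single pass. A minimal sketch, assuming GNU findutils and Unix-epoch timestamps (the function name and field order are my own choices):

```shell
# Sketch of the traversal step, assuming GNU findutils (for -printf).
# Emits one pipe-delimited record per file:
#   directory | filename | size-in-bytes | mtime | ctime | atime
# Note: most Linux filesystems do not expose a file-creation time here;
# %C@ is the inode status-change time, the nearest available substitute.
dump_stats() {
    find "$1" -type f -printf '%h|%f|%s|%T@|%C@|%A@\n'
}
```

For example, `dump_stats /mnt/share > files.txt` streams the records into a plain text file. Doing it in one `find` pass, rather than forking a `stat` process per file, matters at 1–2 million files.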
This is what I ended up using to solve the problem:
`find` and `fstat` were used to generate the dataset as a plain text file. The `pandas` and `exifread` libraries were used to enrich and analyze the dataset.
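A minimal sketch of the pandas side, assuming the dump is pipe-delimited in the column order from the question (the column names and the extension heuristic are my own assumptions):

```python
# Sketch: summarize a pipe-delimited file dump with pandas.
# Column names are assumptions matching the layout in the question.
import pandas as pd

COLS = ["path", "filename", "filesize", "mtime", "ctime", "atime"]

def summarize_by_ext(source):
    """Return per-extension file counts and total bytes, largest first."""
    df = pd.read_csv(source, sep="|", names=COLS)
    # crude file-type heuristic: lower-cased text after the last dot
    df["ext"] = df["filename"].str.rsplit(".", n=1).str[-1].str.lower()
    stats = df.groupby("ext")["filesize"].agg(count="count", total_bytes="sum")
    return stats.sort_values("total_bytes", ascending=False)
```

Grouping on path components instead of the extension (e.g. `df["path"].str.split("/").str[2]`) gives the same kind of breakdown at any level of the hierarchy, and 1–2 million rows of metadata fit comfortably in memory, which is why a full database never became necessary.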