Search code examples
filefilesystemschecksum

Are there any file systems that do not use file paths?


File paths are inherently dubious when working with data. Lets say I have a hypothetical situation with a program called find_brca, and some data called my.genome and both are in the /Users/Desktop/ directory.

find_brca takes a single argument, a genome, runs for about 4 hours, and returns the probability of that individual developing breast cancer in their lifetime. Some people, presented with a very high % probability, might then immediately have both of their breasts removed as a precaution.

Obviously, in this scenario, it is absolutely vital that /Users/Desktop/my.genome actually contains the genome we think it does. There are no do-overs. "oops we used an old version of the file from a previous backup" or any other technical issue will not be acceptable to the patient. How do we ensure we are analysing the file we think we are analysing?

To make matters trickier, lets also assert that we cannot modify find_brca itself, because we didn't write it, its closed source, proprietary, whatever.

You might think MD5 or other cryptographic checksums might be able to come to the rescue, and while they do help to a degree, you can only MD5 the file before and/or after find_brca has run, but you can never know exactly what data find_brca used (without doing some serious low-level system probing with DTrace/ptrace, etc).

The root of the problem is that file paths do not have a 1:1 relationship with actual data. Only in a filesystem where files can only be requested by their checksum - and as soon as the data is modified its checksum is modified - can we ensure that when we feed find_brca the genome's file path 4fded1464736e77865df232cbcb4cd19, we are actually reading the correct genome.

Are there any filesystems that work like this? If I wanted to create such a filesystem because none currently exists, how would you recommend I go about doing it?


Solution

  • I have my doubts about the stability, but hashfs looks exactly like what you want: http://hashfs.readthedocs.io/en/latest/

    HashFS is a content-addressable file management system. What does that mean? Simply, that HashFS manages a directory where files are saved based on the file’s hash. Typical use cases for this kind of system are ones where: Files are written once and never change (e.g. image storage). It’s desirable to have no duplicate files (e.g. user uploads). File metadata is stored elsewhere (e.g. in a database).

    Note: Not to be confused with the hashfs, a student of mine did a couple of years ago: http://dl.acm.org/citation.cfm?id=1849837