Search code examples
sqldatabasechecksumflat-file

What's the best way to represent an editable checksum index of a filesystem?


Given any file, I want to identify duplicate files with identical checksums. I simply want to print a textual list of the duplicates to my terminal, so running a full desktop-search indexer would be absolute overkill.

To do what I want, I need to associatively store two pieces of information:

  1. "[This file] has <XYZ checksum>", for any file

  2. "[Here are all the files] for <XYZ checksum>", for any checksum

The catch is that I need the database to be editable so that, as I move data around - and notify the script managing the DB - it can keep up with the changes. There are two approaches I've thought of to do this.

One would be to use an offset-based flatfile index, but I would need to implement a fragmentation scheme so I could extend directory representations with new files as data was moved around, without having to constantly bitshift the entire file forwards and backwards to accommodate the data restructuring.
While not impossible for me to do, this would be sufficiently nontrivial that I would have concerns about database consistency and integrity, and since I'd be making decisions about whether to delete data off the back of this information I'd be playing with such an implementation for a while before I settled with using it.

I have no experience with using traditional databases, but I'm imagining SQL can probably achieve what I'm trying to do with significantly more ease than implementing an entire storage framework myself. If this is an option, where would be a good place to start? I'd tentatively theorize that I could create two tables: the first would list each checksum in the first (primary) column and a NUL-separated string of each file this checksum matched in the 2nd column; the second table would list the full path to the file in the first/primary column, and its checksum in the second. Updates to this system would require I simply modify a column in two tables, and be significantly simpler/easier than the method suggested above.


Solution

  • You need a database table with 2 columns: File and ChecSum. File (presented as full path) is UNIQUE by nature and can be used as UNIQUE INDEX. You may still want to add ID field (integer code) as PRIMARY KEY, especially if you want to treat moving or renaming as a single operation and not to split it into delete followed by create.

    In MySQL (used as an example) you will have something like this (without ID column):

    DDL and DML:

    CREATE TABLE Files
        (`File` varchar(16), `CheckSum` int)
    ;
    
    ALTER TABLE Files ADD UNIQUE INDEX (File), ADD INDEX (CheckSum);
    
    INSERT INTO Files
        (`File`, `CheckSum`)
    VALUES
        ('\dir1\file1', 56789),
        ('\dir2\file2', 77777),
        ('\dir3\dir4\file9', 56789),
        ('\dirA\file1', 12345)
    ;
    

    DOL:

    -- All files
    SELECT * 
    FROM Files;
    
    -- All files with checksum = 56789
    SELECT * 
    FROM Files
    WHERE checksum = 56789;
    
    -- File name '\dirA\file1'
    SELECT *
    FROM Files 
    WHERE file = '\dirA\file1';
    

    SQL Fiddle with the above