
GridFS creates orphan chunks when unique index is used


When attempting to upload a file with a duplicate value in a metadata field that has a unique index on it, Spring Framework's GridFsTemplate throws a MongoWriteException (E11000 duplicate key error) but also leaves orphan chunk documents in the .chunks collection. The number of orphan chunk documents equals the number of chunk documents created when the file was uploaded the first time, which suggests that the whole file is uploaded before the unique index constraint is evaluated.
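For reference, here is a minimal sketch of the setup that triggers this, assuming the default fs bucket and a hypothetical metadata.docId field as the uniquely indexed value (both names are illustrative, not from my actual project):

```java
import java.io.InputStream;

import org.bson.Document;
import org.springframework.data.domain.Sort;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.index.Index;
import org.springframework.data.mongodb.gridfs.GridFsTemplate;

public class GridFsDuplicateRepro {

    private final GridFsTemplate gridFsTemplate;
    private final MongoTemplate mongoTemplate;

    public GridFsDuplicateRepro(GridFsTemplate gridFsTemplate, MongoTemplate mongoTemplate) {
        this.gridFsTemplate = gridFsTemplate;
        this.mongoTemplate = mongoTemplate;
    }

    public void reproduce(InputStream first, InputStream second) {
        // Manually add a unique index on a metadata field of the
        // driver-created fs.files collection.
        mongoTemplate.indexOps("fs.files")
                .ensureIndex(new Index("metadata.docId", Sort.Direction.ASC).unique());

        Document metadata = new Document("docId", "DOC-1");

        // First upload: chunks go into fs.chunks, then the file document
        // is inserted into fs.files. This succeeds.
        gridFsTemplate.store(first, "report.pdf", metadata);

        // Second upload with the same docId: every chunk is inserted again
        // before the fs.files insert fails with E11000, so those chunks
        // remain in fs.chunks as orphans.
        gridFsTemplate.store(second, "report.pdf", metadata);
    }
}
```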

I have not found any official explanation of this behavior of GridFS, so I think this may be a bug, as it causes needless bandwidth and storage waste that could easily be avoided.

Can anyone please help me confirm whether my conclusion is correct, or whether anything can be done to avoid the orphan chunk creation issue?


Solution

  • GridFS is provided by the driver to facilitate storing large files. When you instantiate a GridFS object, it creates two collections in the database, with names ending in .files and .chunks.

    It sets up these collections in a way that works correctly with the GridFS code.

    On the client side, GridFS splits the file into chunks (255 kB each by default; every chunk must fit under the 16 MB BSON document limit), then inserts each chunk into the .chunks collection. Only after that does it insert a single document into the .files collection.

    Doing it in this order will prevent other clients from seeing that file until it is fully uploaded, so you won't have any partial downloads due to an incomplete upload.

    The unique index you created on the .files collection will cause the server to return an error when the GridFS code in the client attempts to insert the file document. Note that this insert is only attempted after all of the chunks have been inserted, so the error does not prevent the chunk data from being written.

    This error is unexpected from the GridFS code's point of view: it would not occur with the collection as the code originally created it, so the code has no cleanup step for this failure and the chunks of the file are left behind in the database (a manual sweep for such orphans is sketched after this answer).

    You manually modified the programmatically created collection by adding a unique index. It should not be surprising that the code which created that collection no longer behaves the same as it did before.

    If you need to prevent duplicate files from being uploaded, you will need to devise a way to check for duplicates prior to uploading; one possible approach is sketched below.
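Since the GridFS code itself will not remove the orphaned chunks, one option is a periodic sweep that deletes chunks whose files_id has no matching fs.files document. This is a sketch using the plain MongoDB Java driver, assuming the default fs bucket; note that during a normal upload chunks legitimately exist before their file document, so such a sweep can race with uploads in progress and should only run when no upload is active (or only target sufficiently old chunks):

```java
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import org.bson.types.ObjectId;

public class OrphanChunkSweeper {

    public static long deleteOrphanChunks(MongoDatabase db) {
        MongoCollection<Document> files = db.getCollection("fs.files");
        MongoCollection<Document> chunks = db.getCollection("fs.chunks");
        long deleted = 0;

        // Walk every distinct files_id in fs.chunks and drop the chunks
        // whose parent file document does not exist in fs.files.
        for (ObjectId filesId : chunks.distinct("files_id", ObjectId.class)) {
            boolean parentExists =
                    files.find(Filters.eq("_id", filesId)).limit(1).first() != null;
            if (!parentExists) {
                deleted += chunks.deleteMany(Filters.eq("files_id", filesId))
                        .getDeletedCount();
            }
        }
        return deleted;
    }
}
```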
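As for checking prior to uploading, a pre-upload lookup with GridFsTemplate might look like the following (again assuming the illustrative metadata.docId field). The check on its own is racy under concurrent uploads, which is why the unique index remains useful as the final safety net, but it avoids writing chunks in the common duplicate case:

```java
import java.io.InputStream;

import org.bson.Document;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;
import org.springframework.data.mongodb.gridfs.GridFsTemplate;

public class DuplicateAwareUploader {

    private final GridFsTemplate gridFsTemplate;

    public DuplicateAwareUploader(GridFsTemplate gridFsTemplate) {
        this.gridFsTemplate = gridFsTemplate;
    }

    public void storeIfAbsent(InputStream content, String filename, String docId) {
        Query byDocId = Query.query(Criteria.where("metadata.docId").is(docId));

        // Check fs.files first so no chunks are written for a duplicate.
        if (gridFsTemplate.findOne(byDocId) != null) {
            throw new IllegalStateException("File with docId " + docId + " already exists");
        }
        gridFsTemplate.store(content, filename, new Document("docId", docId));
    }
}
```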