Search code examples
scalamongodbcasbah

Removing documents while preserving at least one


I have a MongoDB collection containing history data with id and timestamp.

I want to delete data from the collection older than a specific timestamp. But for every id at least one document (the newest) must stay in the collection.

Suppose I have the following documents in my collection ...

{"id" : "11", "timestamp" : ISODate("2011-09-09T10:27:34.785Z")} //1
{"id" : "11", "timestamp" : ISODate("2011-09-08T10:27:34.785Z")} //2

{"id" : "22", "timestamp" : ISODate("2011-09-05T10:27:34.785Z")} //3
{"id" : "22", "timestamp" : ISODate("2011-09-01T10:27:34.785Z")} //4

... and I want to delete documents having a timestamp older than 2011-09-07 then 1 and 2 should not be deleted because they are newer. 4 should be deleted because it is older, but 3 should not be deleted (although it is older) because at least one document per id should stay in the collection.

Does anyone know how I can do this with casbah and/or on the mongo console?

Regards, Christian


Solution

  • I can think of a couple of ways. First, try this:

    var cutoff = new ISODate("2011-09-07T00:00:00.000Z");
    db.testdata.find().forEach(function(data) {
        if (data.timestamp.valueOf() < cutoff.valueOf()) {
            // A candidate for deletion
            if (db.testdata.find({"id": data.id, "timestamp": { $gt: data.timestamp }}).count() > 0) {
                db.testdata.remove({"_id" : data._id});
             }
        }
    });
    

    This does the job you want. Or you can use a MapReduce job to do it as well. Load this into a text file:

    var map = function() {
        emit(this.id, {
            ref: this._id,
            timestamp: this.timestamp
        });
    };
    
    
    var reduce = function(key, values) {
        var cutoff = new ISODate("2011-09-07T00:00:00.000Z");
        var newest = null;
        var ref = null;
        var i;
        for (i = 0; i < values.length; ++i) {
            if (values[i].timestamp.valueOf() < cutoff.valueOf()) {
                // falls into the delete range
                if (ref == null) {
                    ref = values[i].ref;
                    newest = values[i].timestamp;
                } else if (values[i].timestamp.valueOf() > newest.valueOf()) {
                    // This one is newer than the one we are currently saving.
                    // delete ref
                    db.testdata.remove({_id : ref});
                    ref = values[i].ref;
                    newest = values[i].timestamp;
                } else {
                    // This one is older
                    // delete values[i].ref
                    db.testdata.remove({_id : values[i].ref});
                }
            } else if (ref == null) {
                ref = values[i].ref;
                newest = values[i].timestamp;
            }
        }
        return { ref: ref, timestamp: newest };
    };
    

    Load the above file into the shell: load("file.js");

    Then run it: db.testdata.mapReduce(map, reduce, {out: "results"});

    Then remove the mapReduce output: db.results.drop();