python linux mongodb mongodb-replica-set openedx

Open edX with split mongo consumes disk space


I am using Open edX, which stores courses in MongoDB. We are running a three-node replica set. It currently uses split mongo, a feature that keeps a copy of the current document (as a backup) before each edit. Over time these copies pile up and consume a large amount of disk space. There are currently around 30 courses, and an export of them takes around 2-3 GB. However, the disk space it is actually using is

I tried to clean the unwanted courses using this script

When I run this on the primary member, it takes some time and deletes all the unwanted documents, but it does not release the disk space.

rs0:SECONDARY> db.stats()
{
        "db" : "edxapp",
        "collections" : 5,
        "objects" : 277557,
        "avgObjSize" : 112645.21484235671,
        "dataSize" : 31265467896,
        "storageSize" : 57843929088,
        "numExtents" : 0,
        "indexes" : 6,
        "indexSize" : 6938624,
        "ok" : 1
}
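
As a rough reading of the db.stats() output above, the gap between storageSize and dataSize gives an idea of how much allocated space is not holding document data. The following is a minimal Python sketch using the figures as reported; the helper name is hypothetical, and with a compressing storage engine the two numbers are not directly comparable, so treat the result as an indication only.

```python
def unused_allocated_bytes(stats):
    """Return storageSize minus dataSize, floored at zero.

    `stats` is a dict shaped like the db.stats() output
    (only the two size fields are used).
    """
    return max(stats["storageSize"] - stats["dataSize"], 0)


# Figures copied from the db.stats() output in the question.
stats = {"dataSize": 31265467896, "storageSize": 57843929088}

gap = unused_allocated_bytes(stats)
print(f"{gap} bytes (~{gap / 2**30:.1f} GiB) allocated beyond document data")
```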


root@mongo:~# df
Filesystem     1K-blocks     Used Available Use% Mounted on
udev             4082828       12   4082816   1% /dev
tmpfs             817564      396    817168   1% /run
/dev/xvda1       8115168  1805528   5874364  24% /
none                   4        0         4   0% /sys/fs/cgroup
none                5120        0      5120   0% /run/lock
none             4087804        0   4087804   0% /run/shm
none              102400        0    102400   0% /run/user
/dev/xvdf       62904320 57542660   5361660  92% /edx
/dev/xvdh       72117576    53012  68378164   1% /tmp/repairdb

I tried to compact the DB using

rs0:SECONDARY> db.runCommand( { compact : 'modulestore.structures', force: 'true' } )
{ "ok" : 1 }

It didn't help either.

Could someone please let me know how to reclaim the disk space in such a situation? I want to do this on the production server as quickly as I can.


Solution

  • You need to run an initial sync on each node: one secondary at a time, and finally step down your primary and run an initial sync on it too.

    To do this, stop the secondary, remove all files from that node's dbPath, then start the node and let it perform the initial sync. Repeat this on every node.
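
    The ordering described above can be sketched as a small planning helper. This is an illustration only, not a tool that touches MongoDB: the function name, host names, and step wording are hypothetical, and the sketch assumes you wait for each node to return to SECONDARY state before moving on to the next one.

```python
def rolling_resync_plan(members):
    """Order the resync: every secondary first, the primary last.

    `members` is a list of (host, state) tuples, where state is
    "PRIMARY" or "SECONDARY". Returns a list of (host, action) tuples.
    """
    resync = ("stop mongod, remove all files from dbPath, "
              "start mongod, wait until the node is SECONDARY again")
    plan = []
    # Secondaries go first, one at a time, so the set keeps a majority.
    for host, state in members:
        if state == "SECONDARY":
            plan.append((host, resync))
    # The primary is stepped down last and then treated like a secondary.
    for host, state in members:
        if state == "PRIMARY":
            plan.append((host, "step down (rs.stepDown()), then " + resync))
    return plan


# Hypothetical three-node replica set.
members = [("mongo1", "PRIMARY"), ("mongo2", "SECONDARY"), ("mongo3", "SECONDARY")]
for host, action in rolling_resync_plan(members):
    print(f"{host}: {action}")
```

    Handling one node at a time matters in a three-node set: with only one member offline, the remaining two can still elect or keep a primary.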