Sometimes my pg_restore or mongorestore gets stuck

print 'restore db1'
run_command('pg_restore -i -h localhost -p 5432 -U postgres -d db1 -v "/var/lib/project/backup/db1.backup" -c')
print '--- wait 3 seconds'
time.sleep(3)

print 'restore db2'
run_command('pg_restore -i -h localhost -p 5432 -U postgres -d db2 -v "/var/lib/project/backup/db2.backup" -c')
print '--- wait 3 seconds'
time.sleep(3)

print 'restoring mongodb'
run_command('/var/lib/project/bbds/mongodb-linux/bin/mongorestore /var/lib/project/backup/dump --drop')
print '--- wait 3 seconds'
time.sleep(3)

My run_command is basically taken from this.

On my console:

python ignition.py --load-fixtures
For safety, this process will run for about one minute.
setup LDAP
restore db1
['pg_restore', '-i', '-h', 'localhost', '-p', '5432', '-U', 'postgres', '-d', 'db1', '-v', '/var/lib/project/backup/db1.backup', '-c']
Password:

It just gets stuck after I enter the password; nothing happens afterward. I am not sure whether a resource is busy. According to ps au|grep pg_restore, pg_restore is still running and mongorestore is not, so it must be stuck on the first restore. I don't think running out of memory is the cause either, because my virtual machine has only 512 MB, and it is always full and always using swap.

How can we find out what it is doing? Sometimes it gets stuck while restoring mongo instead, so any of these operations can cause the problem. How should I troubleshoot this?

Thanks.

When I killed the process, I got this traceback: http://pastebin.com/Cnv9P6HW A reboot "solves" the problem: afterwards I can run the script without trouble. But that's not stable; sooner or later we will end up in limbo again.

Solution

Your usage of pg_restore and mongorestore seems to be independent. I would try:

running each command directly from a command line (i.e. not via your Python run_command())
putting these into separate scripts so you can work out which command is hanging
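
One way to isolate the hanging command without editing the main script is a small standalone runner that times each step on its own. The following is only a sketch, assuming Python 3 (for subprocess.run with a timeout); run_step, run_all, STEPS and the timeout value are hypothetical names and numbers chosen for illustration, and the argument lists are copied from the question's console output.

```python
import subprocess
import time

# Argument lists copied from the question; adjust to your setup.
STEPS = [
    ("restore db1", ["pg_restore", "-i", "-h", "localhost", "-p", "5432",
                     "-U", "postgres", "-d", "db1", "-v",
                     "/var/lib/project/backup/db1.backup", "-c"]),
    ("restore db2", ["pg_restore", "-i", "-h", "localhost", "-p", "5432",
                     "-U", "postgres", "-d", "db2", "-v",
                     "/var/lib/project/backup/db2.backup", "-c"]),
    ("restore mongodb", ["/var/lib/project/bbds/mongodb-linux/bin/mongorestore",
                         "/var/lib/project/backup/dump", "--drop"]),
]


def run_step(name, argv, timeout_s):
    """Run one command and return (name, status, elapsed_seconds)."""
    start = time.time()
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # if it is still running after timeout_s seconds.
        proc = subprocess.run(argv, timeout=timeout_s)
        status = "exit %d" % proc.returncode
    except subprocess.TimeoutExpired:
        status = "timed out"
    except OSError as exc:
        status = "could not start: %s" % exc
    return name, status, round(time.time() - start, 1)


def run_all(steps, timeout_s):
    """Run every step in order, printing when each one starts and ends."""
    results = []
    for name, argv in steps:
        print("starting: %s" % name)
        result = run_step(name, argv, timeout_s)
        print("finished: %s -> %s after %ss" % result)
        results.append(result)
    return results
```

Calling run_all(STEPS, 600) would run the three restores one after another; a step reported as "timed out" is the one that hung. The 600-second limit is an arbitrary assumption and should be longer than a normal restore takes on your VM.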

Your mention of only 512MB RAM and always using swap suggests there isn't enough free memory on this VM. What exactly does "stuck" mean? If the VM actually gets wedged and needs to be restarted, it has most likely run out of both RAM and swap. If you can still log in but the restore script appears to be running longer than expected, I would try to understand the resource usage using performance monitoring tools like iostat and vmstat.
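
If you would rather log memory pressure from a script while the restore runs, the same numbers can be read from /proc/meminfo on Linux. This is a minimal sketch, not a replacement for vmstat or iostat; the function names parse_meminfo, memory_pressure and sample are hypothetical, and it assumes a Linux guest where /proc/meminfo exists.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: value_in_kB}."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def memory_pressure(info):
    """Summarise available RAM and swap in whole megabytes."""
    # Older kernels have no MemAvailable field; fall back to MemFree.
    available_kb = info.get("MemAvailable", info.get("MemFree", 0))
    return {
        "ram_available_mb": available_kb // 1024,
        "swap_free_mb": info.get("SwapFree", 0) // 1024,
        "swap_total_mb": info.get("SwapTotal", 0) // 1024,
    }


def sample(path="/proc/meminfo"):
    """Read the current memory summary from the running system."""
    with open(path) as handle:
        return memory_pressure(parse_meminfo(handle.read()))
```

Printing sample() every few seconds from a second terminal while the restore runs would show whether ram_available_mb and swap_free_mb both fall toward zero just before the hang, which would support the out-of-memory explanation.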