python mongodb pymongo hpc slurm

How to access and query mongodb on HPC


I would like to parallelise queries to a MongoDB database, using pymongo. I am using an HPC system, which uses Slurm as the workload manager. I have a setup which works fine on a single node, but fails when the tasks are spread across more than one node.

I know that the problem is that mongod is bound to localhost on the node I start it on, and therefore the other nodes can't connect to it.

I specifically would like to know how to start and then connect to the mongodb server when using multiple HPC nodes. Thanks!

Some extra details:

Before starting my Python script, I start the MongoDB server like this:

numactl --interleave=all mongod --dbpath=database &

And I get the warning message:

** WARNING: This server is bound to localhost.
**          Remote systems will be unable to connect to this server. 
**          Start the server with --bind_ip <address> to specify which IP 
**          addresses it should serve responses from, or with --bind_ip_all to
**          bind to all interfaces. If this behavior is desired, start the
**          server with --bind_ip 127.0.0.1 to disable this warning.

In my python script, I have a worker function which is run by each processor. It is basically structured like this:

def worker(args):
    cl = pymongo.MongoClient()
    db = cl.mydb
    collection = db['mycol']
    query = {}
    result = collection.find_one(query)
    # now do some work...

Solution

  • The warning message mentions --bind_ip <address>. The simplest way to find the IP address of a compute node is the hostname -i command. So in your submission script, try

    numactl --interleave=all mongod --dbpath=database --bind_ip $(hostname -i) &
    

    But then, your Python script must also know the IP address of the node on which MongoDB is running:

    def worker(args):
        cl = pymongo.MongoClient(host=<IP of MongoDB Server>)
        db = cl.mydb
        collection = db['mycol']
        query = {}
        result = collection.find_one(query)
        # now do some work...
    

    You will need to adapt the <IP of MongoDB Server> part depending on how you want to pass the information to the Python script. It can be through a command-line parameter, through the environment, through a file, etc.

    Remember to use srun to launch the Python script on all nodes of the allocation; otherwise you will need to implement that distribution in the Python script itself.

    Also consider changing MongoDB's port from the default (27017) from job to job, using mongod's --port option, to avoid interference if several such jobs run at the same time. The client must then connect to the same port.
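As one possible way to pass the server address through the environment, here is a minimal sketch. It assumes the submission script exports two variables before calling srun, for example MONGO_HOST=$(hostname -i) and MONGO_PORT set to whatever was given to mongod --port. Both variable names are hypothetical, as is the port_for_job helper, which only illustrates deriving a per-job port from SLURM_JOB_ID; if you rely on it, mongod has to be started with the same port.

```python
import os

DEFAULT_PORT = 27017  # MongoDB's default port


def port_for_job(job_id, base=DEFAULT_PORT, span=1000):
    """Hypothetical helper: map a Slurm job ID to a port in [base, base + span)."""
    return base + int(job_id) % span


def mongo_address(env=os.environ):
    """Return (host, port) of the MongoDB server.

    MONGO_HOST and MONGO_PORT are assumed to be exported by the
    submission script; SLURM_JOB_ID is set by Slurm inside a job.
    """
    host = env.get("MONGO_HOST", "localhost")
    if "MONGO_PORT" in env:
        port = int(env["MONGO_PORT"])
    elif "SLURM_JOB_ID" in env:
        port = port_for_job(env["SLURM_JOB_ID"])
    else:
        port = DEFAULT_PORT
    return host, port


def worker(args):
    # Imported here so the helpers above can be used without pymongo installed.
    import pymongo

    host, port = mongo_address()
    cl = pymongo.MongoClient(host=host, port=port)
    collection = cl.mydb["mycol"]
    result = collection.find_one({})
    # now do some work...
    return result
```

Reading the address inside worker keeps each process independent: every task started by srun resolves the server from its own environment, so nothing needs to be shared between nodes besides the exported variables.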