How to get memory usage information using Slurm C API?


I am looking for a way to get per-job memory usage information from Slurm using the C API, namely memory used and memory reserved. I thought I could get these stats by calling slurm_load_jobs(…), but looking at the job_step_info_t type definition I could not see any relevant fields. Perhaps there is something in job_resrcs, but it is an opaque data type and I have no idea how to use it. Or is there another API call that would give me detailed memory usage info? Please advise.


Solution

  • This question was partially answered in this SO thread, where the focus was only on the compiler errors. The missing piece of code was the loop over the memory_allocated and memory_used arrays, which are sized according to the number of hosts the job was dispatched to:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include "slurm/slurm.h"
    #include "slurm/slurm_errno.h"
    
    
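    /* job_resources is opaque in the public API. This definition is copied
     * from Slurm's internal header (src/common/job_resources.h) and must
     * match the installed Slurm version, otherwise the field offsets are wrong. */
    struct job_resources {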
            bitstr_t *core_bitmap;
            bitstr_t *core_bitmap_used;
            uint32_t  cpu_array_cnt;
            uint16_t *cpu_array_value;
            uint32_t *cpu_array_reps;
            uint16_t *cpus;
            uint16_t *cpus_used;
            uint16_t *cores_per_socket;
            uint64_t *memory_allocated;
            uint64_t *memory_used;
            uint32_t  nhosts;
            bitstr_t *node_bitmap;
            uint32_t  node_req;
            char     *nodes;
            uint32_t  ncpus;
            uint32_t *sock_core_rep_count;
            uint16_t *sockets_per_node;
            uint16_t *tasks_per_node;
            uint8_t   whole_node;
    
    };
    
    int main(int argc, char** argv)
    {
            int i, j, slurm_err;
            uint64_t mem_alloc, mem_used;
            job_info_msg_t *jobs;
    
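            /* Load job info from Slurm; an update_time of 0 requests all records */
            slurm_err = slurm_load_jobs((time_t) 0, &jobs, SHOW_DETAIL);
            if (slurm_err != SLURM_SUCCESS)
            {
                    slurm_perror("slurm_load_jobs");
                    return 1;
            }
            /* Print jobs info to stdout in CSV format */
            printf("job_id,cluster,partition,user_id,name,job_state,mem_allocated,mem_used\n");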
            for (i = 0; i < jobs->record_count; i++)
            {
                    mem_alloc = 0;
                    mem_used = 0;
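                    /* job_resrcs may be NULL when the job holds no allocation (e.g. pending) */
                    if (jobs->job_array[i].job_resrcs != NULL)
                    {
                            for (j = 0; j < jobs->job_array[i].job_resrcs->nhosts; j++)
                            {
                                    mem_alloc += jobs->job_array[i].job_resrcs->memory_allocated[j];
                                    /* index by host j; a fixed index 0 would count only the first host */
                                    mem_used  += jobs->job_array[i].job_resrcs->memory_used[j];
                            }
                    }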
                    /* IDs and state are unsigned; the memory totals are 64-bit */
                    printf("%u,%s,%s,%u,%s,%u,%llu,%llu\n",
                            jobs->job_array[i].job_id,
                            jobs->job_array[i].cluster,
                            jobs->job_array[i].partition,
                            jobs->job_array[i].user_id,
                            jobs->job_array[i].name,
                            jobs->job_array[i].job_state,
                            (unsigned long long) mem_alloc,
                            (unsigned long long) mem_used
                    );
            }
            slurm_free_job_info_msg(jobs);
            return 0;
    }
    

    This program compiles and runs without errors. One thing I noticed, though, is that mem_used is either 0 or equal to mem_alloc, which sometimes differs from what the sstat command reports. I will have to investigate this further...
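The per-host summation used in the program can also be pulled out into a small helper that compiles and can be checked without a Slurm installation. This is only a sketch: the name `sum_per_host` is made up for illustration and is not part of the Slurm API. It assumes the array holds one `uint64_t` entry per host, as in the struct definition shown in the answer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a Slurm function): sum a per-host array
 * that holds nhosts entries. Returns 0 when the array pointer is NULL,
 * so the caller does not need a separate check. */
uint64_t sum_per_host(const uint64_t *values, uint32_t nhosts)
{
    uint64_t total = 0;
    uint32_t h;

    if (values == NULL)
        return 0;
    for (h = 0; h < nhosts; h++)
        total += values[h];   /* index by host, not by a fixed element */
    return total;
}
```

With a non-NULL `job_resrcs` pointer `res`, the two totals would then be `sum_per_host(res->memory_allocated, res->nhosts)` and `sum_per_host(res->memory_used, res->nhosts)`.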