I am looking for a way to get per-job memory usage information from Slurm using the C API, namely memory used and memory reserved. I thought I could get such stats by calling slurm_load_jobs(…), but looking at the job_step_info_t type definition I could not see any relevant fields. Perhaps there is something in job_resrcs, but it is an opaque data type and I have no idea how to use it. Or is there another API call that would give me detailed memory usage info? Please advise.
This question was partially answered in this SO thread, where the focus was only on the compiler errors. The missing portion of code was the loop through the memory_allocated and memory_used arrays, which are sized according to the number of hosts the job was dispatched to:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "slurm/slurm.h"
#include "slurm/slurm_errno.h"
/*
 * Local copy of Slurm's internal struct job_resources, which slurm.h
 * exposes only as an opaque type. The field layout must match the
 * installed Slurm version.
 */
struct job_resources {
    bitstr_t *core_bitmap;
    bitstr_t *core_bitmap_used;
    uint32_t cpu_array_cnt;
    uint16_t *cpu_array_value;
    uint32_t *cpu_array_reps;
    uint16_t *cpus;
    uint16_t *cpus_used;
    uint16_t *cores_per_socket;
    uint64_t *memory_allocated;
    uint64_t *memory_used;
    uint32_t nhosts;
    bitstr_t *node_bitmap;
    uint32_t node_req;
    char *nodes;
    uint32_t ncpus;
    uint32_t *sock_core_rep_count;
    uint16_t *sockets_per_node;
    uint16_t *tasks_per_node;
    uint8_t whole_node;
};
int main(int argc, char **argv)
{
    uint32_t i, j;
    uint64_t mem_alloc, mem_used;
    job_info_msg_t *jobs = NULL;
    struct job_resources *res;

    /* Load job info from Slurm; stop if the query fails */
    if (slurm_load_jobs((time_t) 0, &jobs, SHOW_DETAIL) != SLURM_SUCCESS)
    {
        slurm_perror("slurm_load_jobs");
        return 1;
    }

    /* Print jobs info to stdout in CSV format */
    printf("job_id,cluster,partition,user_id,name,job_state,mem_allocated,mem_used\n");
    for (i = 0; i < jobs->record_count; i++)
    {
        mem_alloc = 0;
        mem_used = 0;
        res = jobs->job_array[i].job_resrcs;

        /* job_resrcs may be NULL, e.g. for a job with no allocation yet */
        for (j = 0; res != NULL && j < res->nhosts; j++)
        {
            /* Sum the per-host entries; index both arrays by host j */
            if (res->memory_allocated != NULL)
                mem_alloc += res->memory_allocated[j];
            if (res->memory_used != NULL)
                mem_used += res->memory_used[j];
        }

        /* IDs and state are unsigned; the 64-bit sums are cast for %llu */
        printf("%u,%s,%s,%u,%s,%u,%llu,%llu\n",
               (unsigned) jobs->job_array[i].job_id,
               jobs->job_array[i].cluster,
               jobs->job_array[i].partition,
               (unsigned) jobs->job_array[i].user_id,
               jobs->job_array[i].name,
               (unsigned) jobs->job_array[i].job_state,
               (unsigned long long) mem_alloc,
               (unsigned long long) mem_used);
    }

    slurm_free_job_info_msg(jobs);
    return 0;
}
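The per-host summation in the loop above can also be factored into a small helper, which keeps the NULL handling and the indexing in one place. This is only a sketch: sum_per_host is a hypothetical name, not part of the Slurm API, and it assumes each array holds exactly nhosts entries.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the Slurm API): sum a per-host array
 * with nhosts entries. A NULL array is treated as "no data" and yields 0.
 */
static uint64_t sum_per_host(const uint64_t *values, uint32_t nhosts)
{
    uint64_t total = 0;
    uint32_t j;

    if (values == NULL)
        return 0;

    for (j = 0; j < nhosts; j++)
        total += values[j];   /* one entry per host */

    return total;
}
```

Inside the job loop it would be called once per array, for example sum_per_host(res->memory_allocated, res->nhosts), after checking that res itself is not NULL.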
The full program above compiles and runs without errors. One thing I noticed, though, is that mem_used is either 0 or equal to mem_alloc, which sometimes differs from what I get from the sstat command. I will have to investigate this further...
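When comparing these numbers with sstat, the units have to match first. The sketch below rests on two assumptions that should be checked against the installed Slurm version: that the job_resources arrays hold megabytes per node, and that sstat prints memory fields such as MaxRSS as a number followed by a K, M, G or T suffix. size_to_mb is a hypothetical helper name, and treating a bare number as bytes is a guess as well.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: convert a size string such as "2048K" or "1.5G"
 * to megabytes. Assumes binary multiples (1M = 1024K) and treats a bare
 * number as bytes. Returns -1.0 if the text cannot be parsed.
 */
static double size_to_mb(const char *text)
{
    char *end;
    double value;

    if (text == NULL)
        return -1.0;

    value = strtod(text, &end);
    if (end == text || value < 0)
        return -1.0;          /* no digits found, or negative size */

    switch (*end) {
    case 'K': case 'k': return value / 1024.0;
    case 'M': case 'm': return value;
    case 'G': case 'g': return value * 1024.0;
    case 'T': case 't': return value * 1024.0 * 1024.0;
    case '\0':          return value / (1024.0 * 1024.0);
    default:            return -1.0;   /* unknown suffix */
    }
}
```

With a conversion like this, a value read from sstat output can be put next to mem_used from the program above in the same unit before deciding whether the two really disagree.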