julia mpi distributed-computing

Julia Distributed, redundant iterations appearing


I ran

mpiexec -n $nprocs julia --project myfile.jl 

on a cluster, where myfile.jl has the following form

using Distributed; using Dates; using JLD2;  using LaTeXStrings
@everywhere begin
using SharedArrays; using QuantumOptics; using LinearAlgebra; using Plots; using Statistics; using DifferentialEquations; using StaticArrays
#Defining some other functions and SharedArrays to be used later e.g.
MySharedArray=SharedArray{SVector{Nt,Float64}}(Np,Np)
end
@sync @distributed for pp in 1:Np^2
    for jj in 1:Nj
        # do some stuff with local variables
        for tt in 1:Nt
            # do some stuff with local variables
        end
    end
    MySharedArray[pp] = ... # using linear indexing
    println("$pp finished")
end

timestr=Dates.format(Dates.now(), "yyyy-mm-dd-HH:MM:SS")
filename="MyName"*timestr

@save filename*".jld2"

#later on, some other small stuff like making and saving a figure. (This does give an error "no method matching heatmap_edges(::Surface{Array{Float64,2}}, ::Symbol)" but I think that this is a technical thing about Plots so not very related to the bigger issue here)

However, the output shows a few issues that make me conclude that something is wrong:

  • The "$pp finished" line is printed many times for each value of pp. The number of repetitions appears to equal $nprocs, i.e. 32.
  • Although the code has not finished, "MyName" files are already generated. There should be only one, but I get a dozen of them, each with a different timestr component.

EDIT: two more things that I can add:

  • The contents of the different "MyName" files are not identical, but this is expected since random numbers are used in the inner loops. There are 28 of them, a number that I don't readily recognize, except that it is again close to the 32 of $nprocs.
  • Earlier, I wrote that the walltime was exceeded, but this turns out not to be true. The .o file ends with "BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES ... EXIT CODE :9", shortly after the last output file is written.

$nprocs is obtained in the PBS script through

#PBS -l select=1:ncpus=32:mpiprocs=32
nprocs=`cat $PBS_NODEFILE | wc -l`

Solution

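  • mpiexec -n $nprocs starts $nprocs independent copies of the script, and Distributed is not aware of MPI, so every copy runs the whole loop on its own and saves its own file. As pointed out by adamslc on the Julia Discourse, the proper way to use Julia on a cluster is to either

    • Start a session with one core from the job script and add more workers with addprocs() in the Julia script itself
    • Use more specialized Julia packages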

    https://discourse.julialang.org/t/julia-distributed-redundant-iterations-appearing/57682/3
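
The first option could look roughly like the sketch below. It is not the code from the linked thread, just a minimal illustration under some assumptions: the job script launches the file once (plain julia --project myfile.jl, no mpiexec), all cores are on a single node (SharedArrays only work within one host), and the worker count, the array name results and the pp^2 placeholder are made up for the example.

```julia
using Distributed

# Assumption: this file is started exactly once by the job script, without mpiexec.
# The worker count is hard-coded for illustration; match it to your allocation.
n_workers = 4
addprocs(n_workers)

# Load packages on every process, after the workers have been added.
@everywhere using SharedArrays

# Create the shared array once, on the master process only
# (not inside an @everywhere block).
Np = 4
results = SharedArray{Float64}(Np, Np)

# The iterations are split among the workers, so each pp is handled exactly once.
@sync @distributed for pp in 1:Np^2
    results[pp] = pp^2   # hypothetical placeholder for the real computation
    println("$pp finished on process $(myid())")
end

println("workers: ", nworkers(), ", sum: ", sum(results))
```

With this layout each "$pp finished" line should appear once, and any code after the loop (such as saving a file) runs only on the master process.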