Search code examples
linuxcentoscluster-computinghpccondor

I could not submit a job to the executing node in condor apart from the central manager


I have a condor pool which consist of 4 dedicated machine one is set as a centeral manager, submitting, and executing node while the other three is set to be executing nodes I used CentOS 5.4 as an OS for all the machines. My problem is when I submitted a job from the central manager it works just on the central manager so when I specify in the JDL file that the job should run in any machine apart from the central manager the job stay in hold and does not run. When I type condor_status all nodes appear. I keep the daemon MASTER, STARTD in the daemon list for the executing nodes. Does any one come across this problem?


Solution

  • There's not enough information to answer your question, but the first thing to do is to run condor_q -analyze <jobid> and see what it tells you. See the Condor manual Section 2.6.5: Why is the job not running?

    One possible cause is that you're not telling Condor to transfer your input/output files for you, and your nodes have different "filesystem domains", so Condor is unable to find a host which shares a common filesystem with your submit host.