Jobs not allocated in the Endeavour cluster

I’m trying to use GPU nodes in the ISI partition of the Endeavour cluster.

GPU nodes are shown as available (idle) when I run the sinfo command, but I can’t get any jobs allocated with sbatch, srun, or salloc. Here is the output of sinfo --partition=isi -o "%A %N %G":

NODES(A/I) NODELIST GRES
1/0 a11-[01-03] gpu:a40:8(S:0-1)
1/5 d23-[01-06] gpu:p100:2(S:0-1)
0/2 d14-[01-02] gpu:v100:2(S:0-1)
0/5 e09-[13-17] gpu:k80:2(S:0)

Why am I not able to get these compute resources allocated when they are available?

Hi,

It looks like all jobs are running normally right now, so I believe this is resolved. Let me know if that’s not the case. As for why this happened in the first place, it’s possible that other jobs with higher priority were queued, but it’s hard to say since it’s been a few days.

-Cesar

Hi Cesar,

Yes, it seems to have been resolved, but there wasn’t any notice of when and why this issue started or when and how it got resolved. I made sure that no other jobs were queued, and the GPUs I was requesting were clearly idle according to the sinfo --partition=isi command. My colleagues reported the same problem. Can we investigate why this happened, so that it doesn’t happen again or so that we can resolve it quickly if it does? I believe this took more than a few days to get fixed (granted, these days included the weekend). If it helps at all, I checked the compute system status page and saw that none of the nodes in the isi partition have any status shown, even those that I know have jobs running on them according to the squeue --partition=isi command. Let me know if I can help in any way.

Also, the problem with the down a11 nodes is still there. It seems that only a11-02 is up, while the other two, a11-01 and a11-03, each with 8 A40 GPUs, are down. Please let us know if there are any updates on this.
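
For what it’s worth, the sinfo output quoted earlier in this thread is consistent with that: the %A column only counts allocated and idle nodes, so any node in the list that is in neither state shows up as a gap. Here is a rough Python sketch that makes the gap explicit. The function names are made up for illustration, and the node list parsing assumes the simple prefix[lo-hi] form seen in that output, not everything Slurm can print.

```python
import re

# Output copied from the sinfo call quoted earlier in the thread.
SINFO_OUTPUT = """\
NODES(A/I) NODELIST GRES
1/0 a11-[01-03] gpu:a40:8(S:0-1)
1/5 d23-[01-06] gpu:p100:2(S:0-1)
0/2 d14-[01-02] gpu:v100:2(S:0-1)
0/5 e09-[13-17] gpu:k80:2(S:0)
"""


def count_nodes(nodelist):
    """Count nodes in a simple 'prefix[lo-hi]' list.

    Assumption: anything that does not match that form is one node.
    Comma-separated or nested lists are not handled here.
    """
    match = re.fullmatch(r"[^\[\]]+\[(\d+)-(\d+)\]", nodelist)
    if match is None:
        return 1
    low, high = int(match.group(1)), int(match.group(2))
    return high - low + 1


def unaccounted_nodes(sinfo_text):
    """Map each node list to the number of nodes neither allocated nor idle."""
    result = {}
    # Skip the header line, then split each row into its three columns.
    for line in sinfo_text.strip().splitlines()[1:]:
        counts, nodelist, _gres = line.split()
        allocated, idle = (int(x) for x in counts.split("/"))
        result[nodelist] = count_nodes(nodelist) - allocated - idle
    return result


if __name__ == "__main__":
    for nodelist, missing in unaccounted_nodes(SINFO_OUTPUT).items():
        print(f"{nodelist}: {missing} node(s) neither allocated nor idle")
```

Only the a11 row comes out with a nonzero gap, which matches two of the three a11 nodes being unavailable.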

Thank you, and I really appreciate all the support that CARC provides.

Best,

Justin Cho

Does CARC now use fairshare or some other dynamic priority algorithm (such that priority gets worse as you use more resources and better as time passes without you using them)? A long time ago HPC did, but recently it appeared to be simple FCFS.
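
For anyone unfamiliar with the kind of dynamic priority this question describes, here is a toy Python sketch of the idea: past usage decays with a half-life, and the priority factor drops as decayed usage grows relative to the user’s share. The exponential shape is loosely modeled on the classic fair-share formula in the Slurm documentation, but the function names, units, and numbers are invented for illustration, and nothing here describes how CARC actually configures its scheduler.

```python
def decayed_usage(usage_events, now, half_life):
    """Sum past usage, halving each event's weight every `half_life` time units.

    `usage_events` is a list of (time, amount) pairs (hypothetical units).
    """
    return sum(
        amount * 0.5 ** ((now - when) / half_life)
        for when, amount in usage_events
    )


def fairshare_factor(usage, share):
    """Return a factor in (0, 1]: 1.0 with no usage, 0.5 when usage equals share."""
    return 2.0 ** (-usage / share)


if __name__ == "__main__":
    events = [(0, 8.0)]  # one burst of usage at time 0
    share = 4.0          # this user's allotted share, in the same units
    for now in (0, 7, 14):
        usage = decayed_usage(events, now=now, half_life=7)
        print(now, fairshare_factor(usage, share))
```

In this model the factor is lowest right after the burst of usage and climbs back toward 1.0 as the usage decays, whereas plain FCFS would ignore usage history entirely.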