Jobs not allocated in the Endeavour cluster

hjcho · March 29, 2022, 6:32pm

Hi Cesar,

Yes, it seems to have been resolved, but there wasn’t any notice of when and why this issue started or when and how it got resolved. I made sure that no other jobs were queuing, and the sinfo --partition=isi command showed clearly that the GPUs I was requesting were idle. My colleagues reported the same problem. Can we investigate why this happened, so that it does not happen again or so that we can resolve it quickly the next time it does? I believe this took more than a few days to get fixed (granted, these days included the weekend). If it helps at all, I checked the compute system status page and saw that none of the nodes in the isi partition seem to have any status shown, even those that I know have jobs running on them according to the squeue --partition=isi command. Let me know if I can help in any way to this end.

Also, the problem with the a11 nodes being down is still there. It seems that only a11-02 is up, while the other two, a11-01 and a11-03, each with 8 A40 GPUs, are down. Please let us know if there are any updates on this.

Thank you, and I really appreciate all the support that CARC provides.

Best,

Justin Cho