Discovery: Slurm job logs are missing when using the epyc-64 partition

I am submitting an array job with indices 0-154 on the Discovery cluster.
I am noticing that a few log files are missing.
For example, the log files for 0, 1, 2, 3 are missing out of the 154 log files.
All the jobs are running; none of them are in the pending state.

I am also noticing that certain jobs in the array appear to be stuck. These jobs typically take about 200 seconds to run,
but I see some of them taking more than 20 minutes.
When I run these jobs manually, they take only the expected 200 seconds.
Is there anything wrong with certain nodes?

10222509_0 epyc-64 filter.j arunbaal R 12:24 1 a03-19
10222509_1 epyc-64 filter.j arunbaal R 12:24 1 a03-19

(base) [arunbaal@discovery2 scripts]$ ls slurm-10222509_*
slurm-10222509_100.out slurm-10222509_129.out slurm-10222509_17.out slurm-10222509_46.out slurm-10222509_74.out
slurm-10222509_101.out slurm-10222509_12.out slurm-10222509_18.out slurm-10222509_47.out slurm-10222509_75.out
slurm-10222509_102.out slurm-10222509_130.out slurm-10222509_19.out slurm-10222509_48.out slurm-10222509_76.out
slurm-10222509_103.out slurm-10222509_131.out slurm-10222509_20.out slurm-10222509_49.out slurm-10222509_77.out
slurm-10222509_104.out slurm-10222509_132.out slurm-10222509_21.out slurm-10222509_4.out slurm-10222509_78.out
slurm-10222509_105.out slurm-10222509_133.out slurm-10222509_22.out slurm-10222509_50.out slurm-10222509_79.out
slurm-10222509_106.out slurm-10222509_134.out slurm-10222509_23.out slurm-10222509_51.out slurm-10222509_7.out
slurm-10222509_107.out slurm-10222509_135.out slurm-10222509_24.out slurm-10222509_52.out slurm-10222509_80.out
slurm-10222509_108.out slurm-10222509_136.out slurm-10222509_25.out slurm-10222509_53.out slurm-10222509_81.out
slurm-10222509_109.out slurm-10222509_137.out slurm-10222509_26.out slurm-10222509_54.out slurm-10222509_82.out
slurm-10222509_10.out slurm-10222509_138.out slurm-10222509_27.out slurm-10222509_55.out slurm-10222509_83.out
slurm-10222509_110.out slurm-10222509_139.out slurm-10222509_28.out slurm-10222509_56.out slurm-10222509_84.out
slurm-10222509_111.out slurm-10222509_13.out slurm-10222509_29.out slurm-10222509_57.out slurm-10222509_85.out
slurm-10222509_112.out slurm-10222509_140.out slurm-10222509_2.out slurm-10222509_58.out slurm-10222509_86.out
slurm-10222509_113.out slurm-10222509_141.out slurm-10222509_30.out slurm-10222509_59.out slurm-10222509_87.out
slurm-10222509_114.out slurm-10222509_142.out slurm-10222509_31.out slurm-10222509_5.out slurm-10222509_88.out
slurm-10222509_115.out slurm-10222509_143.out slurm-10222509_32.out slurm-10222509_60.out slurm-10222509_89.out
slurm-10222509_116.out slurm-10222509_144.out slurm-10222509_33.out slurm-10222509_61.out slurm-10222509_8.out
slurm-10222509_117.out slurm-10222509_145.out slurm-10222509_34.out slurm-10222509_62.out slurm-10222509_90.out
slurm-10222509_118.out slurm-10222509_146.out slurm-10222509_35.out slurm-10222509_63.out slurm-10222509_91.out
slurm-10222509_119.out slurm-10222509_147.out slurm-10222509_36.out slurm-10222509_64.out slurm-10222509_92.out
slurm-10222509_11.out slurm-10222509_148.out slurm-10222509_37.out slurm-10222509_65.out slurm-10222509_93.out
slurm-10222509_120.out slurm-10222509_149.out slurm-10222509_38.out slurm-10222509_66.out slurm-10222509_94.out
slurm-10222509_121.out slurm-10222509_14.out slurm-10222509_39.out slurm-10222509_67.out slurm-10222509_95.out
slurm-10222509_122.out slurm-10222509_150.out slurm-10222509_3.out slurm-10222509_68.out slurm-10222509_96.out
slurm-10222509_123.out slurm-10222509_151.out slurm-10222509_40.out slurm-10222509_69.out slurm-10222509_97.out
slurm-10222509_124.out slurm-10222509_152.out slurm-10222509_41.out slurm-10222509_6.out slurm-10222509_98.out
slurm-10222509_125.out slurm-10222509_153.out slurm-10222509_42.out slurm-10222509_70.out slurm-10222509_99.out
slurm-10222509_126.out slurm-10222509_154.out slurm-10222509_43.out slurm-10222509_71.out slurm-10222509_9.out
slurm-10222509_127.out slurm-10222509_15.out slurm-10222509_44.out slurm-10222509_72.out
slurm-10222509_128.out slurm-10222509_16.out slurm-10222509_45.out slurm-10222509_73.out
(base) [arunbaal@discovery2 scripts]$ ls slurm-10222509_* | grep slurm_1022509
(base) [arunbaal@discovery2 scripts]$ ls slurm-10222509_* | grep "0.out"
(base) [arunbaal@discovery2 scripts]$ ls slurm-10222509* | grep "10.out"
slurm-10222509_10.out
(base) [arunbaal@discovery2 scripts]$ ls slurm-10222509* | grep "0.out"
(base) [arunbaal@discovery2 scripts]$ ls slurm-10222509* | grep "_1.out"

Hi there,

It looks like only jobs 0 and 1 failed.

$ sacct --format="JobId,JobName%15,ReqTres,Start,Elapsed,State" -u arunbaal -j 10222509 -S 2022-07-24 
JobID                JobName    ReqTRES               Start    Elapsed      State 
------------ --------------- ---------- ------------------- ---------- ---------- 
10222509_0        filter.job billing=1+ 2022-07-27T19:05:44   00:12:28     FAILED 
10222509_0.+           batch            2022-07-27T19:05:44   00:12:28     FAILED 
10222509_0.+          extern            2022-07-27T19:05:44   00:12:29  COMPLETED 
10222509_1        filter.job billing=1+ 2022-07-27T19:05:44   00:12:28     FAILED 
10222509_1.+           batch            2022-07-27T19:05:44   00:12:28     FAILED 
10222509_1.+          extern            2022-07-27T19:05:44   00:12:29  COMPLETED 
10222509_2        filter.job billing=1+ 2022-07-27T19:05:44   00:03:31  COMPLETED 
.
.
.
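
If the full sacct listing is long, the failed array tasks can be pulled out with a small filter. This is only a sketch: the function name `failed_tasks` is made up for illustration, and it assumes the column layout shown above (JobID first, State last).

```shell
# Print the array task IDs whose top-level job record is FAILED.
# Reads sacct text output on stdin. Step records such as "10222509_0.+"
# are skipped because only plain "<jobid>_<index>" IDs are matched.
failed_tasks() {
    awk '$1 ~ /^[0-9]+_[0-9]+$/ && $NF == "FAILED" {
        split($1, parts, "_")
        print parts[2]
    }'
}

# Example usage (needs Slurm, so it is left commented out):
# sacct -j 10222509 --format="JobId,JobName%15,ReqTres,Start,Elapsed,State" | failed_tasks
```

Matching only the top-level records keeps each failed task from being reported twice through its batch step.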

Not sure why you aren’t seeing log files though. You can always try resubmitting those jobs individually and see if they work.
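
To see exactly which indices have no log file before resubmitting, something like the following could help. It is a rough sketch that assumes the default array log name `slurm-<jobid>_<index>.out`; the helper name `missing_logs` is hypothetical, and `filter.job` is only assumed to be the submission script because that is the job name sacct reports.

```shell
# Print every array index in [first, last] that has no
# slurm-<jobid>_<index>.out file in the given directory
# (defaults to the current directory).
missing_logs() {
    jobid=$1; first=$2; last=$3; dir=${4:-.}
    i=$first
    while [ "$i" -le "$last" ]; do
        [ -e "$dir/slurm-${jobid}_${i}.out" ] || echo "$i"
        i=$((i + 1))
    done
}

# Example: join the missing indices with commas and print (not run) a
# resubmission command. The script name here is an assumption.
# missing=$(missing_logs 10222509 0 154 | paste -sd, -)
# [ -n "$missing" ] && echo sbatch --array="$missing" filter.job
```

Checking each expected index avoids the pitfalls of grepping the ls output, where a pattern such as "0.out" also matches 10.out, 100.out, and so on.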

Or you can try running in an interactive mode and replicating the same environment to see what the problem might be.

Best,
Cesar