I have recently been working on a legacy parallel code on the Discovery cluster. I had previously done several test runs of this code there, and it worked fine. Recently, however, I have run into some problems. Sometimes when I submit a job, Slurm allocates me a set of nodes like this:
However, this leads to some strange problems: the error messages repeatedly report that node d17-33 has invalid memory or that it cannot open some library files. I then added an exclude option to my Slurm script to exclude node d17-33, and the code ran without any problems, like this:
I am quite confused by this.
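For reference, the exclusion I described can be written roughly as below. This is a minimal sketch, not my exact script: the job name, node and task counts, time limit, and the commented-out launch line are placeholders, and only the `--exclude=d17-33` directive corresponds to what I actually changed. The two `echo` lines are an extra I would suggest so the job log records which nodes were allocated.

```shell
#!/bin/bash
# Placeholder job settings (assumptions, adjust to the real job)
#SBATCH --job-name=legacy_test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:10:00
# Keep Slurm from allocating the node that produced the errors
#SBATCH --exclude=d17-33

# Record the allocation in the job log; falls back to a note
# when the script is run outside of a Slurm job.
nodelist="${SLURM_JOB_NODELIST:-not running under Slurm}"
echo "Allocated nodes: ${nodelist}"
echo "Batch host: $(uname -n)"

# Hypothetical launch line, left commented out because the real
# executable name is not given here:
# srun ./my_legacy_code
```

With the node list printed at the top of every log, it is easier to check whether the failing runs always include d17-33.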
I am using GCC 8.3.0 and Open MPI 4.0.2. Any suggestions? Thank you!