Does the Slurm $TMPDIR environment variable point to a unique directory on Endeavour?

I'm using Endeavour for the first time today (QCB), and I ran into an issue that I have never experienced on Discovery. A bunch of my jobs crashed, and the reason seems to be that they write a temporary file to the directory $TMPDIR (which I assumed was unique to each job) and delete it when they are done. On Endeavour, the crashed jobs complained that this temporary file didn’t exist when they went to remove it.

The only explanation I can think of is that jobs sharing the same node on Endeavour are also sharing the same $TMPDIR, which would cause clashes between those jobs.

Is it correct that $TMPDIR is shared by different jobs running on the same node? If so, is this behavior different on Discovery? What is the intention?

Thanks.

@sagendor The default TMPDIR is the local /tmp on both Discovery and Endeavour nodes. It is limited to 1 GB, is shared among all jobs on the node, and can fill up and be cleared. It is actually implemented as a chunk of memory, not disk space, so we don’t want to use up too much memory on nodes.

That seems to be what you encountered. Usually sharing this /tmp space is fine, but if the jobs running happen to create large temporary files then you could encounter this error.

To change the location of your temporary files, set the TMPDIR variable. For example, make a tmp subdirectory in one of your scratch directories and enter the following:

export TMPDIR=/scratch/<username>/tmp

You could also include this line in your ~/.bashrc to automatically set the variable every time you log in.
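
If several of your jobs write temporary files with the same name, pointing them all at one shared directory could still lead to clashes. One possible approach is to give each job its own subdirectory. The following is only a sketch, not a site-recommended setup: TMP_BASE is a hypothetical variable introduced here for illustration (you might set it to the tmp directory you created in scratch), and SLURM_JOB_ID is the job ID that Slurm sets inside a running job.

```shell
#!/bin/bash
# Base directory for temporary files. TMP_BASE is a hypothetical variable
# used only in this sketch; if it is unset, fall back to the current
# TMPDIR, and then to /tmp.
base="${TMP_BASE:-${TMPDIR:-/tmp}}"
mkdir -p "$base"

# Create a directory that is unique to this job. SLURM_JOB_ID is set by
# Slurm inside a job; "nojob" is used when running outside of one.
job_tmp="$(mktemp -d "$base/job_${SLURM_JOB_ID:-nojob}_XXXXXX")"
export TMPDIR="$job_tmp"

# Remove the per-job directory when the script exits.
trap 'rm -rf "$job_tmp"' EXIT
```

Placed near the top of a job script, before the commands that write temporary files, this should keep each job's files separate even when several jobs share a node.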