Hey! I'm trying to use the slurmR package (which I maintain), and I'm running into some strange issues when submitting jobs from within a job (I know this sounds a bit weird, but let me explain).
At the first level, I run a job that processes simulation data. The simulation data can be too large to handle on the login node, so I need to work with it on a compute node. The data itself is generated by a call to `parallel::parLapply`, which uses a socket cluster; that cluster is created by submitting a job to Slurm and requesting `ntasks`. Let me show you with code:
The main program, which I saved as an R script named `permtest.R`, looks something like this:
```r
library(slurmR)
library(parallel)

# A wrapper of makePSOCKcluster: this requests resources by submitting
# a job that sleeps forever; the job is stopped later on once I call
# `stopCluster(cl)`.
cl  <- makeSlurmCluster(100)
ans <- parLapply(cl, 1:1000, function(i) runif(100))
stopCluster(cl)
```
If I run this from the login node, it works as expected. But if I allocate a compute node, open R, and source the script there, `makeSlurmCluster(100)` successfully obtains the resources, yet as soon as the connection is used I get this error:
```
ssh_exchange_identification: read: Connection reset by peer
```
If I instead submit the R script using `sbatch` (i.e., a nested job submission), I get this strange error:
```
srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive
slurmstepd: error: *** JOB 76826 ON d17-36 CANCELLED AT 2020-08-28T14:41:58 ***
```
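If it is relevant, my guess is that the conflict comes from inherited environment variables: as far as I understand, `sbatch --mem` exports `SLURM_MEM_PER_NODE` into the job environment, and if the nested step also ends up with a per-CPU memory setting, both variables are set at once. A toy reproduction of that state (the values here are made up):

```shell
# Simulate what an sbatch job submitted with --mem=4GB would see:
# sbatch exports the per-node limit (in MB) into the environment.
export SLURM_MEM_PER_NODE=4096

# If the nested submission additionally sets a per-CPU limit
# (value made up), both variables coexist:
export SLURM_MEM_PER_CPU=2048

# srun aborts when more than one SLURM_MEM_PER_* variable is set:
env | grep -c 'SLURM_MEM_PER_'   # prints 2
```

So the fatal message would be about the environment the nested step inherits, not about any flags I pass explicitly, but I have not confirmed this.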
Again, I got that error after submitting the job with this script:
```sh
#!/bin/sh
#SBATCH --job-name=Permutation-test
#SBATCH --mem=4GB
#SBATCH --time=01:00:00
module load usc
module load r/4.0.0
Rscript --vanilla /home1/vegayon/permtest.R
```
This works on the "old cluster"; I'm not sure why it is not working here.