Creating a Socket Cluster with the slurmR R package

Hey! So I am trying to use the slurmR package (which I maintain), and I’m having some strange issues when submitting jobs within a job (I know this sounds a bit weird, but let me explain).

At the first level, I’m running a job that processes simulation data. The simulation data itself can be too large to handle on the login node, so I need to work with it on a compute node. The data is generated by a call to parallel::parLapply, which uses a socket cluster; that cluster is created by submitting a job to Slurm and requesting ntasks. Let me show you with code:

The main program, which I saved as an R script named permtest.R, looks something like this:

library(slurmR)
library(parallel)

# A wrapper of makePSOCKcluster(): it requests resources by submitting
# a Slurm job that sleeps forever; that job is stopped later on, once I
# call stopCluster(cl)
cl <- makeSlurmCluster(100)

ans <- parLapply(cl, 1:1000, function(i) runif(100))
stopCluster(cl)

If I run this from the login node, it works as expected. But if I allocate a compute node, open R, and source the above script, makeSlurmCluster(100) obtains its resources successfully, and then I get this error as soon as the connection is used:

ssh_exchange_identification: read: Connection reset by peer

If I try to submit this R script using sbatch (like a nested job submission), I get this strange error:

srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive
slurmstepd: error: *** JOB 76826 ON d17-36 CANCELLED AT 2020-08-28T14:41:58 ***

Again, I got this error after submitting the job using this script:

#!/bin/sh
#SBATCH --job-name=Permutation-test
#SBATCH --mem=4GB
#SBATCH --time=01:00:00
module load usc
module load r/4.0.0
Rscript --vanilla /home1/vegayon/permtest.R

This works on the “old cluster”; I’m not sure why it is not working here.

On Discovery we do not allow users to SSH between compute nodes, and that is the root of the error: any attempt to do so produces ssh_exchange_identification: read: Connection reset by peer.

The makePSOCKcluster() function uses SSH to create and communicate with the workers, and the rshcmd option only lets you pick an alternative SSH client for spawning them. So as far as I can tell, this function simply will not work across multiple nodes on Discovery. Creating an MPI cluster should work, though of course that would add more complexity to your package.
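To illustrate what I mean, worker startup boils down to something like the sketch below (the host names are hypothetical); rshcmd only swaps out which SSH client is called, it cannot remove the node-to-node SSH step itself:

library(parallel)

# Each worker is launched by running R on the remote host through the
# rshcmd command ("ssh" by default); with SSH blocked, no worker starts
cl <- makePSOCKcluster(
  c("d17-36", "d17-37"),  # hypothetical compute-node host names
  rshcmd = "ssh"          # only the client is configurable, not the mechanism
)
stopCluster(cl)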

Could you use the Slurm_lapply() function instead for this task?

I see. I can surely use Slurm_lapply() instead; it is more flexible and is not limited to ~120 instances of R the way a socket cluster is. The only issue is that code written against makePSOCKcluster() is easier to port. Would it be possible to allow SSH between compute nodes provided that they share a job? The makeSlurmCluster() function only uses nodes that were allocated to the job.
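For example, the parLapply() call in permtest.R maps to something like this (the njobs and mc.cores values are just illustrative):

library(slurmR)

# Same computation as the socket-cluster version, but split across a
# Slurm job array; njobs * mc.cores gives the effective parallelism
ans <- Slurm_lapply(
  1:1000, function(i) runif(100),
  njobs    = 10,  # number of array tasks submitted to Slurm
  mc.cores = 10   # cores each task uses via mclapply()
)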

I could try using an MPI cluster as well, but I would need to figure out how to do so and, furthermore, whether I can do it without relying on other packages. Part of the beauty of the slurmR package is that it is dependency-free, so it runs out of the box on any system with Slurm. If you feel like it, and it is useful for USC, I would be happy to discuss things to add/modify in the slurmR package :slight_smile:.

Technically that may be possible, but it would likely still cause problems, because on Discovery we do not use host-based authentication for SSH (only key-based authentication). Supporting MPI would require dependencies, such as the Rmpi or pbdMPI packages, so it sounds like you probably don’t want to go that route. I will look into this further, though.
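For reference, the base-R route to an MPI cluster is roughly the sketch below; it is the type = "MPI" path that pulls in Rmpi as a hard dependency:

library(parallel)

# With type = "MPI", makeCluster() spawns the workers through Rmpi,
# so Rmpi (plus a working MPI installation) becomes a dependency
cl <- makeCluster(100, type = "MPI")
ans <- parLapply(cl, 1:1000, function(i) runif(100))
stopCluster(cl)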

@vegayon Perhaps you’ve discovered this already, but makeSlurmCluster() works now (with the mem argument). We enabled SSH between compute nodes.
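So something like this should run now; a minimal sketch with an illustrative memory request (presumably this also sidesteps the SLURM_MEM_PER_* conflict from the nested submission):

library(slurmR)
library(parallel)

# As before, but with an explicit memory request for the cluster job
cl <- makeSlurmCluster(100, mem = "4G")
ans <- parLapply(cl, 1:1000, function(i) runif(100))
stopCluster(cl)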


And it is working on the dev version as well. Thanks!