Cromwell does not deal well with Singularity on compute nodes that have internet access disabled.
I am not able to run or see your workflow, but the error messages tell me that Singularity is trying to build the container, which should not happen unless the config file is incorrect.
I got it to work today with my GATK test case. While it still requires providing a Docker image URL, it actually uses a pre-downloaded image that was converted to a SIF file.
The steps to reproduce my setup are below (I adjusted the files to include your username and account):
Cromwell config (this version uses SIF files directly): cromwellslurmsingularitynew.conf
# This line is required. It pulls in default overrides from the embedded cromwell
# `reference.conf` (in core/src/main/resources) needed for proper performance of cromwell.
include required(classpath("application"))
# Cromwell HTTP server settings
webservice {
#port = 8000
#interface = 0.0.0.0
#binding-timeout = 5s
#instance.name = "reference"
}
# Cromwell "system" settings
system {
# If 'true', a SIGINT will trigger Cromwell to attempt to abort all currently running jobs before exiting
#abort-jobs-on-terminate = false
# If 'true', a SIGTERM or SIGINT will trigger Cromwell to attempt to gracefully shutdown in server mode,
# in particular clearing up all queued database writes before letting the JVM shut down.
# The shutdown is a multi-phase process, each phase having its own configurable timeout. See the Dev Wiki for more details.
#graceful-server-shutdown = true
# Cromwell will cap the number of running workflows at N
#max-concurrent-workflows = 5000
# Cromwell will launch up to N submitted workflows at a time, regardless of how many open workflow slots exist
#max-workflow-launch-count = 50
# Number of seconds between workflow launches
#new-workflow-poll-rate = 20
# Since the WorkflowLogCopyRouter is initialized in code, this is the number of workers
#number-of-workflow-log-copy-workers = 10
# Default number of cache read workers
#number-of-cache-read-workers = 25
io {
# throttle {
# # Global Throttling - This is mostly useful for GCS and can be adjusted to match
# # the quota available on the GCS API
# #number-of-requests = 100000
# #per = 100 seconds
# }
# Number of times an I/O operation should be attempted before giving up and failing it.
#number-of-attempts = 5
}
# Maximum number of input file bytes allowed in order to read each type.
# If exceeded a FileSizeTooBig exception will be thrown.
input-read-limits {
#lines = 128000
#bool = 7
#int = 19
#float = 50
#string = 128000
#json = 128000
#tsv = 128000
#map = 128000
#object = 128000
}
abort {
# These are the default values in Cromwell; in most circumstances there should not be a need to change them.
# How frequently Cromwell should scan for aborts.
scan-frequency: 30 seconds
# The cache of in-progress aborts. Cromwell will add entries to this cache once a WorkflowActor has been messaged to abort.
# If on the next scan an 'Aborting' status is found for a workflow that has an entry in this cache, Cromwell will not ask
# the associated WorkflowActor to abort again.
cache {
enabled: true
# Guava cache concurrency.
concurrency: 1
# How long entries in the cache should live from the time they are added to the cache.
ttl: 20 minutes
# Maximum number of entries in the cache.
size: 100000
}
}
# Cromwell reads this value into the JVM's `networkaddress.cache.ttl` setting to control DNS cache expiration
dns-cache-ttl: 3 minutes
}
docker {
hash-lookup {
# Set this to match your available quota against the Google Container Engine API
#gcr-api-queries-per-100-seconds = 1000
# Time in minutes before an entry expires from the docker hashes cache and needs to be fetched again
#cache-entry-ttl = "20 minutes"
# Maximum number of elements to be kept in the cache. If the limit is reached, old elements will be removed from the cache
#cache-size = 200
# How should docker hashes be looked up. Possible values are "local" and "remote"
# "local": Lookup hashes on the local docker daemon using the cli
# "remote": Lookup hashes on docker hub, gcr, gar, quay
#method = "remote"
enabled = "false"
}
}
# Here is where you can define the backend providers that Cromwell understands.
# The default is a local provider.
# To add additional backend providers, you should copy paste additional backends
# of interest that you can find in the cromwell.example.backends folder
# folder at https://www.github.com/broadinstitute/cromwell
# Other backend providers include SGE, SLURM, Docker, udocker, Singularity, etc.
# Don't forget you will need to customize them for your particular use case.
backend {
# Override the default backend.
default = slurm
# The list of providers.
providers {
# Copy paste the contents of a backend provider in this section
# Examples in cromwell.example.backends include:
# LocalExample: What you should use if you want to define a new backend provider
# AWS: Amazon Web Services
# BCS: Alibaba Cloud Batch Compute
# TES: protocol defined by GA4GH
# TESK: the same, with kubernetes support
# Google Pipelines, v2 (PAPIv2)
# Docker
# Singularity: a container safe for HPC
# Singularity+Slurm: and an example on Slurm
# udocker: another rootless container solution
# udocker+slurm: also exemplified on slurm
# HtCondor: workload manager at UW-Madison
# LSF: the Platform Load Sharing Facility backend
# SGE: Sun Grid Engine
# SLURM: workload manager
# Note that these other backend examples will need tweaking and configuration.
# Please open an issue https://www.github.com/broadinstitute/cromwell if you have any questions
slurm {
actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
config {
# Root directory where Cromwell writes job results in the container. This value
# can be used to specify where the execution folder is mounted in the container.
# It is used for the construction of the docker_cwd string in the submit-docker
# value below.
dockerRoot = "/cromwell-executions"
concurrent-job-limit = 10
# If an 'exit-code-timeout-seconds' value is specified:
# - check-alive will be run at this interval for every job
# - if a job is found to be not alive, and no RC file appears after this interval
# - Then it will be marked as Failed.
## Warning: If set, Cromwell will run 'check-alive' for every job at this interval
exit-code-timeout-seconds = 360
filesystems {
local {
localization: [
# soft link does not work for docker with --contain. Hard links won't work
# across file systems
"copy", "hard-link", "soft-link"
]
caching {
duplication-strategy: ["copy", "hard-link", "soft-link"]
hashing-strategy: "file"
}
}
}
#
runtime-attributes = """
Int runtime_minutes = 600
Int cpus = 2
Int requested_memory_mb_per_core = 8000
String? docker
String? partition
String? account
String? IMAGE
"""
submit = """
sbatch \
--wait \
--job-name=${job_name} \
--chdir=${cwd} \
--output=${out} \
--error=${err} \
--time=${runtime_minutes} \
${"--cpus-per-task=" + cpus} \
--mem-per-cpu=${requested_memory_mb_per_core} \
--partition=${partition} \
--account=${account} \
--wrap "/bin/bash ${script}"
"""
submit-docker = """
# SINGULARITY_CACHEDIR needs to point to a directory accessible by
# the jobs (i.e. not lscratch). Might want to use a workflow local
# cache dir like in run.sh
source /scratch2/rklotz/set_singularity_cachedir.sh
SINGULARITY_CACHEDIR=/scratch2/rklotz/singularity-cache
echo "SINGULARITY_CACHEDIR $SINGULARITY_CACHEDIR"
if [ -z "$SINGULARITY_CACHEDIR" ]; then
CACHE_DIR=$HOME/.singularity
else
CACHE_DIR=$SINGULARITY_CACHEDIR
fi
mkdir -p $CACHE_DIR
LOCK_FILE=$CACHE_DIR/singularity_pull_flock
# we want to avoid all the cromwell tasks hammering each other trying
# to pull the container into the cache for the first time. flock works
# on GPFS, netapp, and vast (of course only for processes on the same
# machine which is the case here since we're pulling it in the master
# process before submitting).
#flock --exclusive --timeout 1200 $LOCK_FILE \
# singularity exec --containall docker://${docker} \
# echo "successfully pulled ${docker}!" &> /dev/null
# Ensure singularity is loaded if it's installed as a module
#module load Singularity/3.0.1
# Build the Docker image into a singularity image
#IMAGE=${docker}.sif
#singularity build $IMAGE docker://${docker}
# Submit the script to SLURM
sbatch \
--wait \
--job-name=${job_name} \
--chdir=${cwd} \
--output=${cwd}/execution/stdout \
--error=${cwd}/execution/stderr \
--time=${runtime_minutes} \
${"--cpus-per-task=" + cpus} \
--mem-per-cpu=${requested_memory_mb_per_core} \
--partition=${partition} \
--account=${account} \
--wrap "singularity exec --containall --bind ${cwd}:${docker_cwd} ${IMAGE} ${job_shell} ${docker_script}"
"""
kill = "scancel ${job_id}"
check-alive = "squeue -j ${job_id}"
job-id-regex = "Submitted batch job (\\d+).*"
}
}
}
}
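The commented-out flock guard in submit-docker is worth keeping in mind if many tasks might pull the same image for the first time concurrently. Here is a minimal, locally runnable sketch of the same serialization idea (the lock file and echoed message are placeholders, not part of the real setup):

```shell
# Serialize a critical section across processes with an exclusive flock,
# mirroring the commented-out image-pull guard in submit-docker.
LOCK_FILE=$(mktemp)
# Only one process at a time may hold the lock; others wait up to 10 seconds
# (the real config uses --timeout 1200 around the singularity pull).
RESULT=$(flock --exclusive --timeout 10 "$LOCK_FILE" echo "pull serialized")
echo "$RESULT"
rm -f "$LOCK_FILE"
```

Because flock operates per-machine, this works here since all tasks are submitted from the same node.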
Workflow: gatktest.wdl
workflow helloCountBasesCaller {
call CountBasesCaller
}
task CountBasesCaller {
String GATKcontainer
String sampleName
String partition
String account
String IMAGE
File inputBAM
command {
gatk \
CountBases \
-I ${inputBAM} \
> ${sampleName}.txt
}
output {
File rawTXT = "${sampleName}.txt"
}
runtime {
docker: "${GATKcontainer}"
IMAGE: "${IMAGE}"
partition: "${partition}"
account: "${account}"
}
}
JSON file with inputs: gatktest_inputs.json
{
"helloCountBasesCaller.CountBasesCaller.inputBAM": "/project/biodb/NA12878.bam",
"helloCountBasesCaller.CountBasesCaller.sampleName": "outdata",
"helloCountBasesCaller.CountBasesCaller.GATKcontainer": "broadinstitute/gatk:4.2.3.0",
"helloCountBasesCaller.CountBasesCaller.partition": "main",
"helloCountBasesCaller.CountBasesCaller.account": "rklotz_600",
"helloCountBasesCaller.CountBasesCaller.IMAGE": "/scratch2/rklotz/singularity-cache/pull/gatk_4.2.3.0.sif"
}
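Before launching, it can save a failed run to check that every file path in the inputs JSON actually exists on the cluster filesystem. A rough sketch (the throwaway JSON below stands in for gatktest_inputs.json; on the cluster, point INPUTS at the real file):

```shell
# Flag missing input files referenced in an inputs JSON.
# A temp JSON is used here purely for illustration.
INPUTS=$(mktemp)
cat > "$INPUTS" <<'EOF'
{ "helloCountBasesCaller.CountBasesCaller.inputBAM": "/project/biodb/NA12878.bam" }
EOF
# Pull out every absolute path in the JSON and test for existence.
STATUS=$(grep -o '"/[^"]*"' "$INPUTS" | tr -d '"' | while read -r p; do
  if [ -e "$p" ]; then echo "OK: $p"; else echo "MISSING: $p"; fi
done)
echo "$STATUS"
rm -f "$INPUTS"
```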
Then prepare for the test run.
On the login node:
mkdir -p /scratch2/rklotz/singularity-cache/pull
cd /scratch2/rklotz
cat > set_singularity_cachedir.sh <<'EOF'
#!/bin/bash
export SINGULARITY_CACHEDIR=/scratch2/rklotz/singularity-cache
export SINGULARITY_TMPDIR=$SINGULARITY_CACHEDIR/tmp
export SINGULARITY_PULLDIR=$SINGULARITY_CACHEDIR/pull
export CWL_SINGULARITY_CACHE=$SINGULARITY_PULLDIR
EOF
source set_singularity_cachedir.sh
singularity pull docker://broadinstitute/gatk:4.2.3.0
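One subtlety when writing that script via a heredoc: an unquoted EOF delimiter expands $SINGULARITY_CACHEDIR at write time (when it may still be empty), while a quoted 'EOF' keeps the variables literal so they expand only when the script is sourced. A quick local demo (the DEMO_* names and paths are placeholders):

```shell
# With a quoted heredoc delimiter, the $VARS survive into the file verbatim
# and are expanded only when the script is sourced.
unset DEMO_CACHE DEMO_TMP
DEMO_SCRIPT=$(mktemp)
cat > "$DEMO_SCRIPT" <<'EOF'
export DEMO_CACHE=/tmp/demo-cache
export DEMO_TMP=$DEMO_CACHE/tmp
EOF
. "$DEMO_SCRIPT"
# DEMO_TMP is built from DEMO_CACHE at source time, not at write time.
echo "$DEMO_TMP"
rm -f "$DEMO_SCRIPT"
```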
Start an interactive session:
salloc --nodes=1 --ntasks=4 --cpus-per-task=4 --mem=16GB --account=rklotz_600 --partition=main --time=8:00:00
Inside the interactive session:
module purge
module load USC openjdk
java -jar -Dconfig.file=/scratch2/rklotz/cromwell/cromwellslurmsingularitynew.conf /scratch2/rklotz/cromwell/cromwell-71.jar run /scratch2/rklotz/cromwell/gatktest.wdl -i /scratch2/rklotz/cromwell/gatktest_inputs.json
If that goes well and you see something like this:
[2022-04-15 11:09:07,31] [info] WorkflowExecutionActor-b3cdada3-6972-4d8d-94f1-96874c9533b1 [b3cdada3]: Workflow helloCountBasesCaller complete. Final Outputs:
{
"helloCountBasesCaller.CountBasesCaller.rawTXT": "/home1/osinski/cromwell-executions/helloCountBasesCaller/b3cdada3-6972-4d8d-94f1-96874c9533b1/call-CountBasesCaller/execution/outdata.txt"
}
[2022-04-15 11:09:10,36] [info] WorkflowManagerActor: Workflow actor for b3cdada3-6972-4d8d-94f1-96874c9533b1 completed with status 'Succeeded'. The workflow will be removed from the workflow store.
[2022-04-15 11:09:12,95] [info] SingleWorkflowRunnerActor workflow finished with status 'Succeeded'.
{
"outputs": {
"helloCountBasesCaller.CountBasesCaller.rawTXT": "/home1/osinski/cromwell-executions/helloCountBasesCaller/b3cdada3-6972-4d8d-94f1-96874c9533b1/call-CountBasesCaller/execution/outdata.txt"
},
"id": "b3cdada3-6972-4d8d-94f1-96874c9533b1"
}
[2022-04-15 11:09:15,40] [info] Workflow polling stopped
Then, once you adjust your WDL and JSON files (adding IMAGE to the runtime section as in this test case), everything should work.
Best regards,
Tomek