CPU affinity: is it a bug or a user config error?

I am noticing that my multithreaded jobs run unusually slowly on the Discovery cluster.

Here is an example job with 22 CPUs:

$ squeue -u tnarayan -S +i -o "%8i %8g %9u %5P %15j %2t %12M %12l %5D %3C %10R %10b %S"
JOBID    GROUP    USER      PARTI NAME    ST TIME         TIME_LIMIT   NODES CPU NODELIST(R TRES_PER_N START_TIME
349513   tnarayan tnarayan  main  bash    R  7:07         1-00:00:00   1     22  d14-06     N/A        2020-10-05T09:04:09

I run the stress command with 22 CPU-bound workers:
$ stress -c 22

and then check CPU utilization with ps -afl:

F S UID         PID   PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 S tnarayan  18623  16423  0  80   0 -  1835 do_wai 09:13 pts/0    00:00:00 stress -c 22
1 R tnarayan  18624  18623  4  80   0 -  1835 -      09:13 pts/0    00:00:01 stress -c 22
1 R tnarayan  18625  18623  4  80   0 -  1835 -      09:13 pts/0    00:00:01 stress -c 22
1 R tnarayan  18626  18623  4  80   0 -  1835 -      09:13 pts/0    00:00:01 stress -c 22
1 R tnarayan  18627  18623  4  80   0 -  1835 -      09:13 pts/0    00:00:01 stress -c 22
1 R tnarayan  18628  18623  4  80   0 -  1835 -      09:13 pts/0    00:00:01 stress -c 22
1 R tnarayan  18629  18623  4  80   0 -  1835 -      09:13 pts/0    00:00:01 stress -c 22
1 R tnarayan  18630  18623  4  80   0 -  1835 -      09:13 pts/0    00:00:01 stress -c 22
1 R tnarayan  18631  18623  4  80   0 -  1835 -      09:13 pts/0    00:00:01 stress -c 22
1 R tnarayan  18632  18623  4  80   0 -  1835 -      09:13 pts/0    00:00:01 stress -c 22
1 R tnarayan  18633  18623  4  80   0 -  1835 -      09:13 pts/0    00:00:01 stress -c 22
1 R tnarayan  18634  18623  4  80   0 -  1835 -      09:13 pts/0    00:00:01 stress -c 22
1 R tnarayan  18635  18623  4  80   0 -  1835 -      09:13 pts/0    00:00:01 stress -c 22
1 R tnarayan  18636  18623  4  80   0 -  1835 -      09:13 pts/0    00:00:01 stress -c 22
1 R tnarayan  18637  18623  4  80   0 -  1835 -      09:13 pts/0    00:00:01 stress -c 22
1 R tnarayan  18638  18623  4  80   0 -  1835 -      09:13 pts/0    00:00:01 stress -c 22
1 R tnarayan  18639  18623  4  80   0 -  1835 -      09:13 pts/0    00:00:01 stress -c 22
1 R tnarayan  18640  18623  4  80   0 -  1835 -      09:13 pts/0    00:00:01 stress -c 22
1 R tnarayan  18641  18623  4  80   0 -  1835 -      09:13 pts/0    00:00:01 stress -c 22
1 R tnarayan  18642  18623  4  80   0 -  1835 -      09:13 pts/0    00:00:01 stress -c 22
1 R tnarayan  18643  18623  4  80   0 -  1835 -      09:13 pts/0    00:00:01 stress -c 22
1 R tnarayan  18644  18623  4  80   0 -  1835 -      09:13 pts/0    00:00:01 stress -c 22
1 R tnarayan  18645  18623  4  80   0 -  1835 -      09:13 pts/0    00:00:01 stress -c 22
0 R tnarayan  18995  17153  0  80   0 - 39397 -      09:14 pts/1    00:00:00 ps -afl

Each worker is getting only about 4% of a CPU.
22 x 4% is less than even one full core (100%); therefore, my multithreaded jobs aren't really running in parallel, and that is a problem.
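
One way to confirm this is an affinity problem rather than just a busy node is to look at the CPU list a worker is allowed to run on, e.g. for PID 18624 from the listing above:

$ taskset -cp 18624
$ grep Cpus_allowed_list /proc/18624/status

If affinity is to blame, the allowed list will contain far fewer than the 22 CPUs I requested.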

Since I have requested 22 CPUs, why is the job not making use of all 22 cores?
This problem sounds familiar to me, and the workaround I know is to use taskset:

$ taskset 0xFFFFFFFFF stress -c 22
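
(The mask 0xFFFFFFFFF has 36 bits set, so it lets the workers run on CPUs 0-35; an equivalent, more readable form would be taskset -c 0-35 stress -c 22, with the range adjusted to the node's CPU count.)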

With this, I now see that the CPUs are used as expected:

F S UID         PID   PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 S tnarayan  19248  16423  0  80   0 -  1835 do_wai 09:15 pts/0    00:00:00 stress -c 22
1 R tnarayan  19249  19248 99  80   0 -  1835 -      09:15 pts/0    00:01:25 stress -c 22
1 R tnarayan  19250  19248 99  80   0 -  1835 -      09:15 pts/0    00:01:25 stress -c 22
1 R tnarayan  19251  19248 99  80   0 -  1835 -      09:15 pts/0    00:01:25 stress -c 22
1 R tnarayan  19252  19248 99  80   0 -  1835 -      09:15 pts/0    00:01:25 stress -c 22
1 R tnarayan  19253  19248 99  80   0 -  1835 -      09:15 pts/0    00:01:25 stress -c 22
1 R tnarayan  19254  19248 99  80   0 -  1835 -      09:15 pts/0    00:01:25 stress -c 22
1 R tnarayan  19255  19248 99  80   0 -  1835 -      09:15 pts/0    00:01:25 stress -c 22
1 R tnarayan  19256  19248 99  80   0 -  1835 -      09:15 pts/0    00:01:25 stress -c 22
1 R tnarayan  19257  19248 99  80   0 -  1835 -      09:15 pts/0    00:01:25 stress -c 22
1 R tnarayan  19258  19248 99  80   0 -  1835 -      09:15 pts/0    00:01:25 stress -c 22
1 R tnarayan  19259  19248 99  80   0 -  1835 -      09:15 pts/0    00:01:25 stress -c 22
1 R tnarayan  19260  19248 99  80   0 -  1835 -      09:15 pts/0    00:01:25 stress -c 22
1 R tnarayan  19261  19248 99  80   0 -  1835 -      09:15 pts/0    00:01:25 stress -c 22
1 R tnarayan  19262  19248 99  80   0 -  1835 -      09:15 pts/0    00:01:25 stress -c 22
1 R tnarayan  19263  19248 99  80   0 -  1835 -      09:15 pts/0    00:01:25 stress -c 22
1 R tnarayan  19264  19248 99  80   0 -  1835 -      09:15 pts/0    00:01:25 stress -c 22
1 R tnarayan  19265  19248 99  80   0 -  1835 -      09:15 pts/0    00:01:25 stress -c 22
1 R tnarayan  19266  19248 99  80   0 -  1835 -      09:15 pts/0    00:01:25 stress -c 22
1 R tnarayan  19267  19248 99  80   0 -  1835 -      09:15 pts/0    00:01:25 stress -c 22
1 R tnarayan  19268  19248 99  80   0 -  1835 -      09:15 pts/0    00:01:25 stress -c 22
1 R tnarayan  19269  19248 99  80   0 -  1835 -      09:15 pts/0    00:01:25 stress -c 22
1 R tnarayan  19270  19248 99  80   0 -  1835 -      09:15 pts/0    00:01:25 stress -c 22
0 R tnarayan  19558  17153  0  80   0 - 39397 -      09:16 pts/1    00:00:00 ps -afl

But I DON'T like prefixing taskset 0xFFFFFFFFF every time; very likely I will forget it at some point and regret it later.

How can I fix this issue without falling back to taskset?
Is there something I can add to my .bashrc, or something the admins can fix in the OS/Slurm configs?

Thanks

Are you using OpenMP threads? Can you share your Slurm job script?

We provide template Slurm job scripts for multi-threaded jobs, which may be helpful in your case:
https://carc.usc.edu/user-information/user-guides/hpc-basics/slurm-templates

@molguin
I am using the multithreading supported by PySpark, and I am not sure whether it uses OpenMP.

My Slurm job script is pretty complicated, so here is a simpler reproduction (if this works, my job will work too):

$ srun -p epyc-64 -t 0-0:30:00 --mem=4G -N 1 -n 10 --pty bash

$ ~tnarayan/work/miniconda3/bin/stress -c 10
# view htop on another terminal on the same node while stress is running

The behavior is not uniform across the nodes in the cluster.
For example, epyc-64 definitely has this issue, and some nodes in the main partition have it while others do not. Please try this on a couple of nodes to confirm.
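
(I am not sure whether this is configured per node, but maybe the admins can check the task plugin setting, e.g.:

$ scontrol show config | grep -i taskplugin

which shows whether the task/affinity and/or task/cgroup plugins are in use.)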

Thanks

The Slurm resource manager has a lot of powerful features for task affinity. The CPU management documentation covers them in detail:
https://slurm.schedmd.com/cpu_management.html

Can you try the following srun command:
srun -p epyc-64 -t 00:30:00 --mem=4G -N 1 -n 1 -c 10 --pty bash

The above is equivalent to the following Slurm #SBATCH directives:
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
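
For reference, a minimal multi-threaded job script along these lines might look like the following (my_script.py is a placeholder, and you would adapt the partition, memory, and time to your job):

#!/bin/bash
#SBATCH --partition=epyc-64
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=4G
#SBATCH --time=00:30:00

# Slurm sets SLURM_CPUS_PER_TASK to the value of --cpus-per-task,
# so threaded libraries can be told how many CPUs they may use.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

python my_script.py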

Yes, this solved my problem! :100:
We should use -n 1 -c 10 instead of -n 10 (with -n 10, -c defaults to 1, which is why the whole stress run was capped at 100% of one CPU).
This explains a lot, and I have learned how powerful Slurm's task affinity is.
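
A quick sanity check from inside a new allocation is to compare what Slurm granted with what the shell can actually see (nproc honors the affinity mask):

$ echo $SLURM_CPUS_PER_TASK
$ nproc

With -n 1 -c 10 both should report 10; with the old -n 10 request, nproc would report just the one core the stress workers were sharing.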

Thank you x 10000
-TG