Manual process placement on multi-processor machines
Why process placement matters
The Linux kernel does a poor job of assigning processes to CPUs if you care about HPC performance. Its scheduling is biased towards systems with lots of mostly idle processes. For best performance of CPU-intensive user jobs, you need to manage process placement manually.
This article explains the problem and then gives specific instructions for some of the EGFD group's machines. A more detailed discussion of machine topology is included at the end.
Modern machine architecture
Modern machines have multiple CPUs (or sockets) each with multiple cores, and lots of memory. All the CPUs can see all the memory, so it is easy to think of the machine in a simple linear or flat topology such as a single bus with all CPUs and all memory attached to that bus. At a high level that's how it appears to work. But for best performance, the actual topology is very significant.
In a multi-processor system such as hood.math (four 8-core CPUs) each CPU has some memory that is topologically close to it. The memory belonging to other CPUs is farther away, hence slower to access. This arrangement is called NUMA (non-uniform memory access). In a large multi-processor system such as kazan.math (64 single-core CPUs) the CPUs are also organized such that some are close to each other and some are farther away. Performance is better if a CPU uses the memory that's closest to it, and if a multi-process job uses CPUs that are close to one another.
Furthermore, the OS tends to switch processes around from core to core or CPU to CPU rather than keeping them on the same CPU. Performance is better if you can pin a process to a core.
Finally, the OS tends to run several processes on the same CPU (or even the same core within a CPU) rather than spreading them around. Performance is better if processes are spread around with no contention on the same core (or even the same CPU).
Here's a nice article that illustrates this:
Of course, performance is also better if you do not run more jobs than there are CPU cores.
To manage process placement yourself, you need to know
- how many threads (processes) your job uses
- how much memory your job uses
- how many CPUs and how much memory the machine has
- which CPUs are busy already when you want to run something
- how to force your processes to use specific CPUs
Some of these depend on the particular machine and operating system.
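As a rough illustration of gathering the first three items on a Linux machine, here is a minimal sketch that reads them from /proc. The helper names job_info and machine_info are made up for this example, and it assumes the usual Linux /proc layout; site-specific tools may give better answers.

```shell
#!/bin/sh
# Hypothetical helpers (not part of any installed tool set).

# job_info PID: print the thread count and resident memory (kB) of a
# running process, as reported by the kernel in /proc/PID/status.
job_info() {
    awk '/^Threads:/ {print "threads:", $2}
         /^VmRSS:/   {print "resident_kB:", $2}' "/proc/$1/status"
}

# machine_info: print the number of online CPUs (cores) and total RAM (kB).
machine_info() {
    echo "cpus: $(getconf _NPROCESSORS_ONLN)"
    awk '/^MemTotal:/ {print "mem_total_kB:", $2}' /proc/meminfo
}
```

For example, job_info $$ reports on the current shell. Resident memory is only a snapshot, so check it while the job is at its peak.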
Job placement on winisk and kazan
Winisk and kazan have complicated topologies. If you really want to get the most from these machines, it helps to understand the topologies and locate processes accordingly within the machines. Details of this are at the end of this article. Knowing how to place a job on a particular CPU may be enough for most purposes.
winisk
- 32 single-core Itanium2 CPUs, 192 GB RAM
- one module is offline, so total of 30 CPUs, 180 GB RAM available
- two CPUs and their local memory are reserved for the OS
- 28 CPUs and 168 GB RAM available for user jobs
kazan
- 64 single-core Itanium2 CPUs, 128 GB memory
- two CPUs and their four GB of local memory are reserved for the OS
- 62 CPUs and 124 GB RAM available for user jobs
CPU sets
Both winisk and kazan use "cpusets" to separate OS activity from user activity. Each machine has two CPUs and their local memory reserved for the OS and the rest for user processes.
Viewing activity
To see which CPUs are busy and which are free, use the pmshub(1) command. Start with
% pmshub -A
to activate, then
% pmshub &
to get a graphical representation of all the nodes in the machine. A blue bar-graph in each CPU cell tells how busy that CPU is. Free memory as well as cache activity are also shown. Node 0 (two CPUs) is reserved for the OS.
Pinning a job on a CPU
From the pmshub display you can see which CPUs are free. Use the dplace(1) command to pin a process to a CPU. E.g.
% dplace -c 5 a.out
NOTE: dplace CPU numbering is relative to the cpuset, not to the physical machine. Since CPUs 0 and 1 are reserved for the OS (on both winisk and kazan), this example puts the job on CPU number 5 within the cpuset, which is CPU number 7 in the physical system.
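The numbering offset in the note above can be sketched as a small shell function. The name cpuset_to_physical is hypothetical, and the sketch assumes the reserved CPUs are the lowest-numbered ones and that the user cpuset is numbered contiguously after them; an offline module could break that assumption.

```shell
#!/bin/sh
# cpuset_to_physical REL [RESERVED]
# Convert a cpuset-relative CPU number (as used by dplace -c) to a
# physical CPU number. RESERVED defaults to 2, matching the two CPUs
# (0 and 1) reserved for the OS on winisk and kazan.
# Assumption: the user cpuset starts right after the reserved CPUs.
cpuset_to_physical() {
    rel=$1
    reserved=${2:-2}
    echo $((rel + reserved))
}

cpuset_to_physical 5    # prints 7, as in the dplace example
```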
Many more options are available for dplace.
Job placement on hood
Hood.math has four CPUs (sockets), each with eight cores. The CPUs are Intel Xeon E5-4640, which support hyperthreading (two threads per core), but we have turned hyperthreading off for better performance. Thus the system has a total of 32 cores available. Running more than 32 CPU-intensive processes at once will reduce performance. Total memory is 256 GB.
Each of the four CPUs has 64 GB of local memory. Accessing memory attached to another CPU is somewhat slower, but the penalty is not as pronounced as on winisk and kazan.
There are no cpusets on hood.math. All CPUs and memory are available to the OS and to users.
Viewing activity
There is no program like pmshub to visualize CPU activity on hood. Use the getfreecore command for a report on which CPUs seem busy or available. (This is a slightly modified version of the program getfreesocket described on the web page mentioned above.)
The numactl(1) (NUMA control) command can describe the CPUs and memory topology but cannot show what's busy or free.
% numactl --hardware
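The same core-to-node mapping can usually be read straight from sysfs. The following is a minimal sketch, assuming a NUMA-aware Linux kernel that exposes /sys/devices/system/node/nodeN/cpulist; the function name node_cpus is made up for this example, and what it prints depends entirely on the machine.

```shell
#!/bin/sh
# node_cpus [SYSROOT]
# Print one line per NUMA node listing the cores that belong to it.
# SYSROOT defaults to the standard sysfs location; passing another
# directory is only useful for testing.
node_cpus() {
    root=${1:-/sys/devices/system/node}
    for d in "$root"/node[0-9]*; do
        # Skip anything that is not a readable node directory.
        [ -r "$d/cpulist" ] || continue
        echo "${d##*/}: $(cat "$d/cpulist")"
    done
}
```

Knowing which core numbers sit on which node helps when choosing values for the numactl binding options described next.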
Pinning a job on a CPU
Use the numactl(1) command to pin a process to a CPU. For example,
% numactl --physcpubind=7 ./a.out
binds the a.out job to core number 7.
% numactl --cpunodebind=2 --membind=2,3 ./a.out
binds the a.out job to node number 2 and the memory of nodes 2 and 3.
(Unfortunately, terminology is inconsistent. The numactl(1) man page uses node to mean a CPU socket, and CPU to mean a single core.)
Many other options are available for numactl.
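To spread several copies of a job over distinct cores, one numactl invocation per core is needed. The sketch below only prints the command lines so they can be reviewed before running; spread_cmds is a hypothetical helper name, and the core numbers you pass should be ones that getfreecore reports as free.

```shell
#!/bin/sh
# spread_cmds CMD CORE...
# Print one "numactl --physcpubind=CORE CMD &" line per core, followed
# by "wait", so that each copy of CMD is pinned to its own core.
# Nothing is executed here; pipe the output to sh to run it.
spread_cmds() {
    cmd=$1
    shift
    for core in "$@"; do
        echo "numactl --physcpubind=$core $cmd &"
    done
    echo "wait"
}
```

For example, spread_cmds ./a.out 8 9 10 11 prints four pinned command lines; appending | sh runs them in the background and waits for all four to finish.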
CPU restriction on hood
The taskset(1) command can be used to restrict a job to certain CPUs as well, e.g.
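% taskset -c 2,3 ./a.out
allows the a.out process to use only CPUs 2 and 3. (With the -p option, taskset expects the PID of an already running process instead of a command to launch.)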
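To check which CPUs a running process is actually allowed to use, the kernel reports the list in /proc/PID/status. A minimal sketch, assuming a Linux kernel that provides the Cpus_allowed_list field; the function name allowed_cpus is made up for this example.

```shell
#!/bin/sh
# allowed_cpus PID
# Print the list of CPUs the given process may run on, as reported by
# the kernel in the Cpus_allowed_list field of /proc/PID/status.
allowed_cpus() {
    awk '/^Cpus_allowed_list:/ {print $2}' "/proc/$1/status"
}
```

For example, allowed_cpus $$ shows the restriction inherited by the current shell. The list shows where a process may run, not where it is running at this moment.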
Topology details
Winisk and kazan are assembled from CPU/memory modules called C-bricks that are connected via special "NUMAlink" cabling to each other and/or to routing modules.
winisk (SGI Altix 350)
- two CPUs per C-brick
- 6 GB RAM per CPU
- sixteen C-bricks, each connected to one other C-brick and to one router (but one C-brick is offline)
- two routers
- each C-brick has another C-brick that is one hop away, seven other C-bricks on the same router that are two hops away, and seven other C-bricks on the other router that are three hops away
- run the gtopology(1) command to see a picture of the topology
- the pink squares are the C-bricks; the yellow and blue dots are the routers
kazan (SGI Altix 3700)
- four CPUs per C-brick
- 2 GB RAM per CPU
- 16 C-bricks, each connected to two different routers
- total of 64 CPUs, 128 GB memory in the machine
- two CPUs and their four GB local memory are reserved for the OS
- 62 CPUs and 124 GB RAM available for user jobs
- eight routers, each connected to two other routers
- fat tree topology
- unfortunately the gtopology command is not working on kazan
hood (SuperMicro 8047R)
- self-contained
- four 8-core CPU sockets, each with 64 GB of local memory
- use the lstopo(1) command to see a graphical representation, but it does not show the NUMA relationship among the CPUs