Zone reclaim mode

Non-uniform memory access, or NUMA, is not a new concept, but high-end multiprocessor Intel-based servers are increasingly configured with this architecture, bringing it further into the mainstream. Put simply, NUMA means that instead of all processors accessing main system memory through a common bus, each processor is allocated an even share of the memory that it can address directly. If a processor needs memory controlled by another processor, it can access it via that other processor, at a higher latency.

Linux kernels from v2.5 onwards are aware of any NUMA architecture, which can be displayed using numactl -H or numactl --hardware:

node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10

The above is from a four-socket server. It shows that fetching from local memory is weighted at '10' and fetching from memory controlled by other processors at '21'. I strongly suspect these weightings are hard-coded.

numactl -H also shows information about how the memory is split between processors. The term ‘node’ is used:

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 65418 MB
node 0 free: 310 MB
node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 1 size: 65536 MB
node 1 free: 41 MB
node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 2 size: 65536 MB
node 2 free: 82 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 65536 MB
node 3 free: 43 MB

The above shows that the free memory available to each node varies. If a process running on node 3, in our example, needs to allocate more than 43 MB of memory, it can either:

  • Use memory assigned to another node, for example node 0. This means the memory access will not be local (see the numastat sketch after this list).
  • Reclaim memory on node 3 itself by evicting other pages from its local memory.
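
A rough way to see how often allocations actually spill over to another node is numastat (assuming the numactl package, which provides it, is installed); its numa_miss and numa_foreign counters record non-local allocations per node:

numastat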

The kernel parameter vm.zone_reclaim_mode controls which behaviour is used. If set to 1, the kernel will prefer to evict other pages from local memory.

This is explained in a great deal more detail in this article by Christoph Lameter.

How is this parameter set on your system? You can check by running cat /proc/sys/vm/zone_reclaim_mode

If it’s set to 1 on your Informix system you should definitely read on. You’ll be glad to hear this parameter can be changed dynamically.
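
As a quick sketch (run as root; the right place to persist sysctl settings varies by distribution), you can check the current value, turn the feature off immediately and make the change survive a reboot like this:

cat /proc/sys/vm/zone_reclaim_mode
sysctl -w vm.zone_reclaim_mode=0
echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf   # or a file under /etc/sysctl.d/
sysctl -p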

In the latest kernels (2014 onwards) this commit means the parameter will never be set on your system automatically, but if you’re running an enterprise Linux you could be on a kernel version like 2.6.32 (RHEL 6) where this can still occur: although heavily patched, the base version of this kernel dates from 2009.

I am not sure of the exact criteria that determine when older Linux kernels will switch on this feature at boot. I think you need a modern server with four (or more) processors and a NUMA architecture, but there may be other requirements.

It’s interesting to read the slightly repetitious kernel commit log:

When it was introduced, zone_reclaim_mode made sense as NUMA distances punished and workloads were generally partitioned to fit into a NUMA node. NUMA machines are now common but few of the workloads are NUMA-aware and it’s routine to see major performance degradation due to zone_reclaim_mode being enabled but relatively few can identify the problem.

Those that require zone_reclaim_mode are likely to be able to detect when it needs to be enabled and tune appropriately so lets have a sensible default for the bulk of users.

This patch (of 2):

zone_reclaim_mode causes processes to prefer reclaiming memory from local node instead of spilling over to other nodes. This made sense initially when NUMA machines were almost exclusively HPC and the workload was partitioned into nodes. The NUMA penalties were sufficiently high to justify reclaiming the memory. On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default. Users that are sophisticated enough to know they need zone_reclaim_mode will detect it.

Hopefully now the relevance to Informix is becoming a little clearer. Certainly there has been much complaining in the PostgreSQL community about this parameter. Another frustrated blog post describes some of the massive I/O latency problems it can cause on your system even when under no obvious memory pressure.

On our Informix system, which uses huge pages, we have experienced long disruptive checkpoints as a result of zone reclaiming. As huge pages are not swappable, it’s likely to be our monitoring and other non-Informix processes provoking the zone reclaims.
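
If you want to check how much of each node is tied up in (non-swappable) huge pages, recent kernels expose a per-node meminfo file:

grep -i huge /sys/devices/system/node/node*/meminfo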

The long checkpoint behaviour can be summarised as:

  • A checkpoint is triggered by CKPTINTVL.
  • Informix instructs all threads to finish what they are doing and goes into state CKPT REQ.
  • One or more threads may be in a critical section and must continue to the end of that section before they can stop.
  • If a zone reclaim is occurring, I/O throughput decreases dramatically and such a thread can take many seconds to come out of its critical section.
  • All active threads wait (state C in the first column of onstat -u).
  • Eventually the operation completes, the checkpoint actually occurs very quickly and processing continues.

This behaviour can occur even in later versions of the engine with non-blocking checkpoints.

If you have the mon_checkpoint sysadmin task enabled (I strongly recommend this), information about your checkpoints will be written to sysadmin:mon_checkpoint. (Otherwise you only retain information about the last twenty checkpoints, visible through onstat -g ckp.) A tell-tale sign is a large crit_time, nearly all of the checkpoint duration, with a much smaller flush_time.
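
As an illustration, a query along these lines picks out the checkpoints where the critical section dominated (crit_time and flush_time are the fields mentioned above; check the table's schema on your version for the exact set of columns):

dbaccess sysadmin - <<'EOF'
SELECT FIRST 20 crit_time, flush_time
FROM mon_checkpoint
ORDER BY crit_time DESC;
EOF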

You can get further evidence of whether a zone reclaim might be occurring at the same time by looking at the number of pages scanned per second in the output from sar -B. (sar is a very sophisticated monitoring tool these days with views into many aspects of the operating system.)
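
For example, the following takes a paging sample every five seconds; the pgscank/s and pgscand/s columns count pages scanned by kswapd and by direct reclaim respectively, and pgsteal/s counts pages actually reclaimed:

sar -B 5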

One test you can try (on a test server) is LinkedIn Engineering’s GraphDB simulator. It’s a C++ program that mimics the behaviour of GraphDB and is designed to provoke zone reclaim behaviour from the Linux kernel if it is switched on.

On our test system we can leave it running for hours without zone reclaim enabled and monitor it through sar -B.

10:30:55 AM pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff
10:31:00 AM 951.42 20993.52 8415.59 0.81 1351.62 0.00 0.00 0.00 0.00
10:31:05 AM 294.97 20930.38 8764.59 2.21 3286.92 0.00 0.00 0.00 0.00
10:31:10 AM 170.28 24627.31 4939.16 1.61 1859.64 32276.31 16282.73 565.06 1.16
10:31:15 AM 193.12 77519.03 5379.96 1.42 53762.75 4495.55 0.00 93.72 2.08
10:31:20 AM 240.24 88966.60 6875.45 1.81 1483.30 0.00 0.00 0.00 0.00
10:31:25 AM 183.50 277.67 8113.28 1.61 4045.47 0.00 0.00 0.00 0.00
10:31:30 AM 202.41 280.08 11409.46 2.82 3114.29 0.00 0.00 0.00 0.00
10:31:35 AM 243.37 255.42 8815.46 2.21 1905.62 0.00 0.00 0.00 0.00
10:31:40 AM 92.37 194.38 5890.96 1.00 1059.84 0.00 0.00 0.00 0.00
10:31:45 AM 283.70 313.08 12742.05 2.21 5263.38 0.00 0.00 0.00 0.00
10:31:50 AM 414.83 11179.96 7938.48 2.00 45495.59 39413.23 0.00 784.17 1.99
10:31:55 AM 198.79 31014.95 9007.47 2.63 2374.95 0.00 0.00 0.00 0.00
10:32:00 AM 235.74 25065.86 10159.84 2.61 1866.47 0.00 0.00 0.00 0.00
10:32:05 AM 202.01 37361.45 11010.24 2.01 3250.00 0.00 0.00 0.00 0.00
10:32:10 AM 256.91 5640.48 7596.59 3.01 3638.08 0.00 0.00 0.00 0.00
10:32:15 AM 246.89 20823.65 5411.42 1.80 1704.21 0.00 0.00 0.00 0.00
10:32:20 AM 114.46 41366.27 6625.30 0.80 1352.41 0.00 0.00 0.00 0.00
10:32:25 AM 188.76 20948.19 25422.09 1.81 8850.20 0.00 0.00 0.00 0.00
10:32:30 AM 177.15 29934.67 9358.52 1.60 54522.65 42292.59 4315.83 1071.14 2.30
10:32:35 AM 237.83 9914.69 9167.40 2.21 2483.50 0.00 0.00 0.00 0.00
10:32:40 AM 207.71 81296.55 8555.17 2.64 2631.85 0.00 0.00 0.00 0.00

The test itself reports latencies over 100 ms, and in this mode we occasionally see I/O operations of around 200 ms reported.

We can change the kernel parameter dynamically while the test is running and see the behaviour change almost immediately.
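
For example, enabling it on the fly as root (the same sysctl mechanism as above):

sysctl -w vm.zone_reclaim_mode=1

The effect shows up in the sar -B output within a few samples: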

10:35:15 AM pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff
10:35:20 AM 365.06 15634.14 6300.40 3.41 3841.57 0.00 15241.77 2644.18 17.35
10:35:25 AM 333.06 5519.35 9262.10 3.43 8639.31 0.00 92890.32 4528.63 4.88
10:35:30 AM 1158.15 20868.81 10292.96 10.06 12215.09 0.00 255137.22 7858.55 3.08
10:35:35 AM 781.12 41385.54 7742.77 5.02 5841.16 0.00 34506.02 3422.89 9.92
10:35:40 AM 518.10 8764.47 2524.85 3.25 2906.59 0.00 1703326.11 2016.93 0.12
10:35:52 AM 2576.57 39524.85 13449.49 11.31 10332.12 0.00 1153144.24 4256.77 0.37
10:35:57 AM 2707.22 40786.31 7962.55 8.17 9893.92 0.00 4246095.82 6729.66 0.16
10:36:02 AM 1600.75 1889.37 2551.12 4.34 629.04 0.00 3595585.63 253.52 0.01
10:36:16 AM 756.94 39362.58 2063.18 8.25 3785.71 0.00 4238635.01 1814.29 0.04
10:36:21 AM 990.94 9277.31 1584.26 6.24 1692.88 0.00 6222810.91 833.73 0.01
10:36:52 AM 69.73 0.00 116.91 0.96 271.29 0.00 2056531.75 7.20 0.00

The number of pages scanned directly per second (the pgscand/s column) escalates dramatically.

Meanwhile I/O latencies reported by the test program escalate up to 36000 ms. We actually have to kill the test program within 30 seconds of changing the kernel parameter to avoid the system becoming so unresponsive it cannot maintain sshd connections.

In our real-world Informix example we are not using the page cache anything like as aggressively, and when the problem occurs I/O demand reduces as we get down to a single thread in its critical section. Thus we don’t see pages scanned at the rate seen in the test, just a clear increase.

It’s worth mentioning that new NUMA capabilities were added to the Linux kernel in version 3.8 (with more in 3.13), so RHEL 7 users might see slightly different behaviour.
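
One of these capabilities is automatic NUMA balancing; on kernels that include it you can check whether it is active with:

cat /proc/sys/kernel/numa_balancing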


2 Comments on “Zone reclaim mode”

  1. Nick Perry says:

    Excellent article Ben and interesting background reading in the links. Glad you’ve got it sorted. I wasn’t aware of this potential problem.

  2. Phil says:

    Thanks for properly explaining who is affected and mentioning the small subset of applications that do need it. Other sources make it less clear that latency-sensitive simulations and NUMA-aware codes may need this setting.

