Host Memory Oversubscription
playbooks/bootstrap/site.yml applies a dedicated
lab_host_memory_oversubscription role immediately after
lab_host_resource_management.
That split is intentional:
- lab_host_resource_management defines the CPU pools and Gold/Silver/Bronze placement model
- lab_host_memory_oversubscription improves host RAM efficiency through three independent kernel mechanisms: zram compressed swap, Transparent Huge Pages, and Kernel Same-page Merging
This is not treated as fake RAM or as an excuse to reduce master or infra sizing. The goal is to reclaim duplicate and cold guest memory on a host that already showed low steady-state memory utilization with a fully deployed lab.
Warning
KSM and zram are host-kernel work, not Gold/Silver/Bronze work. The tiers still separate guest contention, but reclaim and compression can still steal CPU from the broader host pool unless you add more host-thread affinity controls later.
Source Of Truth
The orchestration source of truth for memory policy is
vars/global/host_memory_oversubscription.yml.
Current defaults:
| Subsystem | Setting | Value | Purpose |
|---|---|---|---|
| zram | enabled | true | activate compressed swap device |
| zram | device_name | zram0 | kernel device node |
| zram | size | 16G | maximum uncompressed capacity of the device |
| zram | compression_algorithm | zstd | best ratio-to-speed tradeoff on modern kernels |
| zram | swap_priority | 100 | ensures zram is preferred over any physical swap |
| zram.writeback | enabled | false | optional advanced override for a dedicated writeback device |
| zram.writeback | manage_lvm | false | create and manage a dedicated LV for writeback |
| zram.writeback | volume_group | calabi_lab_vg | VG used when manage_lvm is enabled |
| zram.writeback | logical_volume | zram-writeback | LV name used when manage_lvm is enabled |
| zram.writeback | size | 32G | dedicated writeback LV size for the managed-LVM path |
| zram.writeback | backing_device | "" | dedicated block device used only when writeback is enabled |
| zram.writeback.policy | enabled | false | run periodic writeback from a systemd timer |
| zram.writeback.policy | interval | 30m | timer cadence for each writeback pass |
| zram.writeback.policy | mode | huge | writes back pages that did not compress well |
| zram.writeback.policy | idle_age_seconds | 21600 | age threshold only for idle or huge_idle modes |
| zram.writeback.policy | per_run_budget_mib | 256 | writeback budget applied before each timer run |
| THP | mode | madvise | application-controlled huge page allocation |
| THP | defrag_mode | madvise | application-controlled compaction |
| KSM | run | 1 | scanner active |
| KSM | pages_to_scan | 1000 | pages examined per scan cycle |
| KSM | sleep_millisecs | 20 | pause between scan cycles |
The role defaults in roles/lab_host_memory_oversubscription/defaults/main.yml
set everything to disabled. The global vars file overrides those defaults to
enable the policy. This ensures that the role is always safe to include
and only activates when explicitly configured.
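The defaults-then-override layering can be pictured as a recursive dictionary merge. The sketch below is illustrative Python, not the role's actual mechanism: this document does not show how Ansible combines the two files, and the `deep_merge` helper and the trimmed-down dictionaries are hypothetical stand-ins.

```python
from copy import deepcopy


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Return a new dict in which nested keys from overrides win over defaults."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical, trimmed-down stand-ins for the role defaults (everything off)
# and the global vars file (policy switched on).
role_defaults = {
    "zram": {"enabled": False, "writeback": {"enabled": False}},
    "ksm": {"run": 0},
}
global_vars = {
    "zram": {"enabled": True, "size": "16G"},
    "ksm": {"run": 1, "pages_to_scan": 1000},
}

effective = deep_merge(role_defaults, global_vars)
print(effective)
```

Keys the override never mentions, such as the writeback block here, keep their disabled default, which is what makes the role safe to include unconditionally.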
How The Policy Is Applied
A single systemd oneshot service,
calabi-host-memory-oversubscription.service, applies all three subsystems at
boot. It uses RemainAfterExit=yes so systemd tracks the policy as active
for the lifetime of the host.
The service lifecycle for zram:
- tear down any existing zram device (swapoff, zramctl --reset, modprobe -r)
- load the zram module with num_devices=1
- configure the compression algorithm
- optionally attach a dedicated writeback backing device before initialization
- set the zram disk size
- format and activate: mkswap, swapon --priority 100 --discard
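The ordering of those steps matters, in particular that the backing device is attached before the disk size is set. The following Python sketch only builds the ordered command list and executes nothing. The function name `zram_setup_plan` and the best-effort `|| true` on the teardown steps are assumptions for illustration; the service's real ExecStart lines are not shown in this document.

```python
def zram_setup_plan(device="zram0", size="16G", algorithm="zstd",
                    priority=100, backing_device=""):
    """Return the ordered shell steps for the zram lifecycle (not executed)."""
    dev = f"/dev/{device}"
    sysfs = f"/sys/block/{device}"
    steps = [
        # Teardown is treated as best-effort here (assumption): the device
        # may not exist on first boot.
        f"swapoff {dev} || true",
        f"zramctl --reset {dev} || true",
        "modprobe -r zram || true",
        "modprobe zram num_devices=1",
        f"echo {algorithm} > {sysfs}/comp_algorithm",
    ]
    if backing_device:
        # backing_dev has to be written before disksize initializes the device.
        steps.append(f"echo {backing_device} > {sysfs}/backing_dev")
    steps += [
        f"echo {size} > {sysfs}/disksize",
        f"mkswap {dev}",
        f"swapon --priority {priority} --discard {dev}",
    ]
    return steps


for step in zram_setup_plan(backing_device="/dev/calabi_lab_vg/zram-writeback"):
    print(step)
```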
THP and KSM are applied in a follow-on ExecStart that writes directly to
/sys/kernel/mm/transparent_hugepage/ and /sys/kernel/mm/ksm/.
A separate dedicated playbook, playbooks/bootstrap/host-memory-oversubscription.yml,
can apply or re-apply the memory policy independently without re-running the
full bootstrap sequence. This is the intended entry point for the Calabi Manager
"Host Memory Oversubscription" change scope.
zram Compressed Swap
zram creates an in-memory block device that stores pages in compressed form. On write, the kernel compresses the page into zram; on read, it decompresses on the fly. The net effect is that cold anonymous pages that would otherwise consume full-size RAM frames are stored at a fraction of their original size.
The 16G size is the maximum uncompressed capacity of the device, not a
reservation. zram only consumes real RAM as pages are written into it. With
zstd compression the typical effective ratio on guest workloads is between
2:1 and 4:1, so 16G of logical swap capacity might cost 4-8G of physical RAM
when fully utilized.
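The arithmetic behind that range is a single division. A minimal sketch, with a hypothetical helper name and an optional fill fraction added for illustration:

```python
def zram_physical_cost_gib(logical_gib: float, ratio: float,
                           fill_fraction: float = 1.0) -> float:
    """Estimate the RAM consumed by a zram device.

    logical_gib   -- configured disksize (uncompressed capacity)
    ratio         -- compression ratio, e.g. 3.0 for 3:1
    fill_fraction -- how full the device is, between 0.0 and 1.0
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    if not 0.0 <= fill_fraction <= 1.0:
        raise ValueError("fill_fraction must be between 0 and 1")
    return logical_gib * fill_fraction / ratio


for ratio in (2.0, 3.0, 4.0):
    cost = zram_physical_cost_gib(16, ratio)
    print(f"{ratio:.0f}:1 -> {cost:.1f} GiB of RAM for a full 16G device")
```

An empty device costs nothing under this model, which matches the point that the configured size is a ceiling rather than a reservation.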
Important
16G is a buffer, not a carved-out capacity loss. The host pays the
physical cost only when memory pressure actually drives pages into swap.
At steady state with low contention, zram consumes negligible real memory.
The swap priority of 100 ensures zram is always preferred over any physical
swap device. The --discard flag enables TRIM so that freed pages are
immediately released back to the host rather than lingering as stale compressed
blocks.
An optional advanced override can also attach a dedicated block device through
/sys/block/zram0/backing_dev before zram is initialized. This is disabled by
default and intended for hosts that deliberately repurpose a device for zram
writeback experimentation or cold-page pressure relief.
For on-prem hosts, the override can manage that device as a dedicated logical
volume in calabi_lab_vg. The shipped managed-LVM defaults use a 32G
zram-writeback LV, which is a reasonable starting point for the current
~128 GiB host class because it adds a modest cold-page spill tier without
pretending to create planned RAM capacity.
Applicability:
- The zram writeback capability itself is not on-prem-only. The role can attach any dedicated block device when manage_lvm: false and backing_device is set explicitly.
- The managed-LVM convenience path is effectively on-prem-specific in this repo because it assumes the local calabi_lab_vg layout used by the on-prem deployment flow.
- The repo currently ships a ready-to-use on-prem example override for this capability. It does not ship an AWS-target override or AWS-specific backing-device provisioning for writeback.
Warning
Enabling a writeback backing device does not turn disk into planned memory capacity. It is an emergency pressure-relief tier and should not be counted in steady-state cluster sizing.
Note
The current role only attaches the backing device when the override is enabled. Periodic writeback is a separate opt-in policy block and remains off by default.
Warning
The writeback backing device must be dedicated. Do not point the override at a mounted filesystem or an active swap device. The role explicitly fails if the configured device is already mounted or active as swap.
Warning
The managed-LVM path creates the writeback LV when requested, but it does not resize an existing writeback LV in place. If the LV already exists at a different size, the role fails fast rather than mutating storage automatically.
Example override shape:
    lab_host_memory_oversubscription_settings:
      zram:
        writeback:
          enabled: true
          manage_lvm: true
          volume_group: calabi_lab_vg
          logical_volume: zram-writeback
          size: 32G
          policy:
            enabled: true
            interval: 30m
            mode: huge
            per_run_budget_mib: 256

If you prefer to point at an already-provisioned block device instead, leave
manage_lvm: false and set backing_device directly.
The managed LV or explicit device must be dedicated to zram writeback. Do not point the override at a mounted filesystem or an active swap device.
For the current on-prem deployment, the shipped example is:
Periodic Writeback Policy
When zram.writeback.policy.enabled is true, the role installs:
- calabi-zram-writeback-policy.service
- calabi-zram-writeback-policy.timer
The timer triggers a small writeback pass at the configured interval. Before
each run, the helper script applies the configured per-run budget through
writeback_limit so the host does not dump an unlimited amount of data to the
backing device in one burst.
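The kernel expresses writeback_limit in 4 KiB units, so a budget given in MiB has to be converted before it is written to sysfs. The sketch below shows that conversion and the order of sysfs writes for one pass. It is an illustration under assumptions: the helper script itself is not shown in this document, both function names are hypothetical, and the extra step of marking pages idle that the idle-based modes require is left out.

```python
WRITEBACK_UNIT_BYTES = 4096  # writeback_limit is counted in 4 KiB units


def writeback_limit_units(budget_mib: int) -> int:
    """Convert a per-run budget in MiB into writeback_limit units."""
    if budget_mib < 0:
        raise ValueError("budget must not be negative")
    return budget_mib * 1024 * 1024 // WRITEBACK_UNIT_BYTES


def writeback_pass_steps(device="zram0", mode="huge", budget_mib=256):
    """Return the ordered sysfs writes for one budgeted pass (not executed)."""
    sysfs = f"/sys/block/{device}"
    return [
        f"echo 1 > {sysfs}/writeback_limit_enable",
        f"echo {writeback_limit_units(budget_mib)} > {sysfs}/writeback_limit",
        f"echo {mode} > {sysfs}/writeback",
    ]


for step in writeback_pass_steps():
    print(step)
```

Setting the limit before triggering writeback is what bounds each timer run, regardless of how many candidate pages the device holds.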
Recommended starting point for the current ~128 GiB host class:
- mode: huge
- interval: 30m
- per-run budget: 256 MiB
That mode is intentional. The current kernel on the on-prem host exposes zram
writeback support, but age-based idle tracking may not be available on every
target kernel. The role therefore only allows idle or huge_idle modes when
CONFIG_ZRAM_TRACK_ENTRY_ACTIME=y is present on the running kernel.
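A gate of that kind could look like the following sketch. It assumes the kernel configuration is available as text (for example from a file such as /boot/config-<release>, which is an assumption about the host, not something this document states), and the function names are hypothetical rather than taken from the role.

```python
IDLE_MODES = {"idle", "huge_idle"}
TRACKING_OPTION = "CONFIG_ZRAM_TRACK_ENTRY_ACTIME"


def kernel_has_option(config_text: str, option: str) -> bool:
    """True if the kernel config text contains `option=y` on its own line."""
    return any(line.strip() == f"{option}=y" for line in config_text.splitlines())


def validate_writeback_mode(mode: str, config_text: str) -> None:
    """Raise ValueError if an idle-based mode is requested without tracking."""
    if mode in IDLE_MODES and not kernel_has_option(config_text, TRACKING_OPTION):
        raise ValueError(
            f"mode {mode!r} needs {TRACKING_OPTION}=y on the running kernel"
        )


# Hypothetical config excerpt from a kernel built without idle-age tracking.
sample_config = "CONFIG_ZRAM=m\n# CONFIG_ZRAM_TRACK_ENTRY_ACTIME is not set\n"

validate_writeback_mode("huge", sample_config)  # accepted
try:
    validate_writeback_mode("huge_idle", sample_config)
except ValueError as err:
    print(f"rejected: {err}")
```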
Treat this timer as pressure relief, not planned memory inventory.
Additional caveats:
- The timer only becomes useful after a backing device is attached. Enabling the policy block without a working writeback device is not a valid configuration.
- The default mode, huge, writes back pages that zstd could not compress well (stored at near-original size in RAM). This reclaims the physical memory those pages consume and pushes them to the backing device where the storage cost is negligible. For broader cold-page relief, use huge_idle with an appropriate idle_age_seconds threshold, which also sweeps idle compressible pages.
- The timer applies a per-run writeback_limit budget to reduce the chance of sudden I/O bursts, but heavy writeback can still add latency to the backing storage tier under memory pressure.
- On hosts that still keep a separate physical swap device enabled, zram writeback does not remove that device from the reclaim path automatically. Disk-backed swap remains a separate policy decision.
Two-Tier Swap Model
When a writeback backing device is attached and the policy timer is active, the zram device operates as a two-tier swap system:
| Tier | Backing | Physical cost | Latency | Contents |
|---|---|---|---|---|
| RAM tier | host physical memory | compressed size (typ. 2-4x reduction) | nanoseconds | pages that compressed well |
| Backing tier | dedicated block device | zero RAM | microseconds (NVMe) to milliseconds (SSD) | pages that did not compress well, pushed by the writeback policy |
The 16G zram disksize caps total uncompressed data across both tiers. When
the device is 92% full, that means 92% of the 16 GiB logical capacity is
occupied -- but the physical RAM cost depends on how much was compressed in the
RAM tier vs. how much was offloaded to the backing tier.
Example from a running 10-VM lab on a 125 GiB host:
    Total swap data: 14.7 GiB uncompressed (92% of 16 GiB cap)
    ├── RAM tier: 4.0 GiB physical (compressed from ~12.9 GiB)
    │   └── Compression: 3.2:1 ratio, saving ~8.9 GiB
    └── Backing tier: 1.8 GiB on NVMe (zero RAM cost)
        └── Headroom: 30.2 GiB of 32 GiB LV remaining

    Effective RAM saved: 8.9 GiB (compression) + 1.8 GiB (writeback) = 10.7 GiB

Without the backing tier, those 1.8 GiB of incompressible pages would consume ~1.8 GiB of physical RAM (stored at near-original size in the zram device).
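One consistent way to do this bookkeeping is to treat RAM avoided as the total logical swap data minus the physical RAM the RAM tier still occupies. The sketch below applies that to the three measured inputs from the example (12.9 GiB held compressed, 4.0 GiB physical, 1.8 GiB on the backing device). The `ZramTiers` class is hypothetical and is not the Cockpit observer's implementation.

```python
from dataclasses import dataclass


@dataclass
class ZramTiers:
    ram_tier_uncompressed_gib: float  # logical data held compressed in RAM
    ram_tier_physical_gib: float      # RAM actually consumed by that data
    backing_tier_gib: float           # data offloaded to the backing device
    disksize_gib: float = 16.0        # logical cap across both tiers

    @property
    def total_uncompressed_gib(self) -> float:
        return self.ram_tier_uncompressed_gib + self.backing_tier_gib

    @property
    def occupancy(self) -> float:
        return self.total_uncompressed_gib / self.disksize_gib

    @property
    def compression_ratio(self) -> float:
        return self.ram_tier_uncompressed_gib / self.ram_tier_physical_gib

    @property
    def compression_saved_gib(self) -> float:
        return self.ram_tier_uncompressed_gib - self.ram_tier_physical_gib

    @property
    def ram_avoided_gib(self) -> float:
        # Everything stored logically, minus what still costs physical RAM.
        return self.total_uncompressed_gib - self.ram_tier_physical_gib


tiers = ZramTiers(12.9, 4.0, 1.8)
print(f"occupancy:         {tiers.occupancy:.0%}")
print(f"compression ratio: {tiers.compression_ratio:.1f}:1")
print(f"compression saved: {tiers.compression_saved_gib:.1f} GiB")
print(f"RAM avoided:       {tiers.ram_avoided_gib:.1f} GiB")
```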
The Cockpit observer's zram panel reflects this model:
- The Swap Capacity card shows separate RAM-tier and backing-tier occupancy meters
- The Utilization Mix bar shows four segments: RAM cost, compression avoided, writeback offloaded, and free logical capacity
- The RAM Avoided card splits its total into compression savings and writeback offload
- The memory overview's pressure scoring reduces the zram-utilization penalty when a backing store is active with headroom, since 92% zram occupancy with a working writeback tier is not the same threat as 92% without one
Transparent Huge Pages
THP allows the kernel to back anonymous memory with 2 MiB pages instead of the default 4 KiB pages. Each TLB entry then covers 512 times more memory and fewer page-table entries are needed, which lowers TLB miss rates and gives a measurable throughput improvement for memory-intensive workloads.
The policy sets THP to madvise, not always:
- madvise means the kernel only allocates huge pages when the application explicitly requests them via madvise(MADV_HUGEPAGE). QEMU and the JVM both do this when configured to.
- always would apply huge pages to every anonymous mapping. That can trigger aggressive background compaction and allocation stalls that are worse than the TLB improvement.
The defrag mode is also set to madvise for the same reason: compaction
only runs when an application has signaled that it wants huge pages. This
avoids the pathological case where khugepaged burns CPU compacting memory
that no process actually benefits from.
Note
For this workload, madvise is the conservative and correct default. The
guest kernels inside each VM make their own independent THP decisions. The
host-level setting controls the outer hypervisor kernel behavior only.
Kernel Same-page Merging
KSM is a kernel thread (ksmd) that scans anonymous pages in memory regions
that processes have marked as mergeable, which QEMU does for guest RAM by
default, looking for byte-identical content. When it finds duplicates, it
merges them into a single copy-on-write page, freeing the redundant frames.
This is especially effective in a nested virtualization environment where
multiple guests run identical operating system images. The RHEL CoreOS nodes
(ocp-master, ocp-infra, ocp-worker) share a large fraction of their
kernel and base-OS memory footprint. KSM finds and deduplicates those pages
without any guest-side configuration.
Current scan settings:
- pages_to_scan = 1000: examine 1000 pages per scan cycle
- sleep_millisecs = 20: pause 20 ms between cycles
These are deliberately conservative. Aggressive settings (higher page count, shorter sleep) merge faster but consume more host CPU. The current values prioritize low steady-state CPU overhead over fast initial convergence.
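To get a feel for what these two values imply, the sketch below computes the upper bound on scan throughput and the resulting lower bound on the time for one pass over a given amount of mergeable memory. It assumes 4 KiB base pages and ignores the time ksmd spends scanning each batch, so real passes take longer; the function names are hypothetical.

```python
PAGE_SIZE_BYTES = 4096  # assumes 4 KiB base pages


def ksm_max_pages_per_second(pages_to_scan: int, sleep_millisecs: int) -> float:
    """Upper bound on scan rate: one batch per sleep interval."""
    if sleep_millisecs <= 0:
        raise ValueError("sleep_millisecs must be positive")
    return pages_to_scan * 1000.0 / sleep_millisecs


def ksm_min_full_pass_seconds(mergeable_gib: float, pages_to_scan: int = 1000,
                              sleep_millisecs: int = 20) -> float:
    """Lower bound on the time for one pass over mergeable_gib of memory."""
    total_pages = mergeable_gib * 1024 ** 3 / PAGE_SIZE_BYTES
    return total_pages / ksm_max_pages_per_second(pages_to_scan, sleep_millisecs)


rate = ksm_max_pages_per_second(1000, 20)
print(f"at most {rate:.0f} pages/s "
      f"({rate * PAGE_SIZE_BYTES / 2 ** 20:.0f} MiB/s)")
for gib in (32, 64, 100):
    minutes = ksm_min_full_pass_seconds(gib) / 60
    print(f"{gib} GiB mergeable: at least {minutes:.1f} min per pass")
```

Doubling pages_to_scan or halving sleep_millisecs halves the bound, at the price of proportionally more ksmd CPU time.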
KSM convergence behavior:
- First scan pass: slow. The scanner must build its internal red-black tree of page checksums across all guest memory. On a fully deployed lab this can take minutes to hours depending on total guest memory.
- Steady state: cheap. Once the initial tree is built, incremental scans only process new or changed pages. CPU cost drops to near zero when guest memory is stable.
- After guest reboot or migration: the scanner re-examines changed pages. A full cluster reboot temporarily increases KSM CPU usage until the new steady state is reached.
Note
The policy is most valuable in low-to-medium contention: it gives the kernel a cheaper way to reclaim duplicate memory before direct reclaim gets expensive. It is not meant to rescue sustained high contention.
Why Bronze Is The Elastic Tier
The Bronze domain is already the least latency-sensitive part of the guest estate:
- ocp-worker-01..03
- bastion-01
- mirror-registry
- ad-01
That makes Bronze the correct place to absorb most elasticity pressure before touching masters or infra. The intended sizing policy is:
- keep masters stable
- keep infra stable
- use workers as the first expansion or contraction lever
CPU Placement Caveat
KSM and reclaim activity are host/kernel work, not guest-tier work.
So while Gold/Silver/Bronze still model guest-vs-guest contention, enabling
zram, KSM, and THP does not automatically pin those host-kernel threads into
one guest tier. The current role improves memory efficiency without claiming
that all reclaim and merge work is strictly isolated inside host_reserved.
Operational Validation
After bootstrap, validate the memory policy from the host:
    # Service state
    systemctl is-enabled calabi-host-memory-oversubscription.service
    systemctl is-active calabi-host-memory-oversubscription.service

    # zram device
    zramctl
    swapon --show
    cat /sys/block/zram0/backing_dev
    systemctl is-enabled calabi-zram-writeback-policy.timer
    systemctl is-active calabi-zram-writeback-policy.timer
    cat /sys/block/zram0/bd_stat

    # THP mode
    cat /sys/kernel/mm/transparent_hugepage/enabled
    cat /sys/kernel/mm/transparent_hugepage/defrag

    # KSM state
    cat /sys/kernel/mm/ksm/run
    cat /sys/kernel/mm/ksm/pages_to_scan
    cat /sys/kernel/mm/ksm/sleep_millisecs

    # KSM effectiveness
    cat /sys/kernel/mm/ksm/pages_shared
    cat /sys/kernel/mm/ksm/pages_sharing
    cat /sys/kernel/mm/ksm/pages_unshared

Expected current behavior:
- service is enabled and active
- zramctl shows /dev/zram0 with zstd algorithm and 16G disk size
- swapon shows /dev/zram0 at priority 100
- backing_dev matches the configured override when writeback is enabled
- the writeback timer is enabled and active when policy is enabled
- bd_stat shows nonzero bd_writes when the writeback policy has run at least once (zero bd_writes with a configured backing device means the policy is not writing back; check that the mode is huge, not a mode the kernel silently ignores)
- THP enabled shows [madvise] (bracketed = active selection)
- THP defrag shows [madvise]
- KSM run is 1
- pages_shared and pages_sharing grow over time as guests stabilize
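Two of these checks involve small parsing steps: picking the bracketed value out of the THP files and reading bd_stat, whose three counters (bd_count, bd_reads, bd_writes) are reported in 4 KiB units. A minimal sketch with hypothetical helper names and made-up sample values:

```python
import re


def active_sysfs_choice(text: str) -> str:
    """Return the bracketed (active) option from a sysfs multi-choice file."""
    match = re.search(r"\[([^\]]+)\]", text)
    if match is None:
        raise ValueError(f"no bracketed selection in {text!r}")
    return match.group(1)


def parse_bd_stat(text: str) -> dict:
    """Convert the three bd_stat counters from 4 KiB units to MiB."""
    bd_count, bd_reads, bd_writes = (int(field) for field in text.split()[:3])
    kib_per_unit = 4
    return {
        "bd_count_mib": bd_count * kib_per_unit / 1024,
        "bd_reads_mib": bd_reads * kib_per_unit / 1024,
        "bd_writes_mib": bd_writes * kib_per_unit / 1024,
    }


# Sample strings are illustrative, not captured from the lab host.
print(active_sysfs_choice("always [madvise] never"))
print(active_sysfs_choice("always defer defer+madvise [madvise] never"))
print(parse_bd_stat("512 0 512"))
```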
The project includes a monitoring script for continuous observation:
    scripts/host-memory-overcommit-status.py --host <virt-01-ip> --user ec2-user

This queries zram usage, KSM deduplication savings, per-guest memory
allocation, and tier-level totals. Use --watch 30 for a live refresh or
--delta 60 to capture a before-and-after snapshot across an interval.
Signals That Memory Policy Needs Adjustment
zram under-sized:
- zramctl shows the device near its configured size limit
- swap utilization stays persistently high with poor compression ratio
- consider increasing size or investigating which guest is driving pressure
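The same two signals can be read from /sys/block/zram0/mm_stat, whose first three fields are orig_data_size, compr_data_size, and mem_used_total in bytes. The sketch below is a rough check under assumptions: the function name, the 90% utilization threshold, and the 2:1 ratio threshold are hypothetical choices, the sample line is made up, and how written-back pages are reflected in these counters is not addressed here.

```python
def zram_pressure_summary(mm_stat_text: str, disksize_bytes: int,
                          max_utilization: float = 0.90,
                          min_ratio: float = 2.0) -> dict:
    """Summarize zram fill level and compression from an mm_stat line."""
    fields = [int(value) for value in mm_stat_text.split()]
    orig_data_size, _compr_data_size, mem_used_total = fields[:3]
    utilization = orig_data_size / disksize_bytes
    # Ratio against total memory used, which includes allocator overhead.
    ratio = orig_data_size / mem_used_total if mem_used_total else None
    looks_undersized = (
        utilization >= max_utilization
        and ratio is not None
        and ratio < min_ratio
    )
    return {
        "utilization": utilization,
        "ratio": ratio,
        "looks_undersized": looks_undersized,
    }


GIB = 1024 ** 3
# Illustrative line: 15 GiB stored, 9 GiB compressed, 10 GiB of RAM used.
sample = f"{15 * GIB} {9 * GIB} {10 * GIB} 0 {10 * GIB} 0 0 0 0"
print(zram_pressure_summary(sample, disksize_bytes=16 * GIB))
```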
zram over-sized:
- the device rarely holds more than a few hundred MiB
- the host never enters memory pressure
- not harmful, but the 16G buffer is idle weight in the config
KSM scan rate too conservative:
- pages_unshared remains high relative to pages_sharing for extended periods after guest deployment
- initial convergence takes unreasonably long
- consider increasing pages_to_scan to 2000-4000 and observing CPU impact
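The three KSM counters can be reduced to a few comparable numbers. In the sketch below, pages_sharing times the page size approximates the memory saved, pages_sharing divided by pages_shared indicates how well sharing is working, and pages_unshared divided by pages_sharing indicates scan effort that is not paying off. The helper name and the sample counter values are hypothetical, and 4 KiB pages are assumed.

```python
def ksm_summary(pages_shared: int, pages_sharing: int, pages_unshared: int,
                page_size: int = 4096) -> dict:
    """Derive savings and effectiveness ratios from the KSM sysfs counters."""
    saved_mib = pages_sharing * page_size / (1024 * 1024)
    sharing_ratio = pages_sharing / pages_shared if pages_shared else 0.0
    if pages_sharing:
        unshared_to_sharing = pages_unshared / pages_sharing
    else:
        unshared_to_sharing = float("inf") if pages_unshared else 0.0
    return {
        "saved_mib": saved_mib,
        "sharing_ratio": sharing_ratio,
        "unshared_to_sharing": unshared_to_sharing,
    }


# Illustrative counter values, not measurements from the lab host.
summary = ksm_summary(pages_shared=100_000, pages_sharing=900_000,
                      pages_unshared=300_000)
print(summary)
```

Tracking these values before and after a pages_to_scan change gives a direct view of whether the extra scan rate buys additional merged pages.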
KSM scan rate too aggressive:
- ksmd appears in top consuming visible CPU during steady state
- host-side CPU pressure appears on the reserved pool
- reduce pages_to_scan or increase sleep_millisecs
THP causing compaction pressure:
khugepagedorkcompactdconsuming persistent CPU- this is unlikely with
madvisemode but can appear if guest kernels aggressively request huge pages via virtio-balloon or similar - switching to
neveris a safe fallback that disables THP entirely