Teaching Reference

Architecture

CPU, performance-domain, and host-sizing rationale.

Host Resource Management

This document explains why virt-01 is sized and scheduled the way it is, with particular attention to how it behaves once the host is busy.

The root model comes from Greg Procunier's cgroup-tiering thesis for OpenStack:

https://github.com/gprocunier/openstack-cgroup-tiering

This is the single-host version of that idea. The OpenStack-specific scheduler and placement pieces are gone, but the core idea is still the same:

keep host and guest execution domains distinct
let guests compete in weighted performance domains instead of a flat pool
preserve idle borrowing while making degradation behavior intentional

So the implementation here is best read as a fixed-workload, single-host adaptation of the original cgroup-tiering model, not a separate design that started from zero.

The practical goal:

keep ordinary host-side work away from guest vCPU execution as much as practical
let all guest vCPU threads compete in one shared execution pool
make contention behavior intentional by placing guests into Gold, Silver, and Bronze performance domains instead of a flat scheduler pool

The implementation is built from standard Linux, systemd, libvirt, and virt-install controls. The unusual part is the composition of those mundane controls into a single scheduling policy.

Current Implemented State

The current codebase intentionally keeps the guest-side policy stronger than the host-side policy.

Implemented:

machine-gold.slice, machine-silver.slice, and machine-bronze.slice
manager-level systemd CPUAffinity= for host-managed services
libvirt partition placement into the tier slices
libvirt shares
libvirt vcpupin
libvirt emulatorpin
libvirt iothreadpin for selected guests
a dedicated host memory oversubscription role that applies:
- zram
- KSM
- THP madvise

Intentionally not enabled by default:

kernel boot arguments:
- systemd.cpu_affinity=...
- irqaffinity=...
IRQBALANCE_BANNED_CPULIST=...

Note

These stronger host-isolation controls were trialed and backed out. They made host-side admin workflows (SSH, Cockpit, OVS diagnostics) less predictable under load. The settled design keeps VM tiering but avoids over-constraining the entire host control plane.

Host Memory Oversubscription

A separate lab_host_memory_oversubscription role applies zram compressed swap, Transparent Huge Pages (THP) in madvise mode, and Kernel Samepage Merging (KSM) to improve host RAM efficiency. This is documented in its own file:

HOST MEMORY OVERSUBSCRIPTION

Design Summary

flowchart TB
  host[virt-01 host]
  host --> reserved["Host reserved CPUs\nsystemd-managed host work"]
  host --> guest["Guest execution CPUs\nshared VM vCPU pool"]
  reserved --> housekeeping["Housekeeping subset\nOVS, libvirt daemons,\ngeneral host services"]
  reserved --> emulator["Emulator subset\nQEMU emulator threads\nand selected IOThreads"]
  guest --> gold["Gold domain\nmasters"]
  guest --> silver["Silver domain\ninfra plus IdM"]
  guest --> bronze["Bronze domain\nworkers plus bastion plus mirror"]

At runtime:

host services are constrained by manager CPUAffinity to the reserved pool
guest vCPU threads are pinned to the guest execution pool
QEMU emulator threads are pinned to the emulator subset of the reserved pool
systemd tier slices provide relative weight when tier siblings contend

Hardware Topology

virt-01 currently runs on AWS m5.metal:

2 sockets / 2 NUMA nodes
24 physical cores per socket
2 SMT threads per core
96 logical CPUs total

The CPU numbering is regular:

socket 0 primary threads: 0-23
socket 0 SMT siblings: 48-71
socket 1 primary threads: 24-47
socket 1 SMT siblings: 72-95

Examples:

core pair 0,48
core pair 5,53
core pair 24,72
core pair 47,95

That regularity matters because the policy always reserves or allocates whole SMT core pairs, never isolated sibling threads.
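The numbering above can be expressed as simple arithmetic: on this host a primary thread and its SMT sibling differ by 48. The following is a minimal sketch under that assumption. The helper names are illustrative and do not come from the repo. On any other host, read `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list` instead of assuming a fixed offset.

```python
# Sketch of the m5.metal numbering described above. The constants mirror
# the topology list; the helper names are illustrative only.
SOCKETS = 2
CORES_PER_SOCKET = 24
PRIMARY_THREADS = SOCKETS * CORES_PER_SOCKET  # primary threads are CPUs 0-47


def smt_sibling(cpu: int) -> int:
    """Return the SMT sibling of a logical CPU under the regular layout."""
    if not 0 <= cpu < 2 * PRIMARY_THREADS:
        raise ValueError(f"CPU {cpu} is outside 0-{2 * PRIMARY_THREADS - 1}")
    if cpu < PRIMARY_THREADS:
        return cpu + PRIMARY_THREADS
    return cpu - PRIMARY_THREADS


def socket_of(cpu: int) -> int:
    """Return the socket (0 or 1) that owns a logical CPU."""
    return (cpu % PRIMARY_THREADS) // CORES_PER_SOCKET


def core_pair(cpu: int) -> tuple[int, int]:
    """Return (primary, sibling) for the physical core containing cpu."""
    primary = cpu % PRIMARY_THREADS
    return primary, primary + PRIMARY_THREADS


for cpu in (0, 53, 24, 95):
    print(cpu, "-> core pair", core_pair(cpu), "on socket", socket_of(cpu))
```

Reserving a core then means taking both members of `core_pair(cpu)` together, which is the whole-core rule stated above.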

Current CPU Pools

The orchestration source of truth is vars/global/host_resource_management.yml.

Current pool definitions:

host_reserved: 0-5,24-29,48-53,72-77
host_housekeeping: 0-1,24-25,48-49,72-73
host_emulator: 2-5,26-29,50-53,74-77
guest_domain: 6-23,30-47,54-71,78-95

Pool meaning:

Pool	CPUs	Physical cores	Purpose
`host_reserved`	`0-5,24-29,48-53,72-77`	12	effective host pool exposed to manager `CPUAffinity`
`host_housekeeping`	`0-1,24-25,48-49,72-73`	4	host networking, OVS, libvirt daemons, general userspace host work
`host_emulator`	`2-5,26-29,50-53,74-77`	8	QEMU emulator threads and explicit IOThreads
`guest_domain`	`6-23,30-47,54-71,78-95`	36	shared guest vCPU execution pool

This preserves:

6 physical host cores per socket
18 physical guest cores per socket

Or, across the full host:

12 physical / 24 logical CPUs for host work
36 physical / 72 logical CPUs for guest vCPU execution
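The pool arithmetic can be checked mechanically. The sketch below expands the four cpulist strings and verifies the properties this section relies on: the reserved pool is exactly housekeeping plus emulator, host and guest pools do not overlap, all 96 CPUs are assigned, and every pool consists of whole SMT pairs. The `+48` sibling offset is specific to this host's numbering, and the helper names are illustrative.

```python
def parse_cpulist(text: str) -> set[int]:
    """Expand a Linux cpulist string such as '0-5,24-29' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.split(","):
        first, _, last = part.strip().partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


# Pool strings copied from the definitions above.
POOLS = {
    "host_reserved": "0-5,24-29,48-53,72-77",
    "host_housekeeping": "0-1,24-25,48-49,72-73",
    "host_emulator": "2-5,26-29,50-53,74-77",
    "guest_domain": "6-23,30-47,54-71,78-95",
}
pools = {name: parse_cpulist(cpus) for name, cpus in POOLS.items()}

SMT_OFFSET = 48  # primary thread N pairs with sibling N + 48 on this host


def whole_cores(cpus: set[int]) -> bool:
    """True when every CPU in the set has its SMT sibling in the same set."""
    return all((cpu + SMT_OFFSET) % (2 * SMT_OFFSET) in cpus for cpu in cpus)


checks = {
    "reserved = housekeeping + emulator":
        pools["host_reserved"] == pools["host_housekeeping"] | pools["host_emulator"],
    "housekeeping and emulator are disjoint":
        not (pools["host_housekeeping"] & pools["host_emulator"]),
    "host and guest pools are disjoint":
        not (pools["host_reserved"] & pools["guest_domain"]),
    "all 96 logical CPUs are assigned":
        pools["host_reserved"] | pools["guest_domain"] == set(range(96)),
    "every pool is made of whole SMT core pairs":
        all(whole_cores(cpus) for cpus in pools.values()),
}

for name, cpus in pools.items():
    print(f"{name}: {len(cpus) // 2} physical / {len(cpus)} logical")
for label, ok in checks.items():
    print(f"{'ok  ' if ok else 'FAIL'} {label}")
```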

Current Host Policy

The host policy deliberately stops short of hard scheduler isolation:

systemd manager CPUAffinity is enabled for host_reserved
dedicated tier slices are installed and weighted
no kernel affinity boot args are managed by default
no guest-domain irqbalance ban is managed by default

In other words, the host keeps a preferred execution pool, but the design does not currently try to force early-boot IRQ and scheduler behavior into a narrow housekeeping regime.

Performance Domains

All guest vCPU threads share the same guest_domain cpuset. The Gold, Silver, and Bronze domains are not separate cpusets; they are sibling scheduling domains with different relative weight.

Tier	systemd partition	`CPUWeight`	libvirt `shares`	Intended class
Gold	`/machine/gold`	`512`	`512`	control plane
Silver	`/machine/silver`	`333`	`333`	infra and support services
Bronze	`/machine/bronze`	`167`	`167`	general workload capacity

flowchart LR
  guest["guest_domain CPUs\n6-23,30-47,54-71,78-95"]
  guest --> gold["/machine/gold\nCPUWeight=512"]
  guest --> silver["/machine/silver\nCPUWeight=333"]
  guest --> bronze["/machine/bronze\nCPUWeight=167"]

Important

These are relative shares under contention, not hard reservations. Idle CPU in Gold or Silver can still be borrowed by Bronze. The model improves graceful degradation; it does not create deterministic latency guarantees.
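To make "relative shares under contention" concrete, the sketch below divides CPU time among whichever tiers are busy, in proportion to their weights. It is an approximation rather than a model of the kernel scheduler: it assumes every busy tier has enough runnable vCPU threads to consume its share of the whole guest pool. The function name is illustrative.

```python
TIER_WEIGHTS = {"gold": 512, "silver": 333, "bronze": 167}


def contended_share(busy_tiers):
    """Split CPU time among the tiers that currently want it, by weight.

    Idle tiers get 0.0, which models borrowing: their share is redistributed
    to the busy tiers in proportion to the busy tiers' weights.
    """
    busy = [tier for tier in TIER_WEIGHTS if tier in busy_tiers]
    total = sum(TIER_WEIGHTS[tier] for tier in busy)
    return {
        tier: (TIER_WEIGHTS[tier] / total if tier in busy else 0.0)
        for tier in TIER_WEIGHTS
    }


for scenario in (["gold", "silver", "bronze"], ["silver", "bronze"], ["bronze"]):
    shares = contended_share(scenario)
    print(", ".join(f"{tier} {share:.1%}" for tier, share in shares.items()))
```

With all three tiers saturated the split is about 50.6% / 32.9% / 16.5%. When Gold is idle, Silver and Bronze divide the pool roughly two to one, and a lone busy Bronze tier can use all of it.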

Guest Tier Mapping

Current role-to-tier mapping:

Guest class	Tier	Reason
`ocp-master-01..03`	Gold	API, etcd, control-plane stability
`ocp-infra-01..03`	Silver	ingress, monitoring, registry, ODF-adjacent workload
`idm-01`	Silver	authentication, DNS, CA, bastion dependency
`ocp-worker-01..03`	Bronze	least critical general scheduling pool
`mirror-registry`	Bronze	important, but deferrable relative to cluster control plane
`bastion-01`	Bronze	operator access host, but not part of cluster control path
`ad-01`	Bronze	Windows AD for hybrid identity testing, not cluster-critical

The table is authoritative. The diagram below is only the compact tier view.

flowchart TB
  guest["Guest workload classes"]
  guest --> gold["Gold\nocp-master-01..03"]
  guest --> silver["Silver\nocp-infra-01..03\nidm-01"]
  guest --> bronze["Bronze\nocp-worker-01..03\nmirror-registry\nbastion-01\nad-01"]
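The mapping table can be read as a lookup from guest name to tier, and from tier to partition and weight. The sketch below is hypothetical: the repo derives the tier from inventory, and neither the rule list nor the `tier_for` helper reflects its actual data layout. Only the guest names, tiers, partitions, and weights are taken from this document.

```python
import re

# Tier settings from the Performance Domains table.
TIERS = {
    "gold": {"partition": "/machine/gold", "shares": 512},
    "silver": {"partition": "/machine/silver", "shares": 333},
    "bronze": {"partition": "/machine/bronze", "shares": 167},
}

# Guest-name patterns from the mapping table (hypothetical encoding).
GUEST_TIER_RULES = [
    (r"ocp-master-\d+", "gold"),
    (r"ocp-infra-\d+", "silver"),
    (r"idm-\d+", "silver"),
    (r"ocp-worker-\d+", "bronze"),
    (r"mirror-registry", "bronze"),
    (r"bastion-\d+", "bronze"),
    (r"ad-\d+", "bronze"),
]


def tier_for(guest: str) -> dict:
    """Return tier name, partition, and shares for a guest (short name or FQDN)."""
    short_name = guest.split(".")[0]
    for pattern, tier in GUEST_TIER_RULES:
        if re.fullmatch(pattern, short_name):
            return {"tier": tier, **TIERS[tier]}
    raise KeyError(f"no tier rule for guest {guest!r}")


print(tier_for("ocp-master-02"))
print(tier_for("idm-01.workshop.lan"))
```

Raising on an unknown guest, rather than defaulting to Bronze, keeps a new VM from silently landing in a tier nobody chose for it.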

Thread Placement Model

The thread-placement model is as important as the weights.

flowchart LR
  qemu[QEMU domain]
  qemu --> vcpu[vCPU threads]
  qemu --> emu[emulator thread]
  qemu --> io[optional IOThread]
  vcpu --> guest[guest_domain]
  emu --> hostemu[host_emulator]
  io --> hostemu

Current intent by thread class:

vCPU threads:
- pinned to guest_domain
emulator thread:
- pinned to host_emulator
explicit IOThread:
- pinned to host_emulator
ordinary host services:
- inherit the manager-level host_reserved mask

The tier slices intentionally do not set AllowedCPUs=. CPU placement is controlled by libvirt thread pinning so that guest execution can live on guest_domain while emulator-side noise remains on host_emulator.
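In libvirt terms, this placement ends up as a `<resource>` partition plus a `<cputune>` block in the domain XML. The sketch below generates those two elements for one guest so the shape is visible. It illustrates the resulting XML, not the code the orchestration runs, and `placement_xml` is a hypothetical helper. It requires Python 3.9 or later for `ET.indent`.

```python
import xml.etree.ElementTree as ET

# Pool masks from the Current CPU Pools section.
GUEST_DOMAIN = "6-23,30-47,54-71,78-95"
HOST_EMULATOR = "2-5,26-29,50-53,74-77"


def placement_xml(vcpus: int, partition: str, shares: int, iothreads: int = 0) -> str:
    """Build the <resource> and <cputune> elements for one guest as a string."""
    resource = ET.Element("resource")
    ET.SubElement(resource, "partition").text = partition

    cputune = ET.Element("cputune")
    ET.SubElement(cputune, "shares").text = str(shares)
    for vcpu in range(vcpus):
        # Every vCPU thread may run anywhere in the shared guest pool.
        ET.SubElement(cputune, "vcpupin", vcpu=str(vcpu), cpuset=GUEST_DOMAIN)
    # Emulator-side threads stay on the emulator subset of the host pool.
    ET.SubElement(cputune, "emulatorpin", cpuset=HOST_EMULATOR)
    for iothread in range(1, iothreads + 1):  # libvirt numbers IOThreads from 1
        ET.SubElement(cputune, "iothreadpin", iothread=str(iothread), cpuset=HOST_EMULATOR)

    for element in (resource, cputune):
        ET.indent(element)
    return "\n".join(ET.tostring(e, encoding="unicode") for e in (resource, cputune))


# idm-01: 4 vCPUs, Silver tier, no explicit IOThread.
print(placement_xml(4, "/machine/silver", 333))
```

Note that each `vcpupin` carries the full `guest_domain` mask rather than a single CPU. That keeps the one shared execution pool while still keeping vCPU threads off the host-reserved cores.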

Current `virt-install` Policy

Guest creation uses virt-install directly. For each VM, the orchestration derives the tier from inventory and then applies:

--resource partition=...
--cputune shares=...
vcpupinN.cpuset={{ guest_domain }}
emulatorpin.cpuset={{ host_emulator }}
--iothreads ... plus iothreadpin... when the guest is marked for explicit IOThreads

Selected guests currently receive one IOThread:

ocp-infra-01
ocp-infra-02
ocp-infra-03
mirror-registry

That is a pragmatic first pass for storage-heavy guests. It is not a claim that all host-side IO noise has been fully isolated.

Current Capacity Picture On `m5.metal`

m5.metal provides 96 logical CPUs and 384 GiB RAM. The current layout commits 124 vCPUs and 344 GiB of guest memory across 13 guests. CPU is deliberately oversubscribed, and the resulting contention is managed by the Gold/Silver/Bronze tier model. Memory sits just under a 1:1 commit ratio, with KSM and zram providing additional safety margin.

Present configured guest allocations:

Class	Count	vCPU	Memory	Total vCPU	Total Memory
masters	3	8	24 GiB	24	72 GiB
infra	3	16	64 GiB	48	192 GiB
workers	3	12	16 GiB	36	48 GiB
IdM	1	4	8 GiB	4	8 GiB
mirror-registry	1	4	8 GiB	4	8 GiB
bastion	1	4	8 GiB	4	8 GiB
ad	1	4	8 GiB	4	8 GiB
total	13 guests			124	344 GiB

Against the host:

guest execution pool: 72 logical CPUs
current aggregate guest vCPU count: 124
current vCPU oversubscription ratio: 1.72:1
host physical RAM: 384 GiB
current memory commit ratio: 0.90:1
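The totals and ratios above follow directly from the allocation table. A short sketch that recomputes them, which is convenient when trying out a different guest shape (the `GuestClass` type is illustrative):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuestClass:
    name: str
    count: int
    vcpus: int        # vCPUs per guest
    memory_gib: int   # memory per guest


# Rows copied from the allocation table above.
GUESTS = [
    GuestClass("masters", 3, 8, 24),
    GuestClass("infra", 3, 16, 64),
    GuestClass("workers", 3, 12, 16),
    GuestClass("IdM", 1, 4, 8),
    GuestClass("mirror-registry", 1, 4, 8),
    GuestClass("bastion", 1, 4, 8),
    GuestClass("ad", 1, 4, 8),
]

GUEST_POOL_CPUS = 72   # logical CPUs in guest_domain
HOST_RAM_GIB = 384     # m5.metal physical RAM

total_guests = sum(g.count for g in GUESTS)
total_vcpus = sum(g.count * g.vcpus for g in GUESTS)
total_memory = sum(g.count * g.memory_gib for g in GUESTS)
cpu_ratio = total_vcpus / GUEST_POOL_CPUS
memory_ratio = total_memory / HOST_RAM_GIB

print(f"{total_guests} guests, {total_vcpus} vCPUs, {total_memory} GiB")
print(f"vCPU oversubscription {cpu_ratio:.2f}:1, memory commit {memory_ratio:.2f}:1")
```

The vCPU ratio is taken against the 72-CPU guest pool rather than all 96 logical CPUs, because the host-reserved cores are not available to vCPU threads.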

vCPU Oversubscription Context

Scenario	Worker shape	Total guest vCPU	Ratio vs 72-logical guest pool	Notes
historical validated baseline	`3 x 4`	98	`1.36:1`	known good baseline before worker uplift
previous repo default	`3 x 8`	110	`1.53:1`	first uplift from the original 4-vCPU workers
current repo default	`3 x 12`	124	`1.72:1`	chosen default, validated with memory oversub in place

The move to 3 x 12 workers was made possible by the memory oversubscription work (KSM, zram, THP). The host demonstrated low steady-state memory utilization and minimal %steal at the previous 3 x 8 shape, giving confidence to push workers to the larger allocation.

Memory Commit Context

The 0.90:1 memory commit ratio means guest allocations (344 GiB) stay below host RAM (384 GiB), leaving about 40 GiB for the host kernel, page cache, and host-side services. In practice the host remains comfortable because:

not all guest memory is active simultaneously (cold pages exist in every VM)
KSM deduplicates 10-30 GiB of shared RHCOS kernel and base-OS pages
zram provides a compressed swap buffer for cold anonymous pages under pressure
the host kernel's own page cache and slab are reclaimable

The infra nodes at 64 GiB each are the dominant memory consumers (192 GiB total, 50% of host RAM). This is driven by ODF (Ceph OSD memory), the monitoring stack, and ingress router workloads, all of which land on infra.

Note

The memory commit ratio is deliberately kept below 1.0 so that the host does not depend on KSM or zram for basic stability. Those features improve headroom and absorb transient pressure, but the base allocation fits without them.

Resizing Guidance For Other Metal Hosts

Warning

The design is portable, but the CPU sets are not. If you move to a different metal host shape, you must recompute every pool mask. Copy-pasting the m5.metal values onto a different topology will produce wrong pinning.

Invariants To Keep

reserve whole physical cores, not lone SMT siblings
keep the reserved pool symmetric across sockets
keep one shared guest vCPU pool
use Gold, Silver, and Bronze as sibling weighted domains inside that one guest pool
keep emulator threads out of the guest pool

Starting Method

For a new host:

Determine:
- sockets
- physical cores per socket
- SMT sibling numbering
Choose a host reservation per socket as whole physical cores.
Split that reservation into:
- 2 housekeeping cores per socket
- the remainder as emulator/helper cores per socket
Allocate all remaining whole cores to guest_domain.
Recompute:
- host CPU masks
- guest CPU mask
- effective guest-pool logical CPU count
- projected guest-vCPU oversubscription ratios
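The method above can be sketched as a small planner. It assumes the new host numbers its CPUs the way m5.metal does: primary threads are numbered socket by socket, and each SMT sibling is the primary plus the total physical core count. That assumption must be verified against the real topology before using the output. The function names are illustrative.

```python
def cpulist(cpus) -> str:
    """Collapse CPU ids into a Linux cpulist string such as '0-5,24-29'."""
    ranges = []
    for cpu in sorted(cpus):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu
        else:
            ranges.append([cpu, cpu])
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def plan_pools(sockets: int, cores_per_socket: int, reserved_per_socket: int,
               housekeeping_per_socket: int = 2) -> dict[str, str]:
    """Derive the four pool masks under the assumed regular CPU numbering."""
    if not housekeeping_per_socket < reserved_per_socket < cores_per_socket:
        raise ValueError("need housekeeping < reserved < cores per socket")
    offset = sockets * cores_per_socket  # distance from primary to SMT sibling
    housekeeping, emulator, guest = set(), set(), set()
    for socket in range(sockets):
        base = socket * cores_per_socket
        for core in range(cores_per_socket):
            if core < housekeeping_per_socket:
                pool = housekeeping
            elif core < reserved_per_socket:
                pool = emulator
            else:
                pool = guest
            # Always allocate the whole core: primary thread plus sibling.
            pool.update({base + core, base + core + offset})
    return {
        "host_reserved": cpulist(housekeeping | emulator),
        "host_housekeeping": cpulist(housekeeping),
        "host_emulator": cpulist(emulator),
        "guest_domain": cpulist(guest),
    }


# Reproduces the current m5.metal pools: 2 sockets, 24 cores, 6 reserved each.
for name, mask in plan_pools(2, 24, 6).items():
    print(f"{name}: {mask}")
```

The effective guest-pool size is then `sockets * (cores_per_socket - reserved_per_socket) * 2` logical CPUs, which is the denominator for the projected oversubscription ratios.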

Reasonable Starting Reservations

Host shape	Starting host reservation per socket	Notes
`<= 16` physical cores/socket	`4`	`2` housekeeping + `2` emulator is the minimum sensible split
`24` physical cores/socket	`6`	current `m5.metal` design: `2` housekeeping + `4` emulator
`>= 32` physical cores/socket	`6` or `8`	start at `6`; move to `8` only if real host-side pressure justifies it

When The Host Gets Smaller

On a smaller metal host, reduce guest ambition before weakening the host pool too aggressively.

Preferred order of compromise:

keep masters stable
reduce worker target growth
reduce support-guest size where tolerable
only then consider squeezing the host reservation further

When The Host Gets Larger

On a larger metal host, grow the guest pool first. Do not change the tier weights just because more cores are available.

Preferred order of expansion:

increase guest_domain
validate host-side pressure remains healthy
increase worker vCPUs
rebalance weights only if measurement says the contention model is wrong

Operational Validation

After bootstrap and guest creation, validate both the host policy and the guest placement.

Host checks:

grep -E '^(DefaultCPUAccounting|CPUAffinity)=' \
  /etc/systemd/system.conf.d/90-aws-metal-openshift-demo-host-resource-management.conf
systemctl show machine-gold.slice machine-silver.slice machine-bronze.slice \
  -p CPUAccounting -p CPUWeight
grep Cpus_allowed_list /proc/1/status
cat /proc/cmdline

Expected current behavior:

manager CPUAffinity present
tier slices present with expected weights
PID 1 confined to host_reserved
no systemd.cpu_affinity= or irqaffinity= boot args by default
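The "PID 1 confined to host_reserved" check can also be automated. The sketch below parses the `Cpus_allowed_list` line from the text of a `/proc/<pid>/status` file and tests whether it is a subset of a pool mask. The helper names are illustrative, and the sample text stands in for a real read of `/proc/1/status`.

```python
HOST_RESERVED = "0-5,24-29,48-53,72-77"


def parse_cpulist(text: str) -> set[int]:
    """Expand a Linux cpulist string such as '0-5,24-29' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.split(","):
        first, _, last = part.strip().partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def allowed_cpus(status_text: str) -> set[int]:
    """Extract Cpus_allowed_list from the text of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Cpus_allowed_list":
            return parse_cpulist(value)
    raise ValueError("Cpus_allowed_list not found")


def confined_to(status_text: str, pool: str) -> bool:
    """True when the process may only run on CPUs inside the given pool."""
    return allowed_cpus(status_text) <= parse_cpulist(pool)


# Sample input; on the host, use Path("/proc/1/status").read_text() instead.
sample = "Name:\tsystemd\nCpus_allowed:\t(mask elided)\nCpus_allowed_list:\t0-5,24-29,48-53,72-77\n"
print("PID 1 confined to host_reserved:", confined_to(sample, HOST_RESERVED))
```

A subset test is used rather than equality so that a service deliberately narrowed to part of the reserved pool still passes.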

Guest placement checks:

virsh dumpxml idm-01.workshop.lan | grep -E 'partition|shares|iothreads'
virsh vcpupin idm-01.workshop.lan
virsh emulatorpin idm-01.workshop.lan
systemctl status machine-silver.slice

Expected current behavior:

persistent XML contains the tier partition
persistent XML contains the expected shares
vCPU pinning lands in guest_domain
emulator pinning lands in host_emulator

Signals That The Model Needs Adjustment

Host reservation too small:

sshd and Cockpit become sluggish under modest admin activity
OVS or libvirt daemons show persistent pressure on reserved CPUs
emulator-thread contention appears during IO-heavy guest activity

Host reservation too large:

guest pool is undersized relative to lab demand
workers remain cramped even though host services stay quiet
guest %steal rises while the host-reserved CPUs remain mostly idle

Tier weights need reevaluation:

masters are still impacted during cluster churn
infra services dominate at the expense of worker throughput
Bronze never benefits from obvious idle capacity above it
