Orchestration Guide
Come here when you need to answer questions like:
- which playbook owns this phase?
- which role is doing the real work?
- where should a fix land?
It maps the Ansible side of the repo: role boundaries, critical tasks, and the workflows that matter during build, teardown, and disconnected OpenShift preparation.
Keep these nearby while you use this page:
- AUTOMATION FLOW for run order and lifecycle
- ORCHESTRATION PLUMBING for workstation-to-bastion handoff, runner files, and dashboard telemetry
- AUTH MODEL for the formal current-state identity and authorization architecture
- AD / IDM POLICY MODEL for the planned future AD-source-of-truth authorization model
- IAAS MODEL for AWS and cloud-init design
- RESOURCE MANAGEMENT for guest tiers and CPU pools
- CLUSTER MATRIX for per-node identities and sizing
Current Project Layout
Use this section when you need the maintainer view of the repo shape rather than the operator entry path. The supported deployment path still starts in PREREQUISITES.
Table Of Contents
Use this when you need to answer "where should I look" before you answer "what exactly is broken."
- Top-Level Playbooks
- Data Files
- Project Dependencies
- Host Roles
- AD Roles
- IDM Roles
- OpenShift Cluster Roles
- Mirror Registry Roles
- Validation Practices In Use
- Known Gaps
Top-Level Playbooks
The canonical operator entrypoints are:
- ./scripts/run_local_playbook.sh playbooks/site-bootstrap.yml
- ./scripts/run_remote_bastion_playbook.sh playbooks/site-lab.yml
The top-level playbook artifacts behind those entrypoints are:
The CloudFormation stack split, guest-disk inventory model, and first-boot cloud-init behavior are documented in IAAS MODEL. This page stays focused on which playbook or role owns each part once the substrate is already there.
playbooks/site-bootstrap.yml
Purpose:
- run the outside-facing bootstrap phase from the operator desktop
Execution model:
- imports:
playbooks/site-lab.yml
Purpose:
- run the inside-facing lab build from the bastion
Execution model:
- imports:
  - playbooks/bootstrap/ad-server.yml
  - playbooks/bootstrap/idm.yml
  - playbooks/bootstrap/idm-ad-trust.yml
  - playbooks/bootstrap/bastion-join.yml
  - playbooks/lab/mirror-registry.yml
  - playbooks/lab/openshift-dns.yml
  - playbooks/cluster/openshift-installer-binaries.yml
  - playbooks/cluster/openshift-install-artifacts.yml
  - playbooks/cluster/openshift-agent-media.yml
  - playbooks/cluster/openshift-cluster.yml
  - playbooks/cluster/openshift-install-wait.yml
  - playbooks/day2/openshift-post-install-validate.yml
  - playbooks/day2/openshift-post-install.yml
- intended to run on bastion-01, not on the operator workstation
- playbooks/bootstrap/ad-server.yml is optional and exits early unless lab_build_ad_server=true
- for resilient long-running execution, the project provides scripts/run_bastion_playbook.sh, which writes PID, log, and exit-code state under /var/tmp/bastion-playbooks/
- support VMs (ad-01, idm-01, bastion-01, and mirror-registry) default to preserving existing disks and libvirt domains on rerun
- the support-service bootstrap path now probes for completed AD, IdM, bastion base, and bastion-join state and skips those phases when the existing guests are already converged
- a true fresh support-services rebuild now means both removing the support VMs and wiping their backing block devices before replaying ./scripts/run_local_playbook.sh playbooks/site-bootstrap.yml
- the mirror-registry phase now caches successful mirror completion for the rendered content set and skips the expensive oc-mirror work on rerun unless forced
- the cluster VMs now default to reuse on rerun; destructive cluster rebuilds are an explicit cleanup action rather than the normal replay path
- bastion staging ensures the staged generated/ workspace is writable by cloud-user so repeated installer renders do not fail on ownership drift
- bastion staging also seeds a small managed /etc/hosts fallback for the bootstrap-critical support hostnames and cluster API endpoints, then verifies those names with getent before the long-running orchestration starts
- the guest-build playbooks later in this phase consume the same host_resource_management data loaded during bootstrap
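As a rough illustration of the import chain described above, a top-level playbook of this shape could look like the following. This is a sketch, not the contents of the real playbooks/site-lab.yml; it only assumes that the imported paths resolve relative to the playbooks/ directory.

```yaml
---
# Sketch of an import-only top-level playbook; order follows the list above.
# The AD playbook is always imported and is expected to exit early by itself
# unless lab_build_ad_server=true.
- ansible.builtin.import_playbook: bootstrap/ad-server.yml
- ansible.builtin.import_playbook: bootstrap/idm.yml
- ansible.builtin.import_playbook: bootstrap/idm-ad-trust.yml
- ansible.builtin.import_playbook: bootstrap/bastion-join.yml
- ansible.builtin.import_playbook: lab/mirror-registry.yml
- ansible.builtin.import_playbook: lab/openshift-dns.yml
- ansible.builtin.import_playbook: cluster/openshift-installer-binaries.yml
- ansible.builtin.import_playbook: cluster/openshift-install-artifacts.yml
- ansible.builtin.import_playbook: cluster/openshift-agent-media.yml
- ansible.builtin.import_playbook: cluster/openshift-cluster.yml
- ansible.builtin.import_playbook: cluster/openshift-install-wait.yml
- ansible.builtin.import_playbook: day2/openshift-post-install-validate.yml
- ansible.builtin.import_playbook: day2/openshift-post-install.yml
```

Because import_playbook is static, the order is fixed at parse time, so a fix to one phase normally belongs in the imported playbook or its role rather than in the top-level file.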
playbooks/bootstrap/site.yml
Purpose:
- prepare the AWS metal hypervisor
- install virtualization and networking prerequisites
- build the lab-switch Open vSwitch and libvirt topology
- configure host routing and NAT
Execution model:
- runs against the metal host group
- loads vars/global/lab_switch_ports.yml
- waits for the full expected disk inventory before host configuration begins
- registers the host with RHSM, disables RHUI repos, and enables CDN repos required for RHEL virtualization and Open vSwitch content
- composes five roles in order:
  - lab_host_base
  - lab_host_resource_management
  - lab_switch
  - lab_firewall
  - lab_libvirt
Important behavior:
- loads vars/global/host_resource_management.yml
- installs machine-gold.slice, machine-silver.slice, and machine-bronze.slice
- applies manager-level systemd CPUAffinity for the reserved host CPU pool
- intentionally does not enable kernel affinity boot args or IRQBALANCE_BANNED_CPULIST by default
- defines the CPU-pool data later consumed by virt-install guest definitions
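A minimal sketch of how a play could load the two data files and compose the five roles in the stated order. The host group name and the relative vars paths are assumptions for illustration, and the disk-inventory wait and RHSM steps are deliberately omitted.

```yaml
---
# Sketch only: group name and relative paths are hypothetical.
- name: Prepare the metal hypervisor
  hosts: metal_hosts              # hypothetical inventory group
  become: true
  vars_files:
    - ../../vars/global/lab_switch_ports.yml
    - ../../vars/global/host_resource_management.yml
  roles:
    # Roles run top to bottom, so resource management is in place
    # before the switch, firewall, and libvirt layers are configured.
    - lab_host_base
    - lab_host_resource_management
    - lab_switch
    - lab_firewall
    - lab_libvirt
```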
playbooks/bootstrap/ad-server.yml
Purpose:
- provision ad-01.corp.lan
- configure the guest as the optional lab AD DS / AD CS server
Execution model:
- first play runs on the hypervisor and creates the Windows guest with ad_server
- second play waits for the first WinRM listener
- third play configures the guest with ad_server_guest
- final play removes installation media from persistent XML
- in the validated flow, bastion reaches the Windows guest directly on VLAN 100 with WinRM; it does not proxy back through the operator workstation
Important behavior:
- default-disabled behind lab_build_ad_server: false
- uses an EBS-backed system disk at /dev/ebs/ad-01
- uses Windows Server 2025 media plus virtio-win.iso
- loads only the boot-critical storage and network drivers during Setup
- installs the remaining virtio drivers and virtio-win-gt-x64.msi post-install over WinRM
- configures:
  - AD DS
  - AD CS
  - Web Enrollment
  - demo users and groups for the trust-oriented identity story
- exports the AD root CA after successful configuration
- probes an existing guest for completed DC/CA state and skips the build/config path on rerun when that state is already present and no explicit rebuild was requested
Note
The bastion-first AD build is validated. The deeper IdM trust and AD-root/IdM-intermediate PKI path is still follow-on work, not the default documented golden path.
playbooks/bootstrap/idm.yml
Purpose:
- provision idm-01.workshop.lan
- configure the guest as the lab IdM/DNS/CA/KRA server
Execution model:
- first play runs on the hypervisor and creates the VM with the idm role
- then add_host registers the guest dynamically
- second play waits for SSH and configures the guest with idm_guest
- in the validated bastion-first flow, this playbook runs from the bastion after bastion staging and after the optional AD build when enabled
Important behavior:
- applies the Silver guest policy described in RESOURCE MANAGEMENT
- registers the guest with RHSM and Red Hat Insights
- installs the IPA server with the freeipa.ansible_freeipa.ipaserver role
- enables KRA through the FreeIPA role rather than a hand-driven CLI step
- manages users, groups, password policies, and sudo rules through FreeIPA modules
- enables authselect with:
  - with-tlog
  - with-mkhomedir
  - with-sudo
- enables oddjobd so domain-user home directories are created on first login
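The create-then-register-then-configure shape used by this playbook (and by the bastion and mirror-registry playbooks) can be sketched as two plays. The group names, the idm_guest_ip variable, and the cloud-user login are assumptions for illustration; only the idm and idm_guest role names come from this page.

```yaml
---
# Sketch only: group names and idm_guest_ip are hypothetical.
- name: Create the IdM VM on the hypervisor
  hosts: metal_hosts                       # hypothetical inventory group
  become: true
  roles:
    - idm
  post_tasks:
    - name: Register the new guest in the in-memory inventory
      ansible.builtin.add_host:
        name: idm-01.workshop.lan
        groups: idm_guests                 # hypothetical group
        ansible_host: "{{ idm_guest_ip }}" # hypothetical variable
        ansible_user: cloud-user           # assumed login user

- name: Configure the IdM guest
  hosts: idm_guests
  become: true
  gather_facts: false                      # the guest may not answer yet
  pre_tasks:
    - name: Wait for SSH on the guest
      ansible.builtin.wait_for_connection:
        timeout: 600
    - name: Gather facts once the guest is reachable
      ansible.builtin.setup:
  roles:
    - idm_guest
```

Hosts added with add_host exist only for the current run, which is why the second play has to live in the same playbook invocation as the first.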
playbooks/bootstrap/idm-ad-trust.yml
Purpose:
- configure the optional IdM to AD trust after idm-01 and the optional AD VM are both available
Execution model:
- first registers the IdM guest and AD guest as temporary inventory hosts
- configures the AD conditional forwarder for workshop.lan from the Windows side before touching the IdM trust path
- then runs the idm_ad_trust role on idm-01 from the bastion-side flow
- exits early unless both lab_build_ad_server=true and the AD-trust feature are enabled
Important behavior:
- enables IdM AD-trust server support and the IPA forward zone for the AD domain
- validates both host and LDAP SRV lookups through the new forward zone before creating the trust
- creates the AD trust with bounded retry around transient oddjob/cache issues
- creates the configured IdM external groups and nests them into the target local IdM policy groups
playbooks/bootstrap/bastion.yml
Purpose:
- provision bastion-01.workshop.lan
- create the execution host for the rest of the lab
Execution model:
- first play runs on the hypervisor and creates the VM with the bastion role
- then add_host registers the guest dynamically
- second play waits for SSH and configures the guest with bastion_guest
Important behavior:
- applies the Bronze guest policy described in RESOURCE MANAGEMENT
- registers the guest with RHSM and Red Hat Insights
- updates all guest packages and reboots when needed
- installs the bastion management package set, including:
  - cockpit-files
  - cockpit-packagekit
  - cockpit-podman
  - cockpit-session-recording
  - cockpit-image-builder
  - pcp
  - pcp-system-tools
- enables:
  - cockpit.socket
  - osbuild-composer.socket
  - pmcd
  - pmlogger
  - pmproxy
- enables oddjobd
- intentionally does not join IdM during the initial bastion build
- leaves IdM enrollment to playbooks/bootstrap/bastion-join.yml after IdM is available
- probes an existing guest for the completed bastion package/service/tooling baseline and skips the base build/config phases on rerun when that state is already present
playbooks/bootstrap/bastion-stage.yml
Purpose:
- stage the repo and execution inputs onto the bastion
Execution model:
- first play registers the bastion through the hypervisor using ProxyCommand
- second play:
  - installs execution prerequisites on the bastion
  - synchronizes the repo to the bastion with rsync
  - preserves bastion-side generated/ content during refresh
  - stages the pull secret and hypervisor SSH key
  - renders a bastion-local inventory for 172.16.0.1
  - installs Ansible collections
  - installs pip requirements such as pywinrm so bastion-native Windows orchestration can talk to ad-01
  - installs /etc/profile.d/openshift-bastion.sh
  - publishes a stable generated/tools/current symlink
  - creates $HOME/bin and $HOME/etc link sets for cloud-user and current IdM admins
  - seeds the same helper layout into /etc/skel for future admin logins
  - verifies SSH from bastion to virt-01
Operator helper:
- scripts/run_local_playbook.sh
  - launches workstation-side playbooks with tracked PID, log, and exit-code state under ~/.local/state/calabi-playbooks/
  - this is the preferred way to track site-bootstrap.yml from the operator workstation
- scripts/run_remote_bastion_playbook.sh
  - refreshes bastion staging by running playbooks/bootstrap/bastion-stage.yml
  - records the workstation-side validation and bastion-staging phase under ~/.local/state/calabi-playbooks/, then invokes the staged scripts/run_bastion_playbook.sh helper on the bastion
  - this is the preferred way to rerun bastion-native playbooks after local repository changes
  - restages the selected override file from the workstation repo before the bastion-side play starts, so local override edits are part of the same handoff
- scripts/lab-dashboard.sh
  - runs on either the operator workstation or bastion
  - on the workstation, it reads local tracked state first and then switches to bastion-side runner state after handoff metadata appears
  - if site-lab.yml is still in local validation or bastion-stage.yml, the bastion dashboard will correctly show nothing yet because the bastion-side runner does not exist until after handoff
Day-2 rerun behavior:
- playbooks/day2/openshift-post-install.yml now probes the major post-install phases before including their roles
- healthy phases are skipped by default on rerun instead of being reapplied
- the guarded phases include:
- disconnected OperatorHub
- infra conversion
- IdM ingress certs
- breakglass auth
- NMState
- ODF
- Keycloak
- OIDC auth
- optional LDAP auth and group sync
- OpenShift Virtualization
- Pipelines
- Web Terminal
- AAP
- Network Observability
- destructive ODF recovery is force-only:
  - -e openshift_post_install_force_odf_rebuild=true
  - legacy alias: -e openshift_post_install_odf_force_osd_device_reset=true
- ODF can now run in either internal or external mode, with the external path creating the imported external-cluster secret and external StorageCluster in the same ordering slot that internal ODF previously owned
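The probe-then-include behavior can be sketched for a single phase as follows. This is an assumed shape, not the repository's implementation: the role name is hypothetical, the probe assumes ODF lives in the openshift-storage namespace, and it treats a StorageCluster phase of Ready as healthy. Only the openshift_post_install_force_odf_rebuild flag comes from this page.

```yaml
# Sketch only: role name and probe details are assumptions.
- name: Probe the existing StorageCluster phase (read-only)
  ansible.builtin.command:
    argv:
      - oc
      - get
      - storagecluster
      - --namespace=openshift-storage
      - "--output=jsonpath={.items[0].status.phase}"
  register: odf_probe
  changed_when: false
  failed_when: false          # a missing StorageCluster means "not healthy", not "abort"

- name: Apply the ODF role only when the phase is unhealthy or a rebuild is forced
  ansible.builtin.include_role:
    name: openshift_post_install_odf      # hypothetical role name
  when: >-
    odf_probe.rc != 0
    or odf_probe.stdout != 'Ready'
    or (openshift_post_install_force_odf_rebuild | default(false) | bool)
```

Keeping the probe read-only and the force flag explicit means a plain rerun cannot reach the destructive path by accident.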
The shell profile installed by bastion-stage:
- prepends $HOME/bin and generated/tools/current to PATH
- exports KUBECONFIG_ADMIN=$HOME/etc/kubeconfig.local
- exports KUBECONFIG=$HOME/etc/kubeconfig when that writable working copy exists, otherwise falls back to KUBECONFIG_ADMIN
- keeps the generated cluster artifact kubeconfig as the source snapshot rather than the default mutable login target
- leaves early bastion logins clean even before OpenShift auth artifacts exist
The generated/tools/current path is already a symlink to the active tool
directory. Do not append another /bin component when documenting or
debugging bastion helper links.
playbooks/bootstrap/bastion-join.yml
Purpose:
- join the already-built bastion to IdM after identity services are available
Execution model:
- runs directly on openshift_bastion
- waits for the IdM guest to answer on SSH first
- reuses the existing bastion_guest role in join mode rather than creating a second bastion-enrollment implementation
Important behavior:
- refreshes the active IdM CA before enrollment
- runs the FreeIPA client role only when enrollment is required
- does not perform a general guest dnf update or reboot; that behavior now stays in the initial site-bootstrap.yml provisioning path so site-lab.yml does not power off its own control host mid-run
- enables authselect with:
  - with-mkhomedir
  - with-sudo
- leaves the bastion ready for IdM-backed operator access before mirror-registry and cluster work begin
- probes bastion for completed IdM enrollment and skips the join path on rerun when the existing guest already has the expected client config, DNS, and IPA host record
AD Roles
roles/ad_server
Purpose:
- create, reset, and boot the Windows AD VM on the hypervisor
Important behavior:
- stages the Windows ISO, virtio-win.iso, and generated unattend media under /var/lib/aws-metal-openshift-demo/ad-01/
- uses /dev/ebs/ad-01 for the system disk
- keeps the Windows install path deterministic enough for reruns by isolating boot-critical drivers from later guest-driver work
roles/ad_server_guest
Purpose:
- configure Windows after the first WinRM listener is available
Important behavior:
- installs remaining virtio drivers after the OS is reachable
- installs and starts the QEMU guest agent
- promotes the server to a DC
- configures AD CS and Web Enrollment
- seeds demo users and groups
- exports the root CA
- tolerates reruns after the setup ISOs have been detached by skipping the media-backed virtio step when the guest agent is already installed
playbooks/lab/mirror-registry.yml
Purpose:
- provision mirror-registry.workshop.lan
- join it to IdM
- install and configure the local mirror registry
- prepare and optionally execute disconnected OpenShift content mirroring
Execution model:
- first play runs on the hypervisor and creates the VM with mirror_registry
- it waits for idm-01 first because the guest is domain-joined later
- then add_host registers the guest dynamically
- second play waits for SSH, gathers facts, and applies mirror_registry_guest
- when run from the bastion, guest communication is direct to 172.16.0.20 across VLAN 100; it does not proxy back through virt-01
Important behavior:
- applies the Bronze guest policy described in RESOURCE MANAGEMENT
- installs calabi-shell after RHSM registration and package access are available, because the system-wide install requires git
- joins IdM without relying on client-driven dynamic DNS updates for the guest's static address
- reasserts the mirror-registry A/PTR records in authoritative IdM DNS after enrollment and validates that dig @idm-01 returns the expected address
- default disconnected mode is portable:
  - m2d archive build
  - followed automatically by d2m import into Quay
- writes a success marker tied to the rendered image-set checksum so reruns can skip the expensive mirror step when the content has not changed
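One way to implement a checksum-tied success marker is sketched below. It is an illustration under assumptions, not the role's actual tasks: mirror_imageset_path, mirror_marker_path, mirror_force, and the mirror_content.yml task file are all hypothetical names.

```yaml
# Sketch only: every variable and file name here is hypothetical.
- name: Checksum the rendered image-set configuration
  ansible.builtin.stat:
    path: "{{ mirror_imageset_path }}"
    get_checksum: true
    checksum_algorithm: sha256
  register: imageset_stat

- name: Read the previous success marker, if any
  ansible.builtin.slurp:
    src: "{{ mirror_marker_path }}"
  register: mirror_marker
  failed_when: false          # a missing marker just means "never mirrored"

- name: Decide whether the mirror step can be skipped
  ansible.builtin.set_fact:
    mirror_is_current: >-
      {{ mirror_marker.content is defined
         and (mirror_marker.content | b64decode | trim) == imageset_stat.stat.checksum
         and not (mirror_force | default(false) | bool) }}

- name: Run the expensive mirror workflow
  ansible.builtin.include_tasks: mirror_content.yml
  when: not (mirror_is_current | bool)

- name: Record the checksum only after the mirror workflow succeeded
  ansible.builtin.copy:
    content: "{{ imageset_stat.stat.checksum }}\n"
    dest: "{{ mirror_marker_path }}"
    mode: "0644"
  when: not (mirror_is_current | bool)
```

Writing the marker as the last task matters: if the mirror step fails, the play stops before the marker is updated, so the next run retries instead of skipping.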
playbooks/lab/openshift-dns.yml
Purpose:
- populate OpenShift forward and reverse DNS data in IdM from the cluster matrix
Execution model:
- first play registers idm-01 dynamically from the hypervisor side
- the bastion reaches idm-01 directly on VLAN 100 for DNS management tasks
- second play connects to the IdM guest and applies idm_openshift_dns
- the publish step now validates authoritative IdM resolution for the newly added OpenShift A and PTR records before the play exits
playbooks/lab/openshift-dns-cleanup.yml
Purpose:
- remove the OpenShift forward and reverse DNS data from IdM
Execution model:
- first play registers idm-01 dynamically from the hypervisor side
- second play connects to the IdM guest and applies idm_openshift_dns_cleanup
playbooks/cluster/openshift-install-artifacts.yml
Purpose:
- render install-config.yaml and agent-config.yaml for the current cluster matrix
- embed the IdM CA as additionalTrustBundle
- keep installer inputs aligned with the VM shell orchestration
Execution model:
- runs locally on the current execution host
- loads the cluster matrix and VM shell definitions
- validates node names and MAC addresses still match across both datasets
- renders per-node rootDeviceHints.serialNumber values from the cluster VM root-disk serials
- reads the local pull secret and SSH public key
- fetches the public IdM CA via the hypervisor
- renders:
  - generated/ocp/install-config.yaml
  - generated/ocp/agent-config.yaml
  - generated/ocp/idm-ca.crt
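For orientation, the two rendered fields called out above sit roughly as shown in this fragment. All values are placeholders, not the lab's real data: the hostname, rendezvous address, MAC, and serial are invented for illustration, and the real files carry many more fields.

```yaml
# install-config.yaml (fragment): the IdM CA is embedded as a PEM block.
additionalTrustBundle: |
  -----BEGIN CERTIFICATE-----
  <contents of generated/ocp/idm-ca.crt>
  -----END CERTIFICATE-----
---
# agent-config.yaml (fragment): one host entry per node in the cluster matrix.
apiVersion: v1beta1
kind: AgentConfig
metadata:
  name: ocp                            # placeholder cluster name
rendezvousIP: 192.0.2.10               # placeholder address
hosts:
  - hostname: master-01.example        # placeholder
    role: master
    rootDeviceHints:
      serialNumber: EXAMPLE-ROOT-SERIAL   # must match the VM root-disk serial
    interfaces:
      - name: enp1s0
        macAddress: "52:54:00:00:00:10"   # must match the VM shell definition
```

The serial and MAC are the two values that tie installer inputs to the VM shells, which is why the playbook validates both datasets against each other before rendering.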
playbooks/cluster/openshift-installer-binaries.yml
Purpose:
- download the exact OpenShift installer/client toolchain for the same release mirrored into the local registry
- prepare local prerequisites needed by openshift-install agent create image
Execution model:
- runs locally on the current execution host
- reads mirror_registry_ocp_release from vars/global/mirror_content.yml
- installs local package prerequisites such as nmstate
- downloads release-specific archives from the OpenShift mirror
- extracts:
  - openshift-install
  - oc
  - kubectl
playbooks/cluster/openshift-agent-media.yml
Purpose:
- generate the agent boot ISO for the current cluster definition
- write a generated cluster attachment-plan overlay for the VM-shell workflow
Execution model:
- runs locally on the current execution host
- requires rendered install artifacts to already exist
- uses the downloaded openshift-install binary for the pinned release
- generates:
  - generated/ocp/agent.x86_64.iso
  - generated/ocp/openshift_cluster_attachment_plan.yml
- the resulting overlay is loaded automatically by playbooks/cluster/openshift-cluster.yml when present
playbooks/cluster/openshift-install-wait.yml
Purpose:
- wait for the agent-based OpenShift install to complete
- recover fresh-install control-plane nodes that stay attached to the agent ISO instead of pivoting to disk
Execution model:
- runs locally on the current execution host, typically bastion
- uses the rendered installer directory and pinned openshift-install binary
- polls assisted-service from the rendezvous host during bootstrap when /etc/assisted/rendezvous-host.env is still present; if that bootstrap-only metadata is already gone, the assisted-service probe degrades cleanly rather than failing the play
- detects control-plane nodes stuck in installing-pending-user-action
- on those nodes it:
  - ejects the agent ISO from libvirt
  - restores disk-first boot order
  - power-cycles the affected domains
- waits for bootstrap-complete
- if the first bootstrap-complete wait fails, probes the control-plane nodes for agent.service, kubelet.service, and bootkube.service, recovers any node still stuck in agent mode, and retries bootstrap-complete once
- then probes the control-plane nodes again and recovers any node still stuck in agent.service without kubelet.service
- only after those checks does it wait for install-complete
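The wait, recover, retry-once ordering can be sketched with a block/rescue pair. This is a simplified illustration, not the playbook's real task list: openshift_install_bin, openshift_install_dir, and recover_agent_nodes.yml are hypothetical names, and the assisted-service polling is left out.

```yaml
# Sketch only: variables and the recovery task file are hypothetical.
- name: Wait for bootstrap-complete, recovering stuck nodes once
  block:
    - name: First bootstrap-complete wait
      ansible.builtin.command:
        cmd: "{{ openshift_install_bin }} agent wait-for bootstrap-complete --dir {{ openshift_install_dir }}"
      changed_when: false
  rescue:
    - name: Recover control-plane nodes still booted from the agent ISO
      ansible.builtin.include_tasks: recover_agent_nodes.yml
    - name: Retry bootstrap-complete once
      ansible.builtin.command:
        cmd: "{{ openshift_install_bin }} agent wait-for bootstrap-complete --dir {{ openshift_install_dir }}"
      changed_when: false

- name: Wait for install-complete only after bootstrap has converged
  ansible.builtin.command:
    cmd: "{{ openshift_install_bin }} agent wait-for install-complete --dir {{ openshift_install_dir }}"
  changed_when: false
```

If the retry inside rescue also fails, the play fails there, which gives the bounded single retry described above rather than an open-ended loop.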
playbooks/maintenance/cleanup.yml
Purpose:
- aggregate destructive cleanup workflows
- support cluster-only cleanup without destroying healthy support services
- optionally remove support guests, IdM ingress cert state, and lab networking
Execution model:
- one play runs idm_cleanup
- one play runs mirror_registry_cleanup
- one play runs bastion_cleanup
- one play runs openshift_cluster_cleanup
- one play removes stale bastion-side tracked state and generated cluster artifacts when the cluster is being rebuilt
- one play runs openshift_post_install_idm_certs_cleanup
- one play runs lab_cleanup
playbooks/day2/openshift-post-install-idm-certs.yml
Purpose:
- configure OpenShift ingress to use an IdM-issued wildcard certificate
Execution model:
- runs bastion-native and reaches the cluster API plus IdM directly from the inside of the lab
- creates the wildcard DNS/service prerequisites on idm-01
- imports a custom Dogtag certificate profile for wildcard ingress issuance
- requests an ingress certificate for:
  - apps.ocp.workshop.lan
  - *.apps.ocp.workshop.lan
- builds the OpenShift ingress secret and patches the default IngressController
- applies the IdM CA into cluster trust so routes and console checks remain healthy
playbooks/day2/openshift-post-install.yml
Purpose:
- coordinate day-2 cluster configuration after initial convergence
Execution model:
- runs bastion-native against the cluster API and supporting in-lab services
- composes post-install roles such as:
- disconnected OperatorHub pivot
- infra node conversion
- IdM ingress certificate integration
- HTPasswd breakglass authentication
- Kubernetes NMState deployment
- ODF declarative deployment
  - branches on openshift_post_install_odf_mode
  - internal keeps the existing Local Storage Operator plus internal Ceph path
  - external installs the ODF operator, creates the imported rook-ceph-external-cluster-details secret, and creates the external StorageCluster
  - both modes occupy the same ordering slot so downstream roles do not gain a new dependency edge
  - external mode applies and waits for the Network Operator host-routing settings before the external StorageCluster is created when the profile requests it
  - external mode is intended to leave infra conversion and storage-node tainting disabled unless the profile explicitly needs them
  - destructive recovery is skipped on a healthy rerun unless explicitly forced
  - destructive recovery wipes the first 2 GiB, fixed BlueStore label offsets at 0, 1, 10, 100, and 1000 GiB, and the device tail
  - destructive recovery also purges /var/lib/rook/* and /var/lib/ceph/* on the infra nodes before reinstall
  - cleans up stale OperatorGroups in the Local Storage namespace before subscription to avoid OLM MultipleOperatorGroupsFound blocks
  - defaults to the pod network in this nested lab
  - does not assume ODF public-network Multus/macvlan is viable by default
- Keycloak deployment after ODF storage is available
- OpenShift OIDC auth through Keycloak while preserving breakglass access
- optional legacy LDAP auth and group sync, disabled by default
- OpenShift Virtualization deployment
- OpenShift Pipelines and Windows image-build lane setup
- Web Terminal installation
- AAP deployment and Keycloak OIDC integration
- Network Observability and Loki deployment
- validation
- the disconnected OperatorHub phase also checks that every cluster node can resolve mirror-registry.workshop.lan before the mirrored CatalogSource pods are applied, so registry DNS failures surface before the pull attempt
playbooks/day2/openshift-post-install-pipelines.yml
Purpose:
- install Red Hat OpenShift Pipelines
- prepare a namespace-local Windows EFI image-build lane for OpenShift Virtualization
Execution model:
- runs bastion-native against the cluster API
- installs openshift-pipelines-operator-rh from the mirrored cs-redhat-operator-index-v4-20 catalog source
- waits for:
  - Subscription
  - operator CSV
  - TektonPipeline/pipeline
  - TektonConfig/config
- ensures ocs-storagecluster-ceph-rbd is the default cluster StorageClass so Tekton Results can provision its PostgreSQL PVC
- creates the windows-image-builder namespace
- binds the pipeline service account to the edit role in that namespace
- downloads and applies the Red Hat windows-efi-installer pipeline manifest for the current 4.20 catalog stream
- renders a reusable example PipelineRun manifest into the execution workspace without launching a Windows build automatically
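An OLM Subscription against the mirrored catalog source would look roughly like this. The package and catalog source names come from this page; the target namespace, channel, source namespace, and approval mode are assumptions and should be checked against the role's actual manifest.

```yaml
# Sketch only: namespace, channel, sourceNamespace, and approval are assumptions.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-pipelines-operator-rh
  namespace: openshift-operators            # assumed install namespace
spec:
  name: openshift-pipelines-operator-rh     # package name in the catalog
  channel: latest                           # assumed channel
  source: cs-redhat-operator-index-v4-20    # mirrored catalog source
  sourceNamespace: openshift-marketplace    # assumed catalog namespace
  installPlanApproval: Automatic            # assumed approval mode
```

When a Subscription stays pending in a disconnected cluster, the source and sourceNamespace pair is the first thing to compare against the CatalogSource objects that the OperatorHub phase created.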
playbooks/day2/openshift-windows-server-build.yml
Purpose:
- render and launch a parameterized Windows Server build with the installed windows-efi-installer pipeline
Execution model:
- runs bastion-native against the cluster API
- requires OpenShift Pipelines and the windows-efi-installer pipeline to already be installed
- renders a Windows Server 2022 PipelineRun to:
  - generated/windows/windows-server-2022-pipelinerun.yaml
- applies that PipelineRun into windows-image-builder
- waits for the PipelineRun to start
Operational note:
- the playbook intentionally refuses to run until openshift_windows_build_iso_url is set to a real Windows Server ISO URL
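A guard of this kind is typically an early assert task. The following is a sketch of that idea only; the exact conditions the playbook checks are not documented here, so the length and URL-scheme tests below are assumptions.

```yaml
# Sketch only: the specific checks are assumptions, not the playbook's real guard.
- name: Refuse to run without a real Windows Server ISO URL
  ansible.builtin.assert:
    that:
      - openshift_windows_build_iso_url is defined
      - openshift_windows_build_iso_url | length > 0
      - openshift_windows_build_iso_url is match('^https?://')
    fail_msg: >-
      Set openshift_windows_build_iso_url to a reachable Windows Server ISO URL
      before launching the build.
```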
playbooks/day2/openshift-post-install-web-terminal.yml
Purpose:
- install the OpenShift Web Terminal Operator so the console shell is available
Execution model:
- installs the Red Hat web-terminal operator in openshift-operators
- relies on operator dependency resolution to install DevWorkspace
- waits for the Web Terminal CSV to reach Succeeded
- waits for the Web Terminal and DevWorkspace pods to become ready
- builds a custom tooling image on mirror-registry.workshop.lan
- pushes that image to mirror-registry.workshop.lan:8443/init/web-terminal-tooling-custom:latest
- merges mirror-registry auth into the cluster pull-secret
- rewrites DevWorkspaceTemplate/web-terminal-tooling to use the mirrored custom image
playbooks/day2/openshift-post-install-nmstate.yml
Purpose:
- install the Kubernetes NMState operator and create the cluster NMState instance early in the day-2 flow
Execution model:
- installs kubernetes-nmstate-operator in openshift-nmstate
- creates NMState/nmstate
- waits for the NMState namespace pods to become ready
- if the nmstate-handler daemonset is not fully ready, captures diagnostics, recycles only the non-ready handler pods, and retries once
- applies NodeNetworkConfigurationPolicies for:
  - VLAN 202 as the OpenShift Virtualization live-migration network
  - VLANs 300, 301, and 302 as VM data networks
- currently uses interface-name matching with enp1s0 as the parent uplink
- if the NodeNetworkConfigurationPolicies stay Progressing, captures daemonset and policy diagnostics, recycles only the non-ready handler pods, and rechecks policy availability once before failing
- is intended to run after LDAP/infra conversion and before ODF or later networking day-2 work
Design note:
- nmstate can also match the parent uplink by MAC address
- that approach is more robust for heterogeneous fleets, but it requires per-node policy generation because each node MAC is different
- the current lab intentionally keeps the simpler shared, name-based policy shape for teaching clarity
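A shared, name-based policy for the live-migration VLAN could look like the following. The VLAN ID and the enp1s0 parent come from this page; the policy name and the worker node selector are assumptions, and the lab's real policies may carry additional settings.

```yaml
# Sketch only: policy name and nodeSelector are assumptions.
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: vlan202-live-migration            # hypothetical policy name
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""    # assumed target nodes
  desiredState:
    interfaces:
      - name: enp1s0.202
        type: vlan
        state: up
        vlan:
          base-iface: enp1s0              # name-based parent match
          id: 202
```

Because the parent is matched by name, one policy covers every selected node; a MAC-based variant would need one generated policy per node, as the design note explains.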
playbooks/day2/openshift-post-install-aap.yml
Purpose:
- install Red Hat Ansible Automation Platform and integrate it with the Keycloak/IdM auth path
Execution model:
- installs ansible-automation-platform-operator in namespace aap
- creates AnsibleAutomationPlatform/workshop-aap
- keeps controller enabled and hub, EDA, and Lightspeed disabled
- uses ocs-storagecluster-ceph-rbd for the embedded PostgreSQL storage
- creates an IdM CA bundle secret using the required key name bundle-ca.crt
- creates or updates the Keycloak aap client in the existing realm
- creates or updates the Keycloak groups and aap-audience protocol mappers
- creates the Red Hat build of Keycloak gateway authenticator
- creates the access-openshift-admin AAP superuser authenticator map
- removes the legacy direct-LDAP authenticator when present
- validates AD-backed OIDC login after the gateway rollout when trust is enabled, otherwise validates the native IdM user path
Validated live result:
- route https://aap.apps.ocp.workshop.lan
- login page shows Red Hat build of Keycloak
- ad-ocpadmin@corp.lan authenticates through Keycloak/IdM on the live trust path
- the resulting AAP user has is_superuser: true
- a clean AAP teardown and redeploy was revalidated on the same OIDC path
playbooks/day2/openshift-post-install-virtualization.yml
Purpose:
- install OpenShift Virtualization for nested VM workloads on the OpenShift worker nodes
Execution model:
- installs the Red Hat kubevirt-hyperconverged operator in openshift-cnv
- creates HyperConverged/kubevirt-hyperconverged
- sets ocs-storagecluster-ceph-rbd as the default virt storage class and wires it into HyperConverged.spec.vmStateStorageClass
- installs node-healthcheck-operator from the mirrored cs-redhat-operator-index-v4-20 catalog source in openshift-workload-availability
- installs fence-agents-remediation from the mirrored cs-redhat-operator-index-v4-20 catalog source in openshift-workload-availability
- uses a cluster-wide OperatorGroup for the workload-availability namespace, which matches the operators' supported AllNamespaces install mode
- contains disabled-by-default scaffolding for:
  - NodeHealthCheck
  - FenceAgentsRemediationTemplate
- waits for:
  - the operator CSV to reach Succeeded
  - HyperConverged Available=True
  - KubeVirt Available=True
  - virt-handler daemonset readiness on the worker nodes
  - Node Health Check and FAR controller deployments to become Available
Validated live result:
- CSV kubevirt-hyperconverged-operator.v4.20.7 Succeeded
- KubeVirt phase Deployed
- virt-handler 3/3
- node-healthcheck-operator.v0.10.0 Succeeded
- fence-agents-remediation.v0.7.0 Succeeded
playbooks/day2/openshift-post-install-netobserv.yml
Purpose:
- install Network Observability and its Loki backend after ODF is available
Execution model:
- installs the Red Hat loki-operator in openshift-operators-redhat
- installs the Red Hat netobserv-operator in openshift-netobserv-operator
- creates an ODF-backed ObjectBucketClaim in netobserv
- converts the generated OBC secret/configmap into the Loki object-store secret
- deploys a LokiStack with tenants.mode: openshift-network
- places Loki components on the ODF storage nodes so the small worker nodes are not CPU-starved
- applies a FlowCollector using the default eBPF path, not the unsupported EbpfManager feature
Operational note:
- NetObserv Topology is a summarized flow graph, not a raw throughput meter
- use iperf3 output and pod/interface counters as the authoritative proof of line rate
playbooks/day2/openshift-post-install-validate.yml
Purpose:
- validate cluster health from inside the lab boundary
Execution model:
- stages the pinned oc binary and the generated kubeconfig on virt-01
- runs cluster checks there instead of on the outside control node
- verifies node, operator, CSR, and cluster version state
- refreshes the bastion helper kubeconfigs from the current generated cluster kubeconfig after validation succeeds
- refreshes the bastion helper kubeadmin-password alongside the kubeconfigs
- refreshes the helper copies for cloud-user as well as the IdM admins
- exports configmap/kube-root-ca.crt from the live cluster and installs that bundle into bastion system trust so oc login works without --insecure-skip-tls-verify
Data Files
vars/global/lab_switch_ports.yml
Purpose:
- canonical definition of the switchports/VLANs
- readable classroom-facing inventory of VLAN intent
This file drives:
- OVS internal interface creation
- host SVI routing assignment
- firewalld zone membership for routed VLANs
- libvirt network VLAN portgroup behavior
vars/guests/idm_vm.yml
Purpose:
- authoritative data model for the IdM VM and its guest configuration
Carries:
- VM identity and disk path
- CPU/memory
- access/login/cloud-init data
- RHSM registration data
- IPA install configuration
- Cockpit/session recording settings
- IPA groups/users used later for OpenShift demos
vars/guests/mirror_registry_vm.yml
Purpose:
- authoritative data model for the mirror-registry VM and registry service
Carries:
- VM identity and disk path
- RHSM registration and guest access data
- IPA client enrollment settings
- registry install paths and ports
- IdM-cert toggle
- bootstrap user settings
- tool download URLs
vars/global/mirror_content.yml
Purpose:
- declarative disconnected content model for OpenShift mirroring
Carries:
- mirror execution toggles
- mirror_mode
- low-level workflow selection
- destination registry namespace
- auth file paths
- workspace/archive paths
- OpenShift platform channel/version
- operator catalog
- operator package/channel list
vars/guests/openshift_cluster_vm.yml
Purpose:
- authoritative data model for the nested OpenShift VM shells
Carries:
- node names
- VM sizing
- root block-device paths
- additional ODF block-device paths for infra nodes
- libvirt network and portgroup selection
- optional future agent boot-media attachment data
vars/cluster/openshift_install_cluster.yml
Purpose:
- authoritative install matrix for OpenShift identity, VIPs, DNS, and per-node addressing
Carries:
- cluster name and full domain
- API and ingress VIPs
- API/API-int/apps DNS names
- shared service endpoints
- per-node FQDN, MAC, machine-network IP, and storage-network IP
Project Dependencies
requirements.yml
Purpose:
- declare Ansible collection dependencies required by the project
Current dependency:
- freeipa.ansible_freeipa
Host Roles
lab_host_base
Purpose:
- establish host package/service baseline
- enable the OVS repo required on RHEL 10.1
- disable conflicting defaults
Critical tasks:
- enforces the hypervisor hostname/FQDN and local host entry
- enables fast-datapath-for-rhel-10-x86_64-rpms
- installs:
  - cockpit-files
  - cockpit-machines
  - cockpit-podman
  - cockpit-session-recording
  - cockpit-image-builder
  - openvswitch3.6
  - libvirt
  - qemu-kvm
  - virt-install
  - guestfs and support tools
  - firewalld
  - pcp
  - pcp-system-tools
- enables libvirt/OVS/firewalld units that are present
- enables the PCP services used by Cockpit metrics:
  - pmcd
  - pmlogger
  - pmproxy
- disables the nftables service so firewalld owns policy
- destroys, disables, and undefines the default libvirt network
- persists kernel forwarding/rp_filter settings in /etc/sysctl.d
Why it matters:
- the lab depends on OVS from the Red Hat fast datapath repo
- the default virbr0 network would conflict with the intended topology
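As a rough illustration of the last two critical tasks, the fragment below removes the default libvirt network and persists IPv4 forwarding. It is a sketch under assumptions, not the role's actual task list: it assumes the community.libvirt and ansible.posix collections are installed, and the sysctl file name is hypothetical. The rp_filter settings would follow the same pattern; their values are not stated here.

```yaml
- name: Stop the default libvirt network if it is running
  community.libvirt.virt_net:
    name: default
    state: inactive

- name: Keep the default network from starting at boot
  community.libvirt.virt_net:
    name: default
    autostart: false

- name: Undefine the default libvirt network
  community.libvirt.virt_net:
    name: default
    state: absent

- name: Persist IPv4 forwarding under /etc/sysctl.d
  ansible.posix.sysctl:
    name: net.ipv4.ip_forward
    value: "1"
    sysctl_file: /etc/sysctl.d/90-lab-forwarding.conf   # hypothetical file name
    state: present
    reload: true
```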
lab_switch
Purpose:
- materialize the OVS side of the lab topology
Critical tasks:
- renders /usr/local/sbin/aws-metal-openshift-demo-net.sh
- installs /etc/systemd/system/aws-metal-openshift-demo-net.service
- enables and runs the oneshot service
What the generated script does conceptually:
- creates OVS bridge lab-switch
- creates internal OVS interfaces for VLAN-backed host SVIs
- labels switchports with descriptions from the switchport map
- restores the switch after reboot via systemd
Why it matters:
- this is the actual switch-like substrate for access/trunk behavior
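The three critical tasks above map naturally onto a template-plus-systemd pattern. The following is a minimal sketch of that pattern; the template source names are hypothetical, and the real role may structure these tasks differently.

```yaml
- name: Render the switch bring-up script
  ansible.builtin.template:
    src: aws-metal-openshift-demo-net.sh.j2        # hypothetical template name
    dest: /usr/local/sbin/aws-metal-openshift-demo-net.sh
    owner: root
    group: root
    mode: "0755"

- name: Install the oneshot unit that replays the script at boot
  ansible.builtin.template:
    src: aws-metal-openshift-demo-net.service.j2   # hypothetical template name
    dest: /etc/systemd/system/aws-metal-openshift-demo-net.service
    owner: root
    group: root
    mode: "0644"

- name: Enable and run the oneshot service
  ansible.builtin.systemd:
    name: aws-metal-openshift-demo-net.service
    daemon_reload: true
    enabled: true
    state: started
```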
lab_firewall
Purpose:
- configure routed behavior the Red Hat way with firewalld
Critical tasks:
- creates the lab zone
- sets the zone target to ACCEPT
- enables intra-zone forwarding
- places routed VLAN interfaces in lab
- places the uplink in external
- enables masquerade on external
- reloads firewalld only when state changed
Why it matters:
- it replaces the earlier direct nftables approach
- it is the host-side northbound NAT path for guests
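A minimal sketch of the zone layout using the ansible.posix.firewalld module follows. The inventory group and both interface names are hypothetical, intra-zone forwarding is left out, and parameter availability (target in particular) depends on the installed ansible.posix version, so treat this as an outline rather than the role's implementation. The handler mirrors the "reload only when state changed" behavior.

```yaml
- name: Sketch of the lab firewalld layout
  hosts: hypervisor                  # hypothetical inventory group
  become: true
  tasks:
    - name: Create the lab zone
      ansible.posix.firewalld:
        zone: lab
        state: present
        permanent: true
      notify: Reload firewalld

    - name: Set the lab zone target to ACCEPT
      ansible.posix.firewalld:
        zone: lab
        target: ACCEPT
        state: enabled
        permanent: true
      notify: Reload firewalld

    - name: Place a routed VLAN interface in the lab zone
      ansible.posix.firewalld:
        zone: lab
        interface: vlan100           # hypothetical SVI name
        state: enabled
        permanent: true
      notify: Reload firewalld

    - name: Place the uplink in the external zone
      ansible.posix.firewalld:
        zone: external
        interface: eth0              # hypothetical uplink name
        state: enabled
        permanent: true
      notify: Reload firewalld

    - name: Enable masquerade on the external zone
      ansible.posix.firewalld:
        zone: external
        masquerade: true
        state: enabled
        permanent: true
      notify: Reload firewalld

  handlers:
    - name: Reload firewalld
      ansible.builtin.command:
        cmd: firewall-cmd --reload
      changed_when: true
```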
lab_libvirt
Purpose:
- define the libvirt network that maps guest interfaces into lab-switch
Critical tasks:
- renders /etc/libvirt/lab-switch.xml
- stops and undefines any stale libvirt definition when the XML changes
- defines, starts, and autostarts the lab-switch network
Why it matters:
- libvirt portgroups are how VMs are attached as access/trunk style ports
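To make the portgroup idea concrete, here is a hedged sketch of an OVS-backed libvirt network with one access-style and one trunk-style portgroup. For brevity it passes the XML inline rather than rendering /etc/libvirt/lab-switch.xml as the role does. VLAN 100 for mgmt-access comes from this guide; the tag list on ocp-trunk is illustrative only.

```yaml
- name: Define the lab-switch network from inline XML
  community.libvirt.virt_net:
    command: define
    name: lab-switch
    xml: |
      <network>
        <name>lab-switch</name>
        <forward mode='bridge'/>
        <bridge name='lab-switch'/>
        <virtualport type='openvswitch'/>
        <!-- access-style port: untagged guest traffic lands in VLAN 100 -->
        <portgroup name='mgmt-access'>
          <vlan>
            <tag id='100'/>
          </vlan>
        </portgroup>
        <!-- trunk-style port: illustrative tag list, not the real VLAN set -->
        <portgroup name='ocp-trunk'>
          <vlan trunk='yes'>
            <tag id='100'/>
            <tag id='201'/>
          </vlan>
        </portgroup>
      </network>

- name: Start the lab-switch network
  community.libvirt.virt_net:
    name: lab-switch
    state: active

- name: Autostart the lab-switch network
  community.libvirt.virt_net:
    name: lab-switch
    autostart: true
```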
lab_cleanup
Purpose:
- revert the host to a simple post-package state without the lab switch
Critical tasks:
- destroys/undefines the libvirt lab-switch network
- stops/disables the aws-metal-openshift-demo-net.service
- deletes OVS bridge lab-switch
- removes the lab firewalld zone
- removes external masquerade
- disables IPv4 forwarding at runtime
- deletes generated unit/script/XML artifacts
Why it matters:
- the environment is intentionally rebuildable and disposable
IDM Roles
idm
Purpose:
- provision the IdM guest on the hypervisor
Critical tasks:
- validates orchestration data
- validates RHSM inputs if guest subscription is enabled
- checks the block device exists
- optionally checks the source QCOW2 image
- renders cloud-init:
  - meta-data
  - user-data
  - network-config
- optionally writes the base image to the raw block device with qemu-img convert
- rereads the partition table
- builds the cloud-init ISO with xorriso
- defines the guest with virt-install
Important behavior:
- image paths are evaluated on the hypervisor, not on the control node
- the primary NIC is attached through libvirt to VLAN 100 via mgmt-access
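The image-seeding and guest-definition steps can be sketched as plain command tasks. Every variable name below (idm_base_image, idm_disk_path, idm_runtime_dir, idm_write_base_image, idm_vm_name, idm_memory_mb, idm_vcpus) is hypothetical, partprobe is an assumed choice for rereading the partition table, and the virt-install options are a plausible minimum rather than the role's real argument list. All commands run on the hypervisor, which is why the image path is resolved there.

```yaml
- name: Write the base image onto the raw block device
  ansible.builtin.command:
    argv:
      - qemu-img
      - convert
      - -O
      - raw
      - "{{ idm_base_image }}"
      - "{{ idm_disk_path }}"
  when: idm_write_base_image | bool

- name: Reread the partition table
  ansible.builtin.command:
    argv:
      - partprobe                    # assumption: the role may use another tool
      - "{{ idm_disk_path }}"

- name: Build the NoCloud seed ISO
  ansible.builtin.command:
    chdir: "{{ idm_runtime_dir }}"
    argv:
      - xorriso
      - -as
      - mkisofs
      - -output
      - seed.iso
      - -volid
      - cidata
      - -joliet
      - -rock
      - user-data
      - meta-data
      - network-config

- name: Define the guest on the mgmt-access portgroup
  ansible.builtin.command:
    argv:
      - virt-install
      - --name
      - "{{ idm_vm_name }}"
      - --memory
      - "{{ idm_memory_mb }}"
      - --vcpus
      - "{{ idm_vcpus }}"
      - --import
      - --disk
      - "path={{ idm_disk_path }},format=raw"
      - --disk
      - "path={{ idm_runtime_dir }}/seed.iso,device=cdrom"
      - --network
      - "network=lab-switch,portgroup=mgmt-access"
      - --osinfo
      - "detect=on,require=off"
      - --noautoconsole
```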
idm_guest
Purpose:
- configure the guest into a functioning IdM server for the workshop
Critical tasks:
- updates the guest and reboots if needed
- installs the IdM, Cockpit, and session-recording packages
- enables persistent journald storage
- enables cockpit.socket
- manages guest firewall services
- runs ipa-server-install if the server is not already configured
- acquires a Kerberos admin ticket for follow-up IPA CLI work
- applies DNS forwarders with ipa dnsconfig-mod
- creates static infrastructure DNS records in the base workshop.lan zone for management access, including virt-01.workshop.lan
- configures named to allow query, recursion, and cache access from all lab CIDRs defined in the OVS VLAN model
- creates IPA groups and users for workshop use
- creates group-scoped IPA password policies for the seeded lab-user groups
- creates an IPA sudo rule for the admins group that permits passwordless execution of any command on any host
- resets the managed user passwords
- resets seeded-user password expiration explicitly because IdM password policy is not retroactive
- manages IPA group membership
- installs KRA support
- installs ipa-server-trust-ad
- renders named extension fragments to define trusted networks and recursion policy
- configures all-user session recording with authselect/SSSD/tlog
Key outputs of this role:
- authoritative DNS for workshop.lan
- base-domain infrastructure records such as virt-01.workshop.lan -> 172.16.0.1
- named ACLs and options that allow authoritative queries, recursion, and cached responses from all lab VLAN CIDRs
- trusted recursion for all lab CIDRs
- IdM users/groups for later OpenShift auth demos
- group-scoped lab password policy for admins, openshift-admin, virt-admin, and developer
- an admins-nopasswd-all IPA sudo rule that grants admins unrestricted passwordless sudo on enrolled lab hosts
- Cockpit and session recording enabled
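The admins-nopasswd-all rule could be expressed with the freeipa.ansible_freeipa collection roughly as follows. This is a sketch: the role may drive the IPA CLI instead, and the ipaadmin_password variable is an assumption about how credentials are supplied.

```yaml
- name: Allow admins passwordless sudo for any command on any host
  freeipa.ansible_freeipa.ipasudorule:
    ipaadmin_password: "{{ ipaadmin_password }}"   # assumed credential variable
    name: admins-nopasswd-all
    group:
      - admins
    hostcategory: all
    cmdcategory: all
    sudooption:
      - "!authenticate"       # sudoers option that skips the password prompt
    state: present
```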
idm_cleanup
Purpose:
- remove the IdM VM cleanly
Critical tasks:
- destroys the domain if present
- undefines the domain including NVRAM
- removes the runtime directory
- can optionally wipe the start of the raw block device
idm_openshift_dns
Purpose:
- create the OpenShift forward and reverse DNS data in IdM from the cluster matrix
Critical tasks:
- creates the cluster forward zone
- creates the machine-network and storage-network reverse zones
- creates API and ingress VIP A and PTR records
- creates forward and reverse records for all 9 OpenShift nodes
- creates distinct storage-side names for VLAN 201 addresses
Implementation note:
- this role uses the official freeipa.ansible_freeipa collection
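A hedged sketch of the zone and record pattern with that collection is shown below. The variable names (cluster_domain, machine_network_cidr, api_vip, ipaadmin_password) are hypothetical stand-ins for values that would come from the cluster matrix; node and storage records would repeat the last task in a loop.

```yaml
- name: Ensure the cluster forward zone exists
  freeipa.ansible_freeipa.ipadnszone:
    ipaadmin_password: "{{ ipaadmin_password }}"
    name: "{{ cluster_domain }}"
    state: present

- name: Ensure the machine-network reverse zone exists
  freeipa.ansible_freeipa.ipadnszone:
    ipaadmin_password: "{{ ipaadmin_password }}"
    name_from_ip: "{{ machine_network_cidr }}"
    state: present

- name: Ensure the API VIP has A and PTR records
  freeipa.ansible_freeipa.ipadnsrecord:
    ipaadmin_password: "{{ ipaadmin_password }}"
    zone_name: "{{ cluster_domain }}"
    name: api
    record_type: A
    record_value: "{{ api_vip }}"
    create_reverse: true
    state: present
```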
idm_openshift_dns_cleanup
Purpose:
- remove the OpenShift forward and reverse DNS data from IdM
Critical tasks:
- removes all managed API, ingress, node, and storage DNS records
- removes the reverse zones
- removes the forward cluster zone
Implementation note:
- this role also uses the official freeipa.ansible_freeipa collection
mirror_registry_cleanup
Purpose:
- remove the mirror-registry VM cleanly
Critical tasks:
- destroys the domain if present
- undefines the domain including NVRAM
- removes the runtime directory
- can optionally wipe the start of the raw block device
OpenShift Cluster Roles
openshift_cluster
Purpose:
- build the nested OpenShift cluster VM shells on the hypervisor
Critical tasks:
- validates node definitions
- validates root and additional block devices
- optionally validates requested agent boot media
- checks for existing libvirt domains
- optionally recreates existing domains
- defines the guests with virt-install
- defaults guest graphics to vnc on 127.0.0.1 so Cockpit can be used as the primary visual boot console
Current modeled topology:
- 3 control-plane nodes
- 3 infra nodes
- 3 worker nodes
- all attached to lab-switch via portgroup ocp-trunk
- infra nodes also receive one extra ODF data disk each
Why it matters:
- it creates the reusable pre-install VM scaffolding needed for the later agent-based OpenShift workflow
openshift_cluster_cleanup
Purpose:
- remove only the OpenShift cluster shell VMs
Critical tasks:
- checks whether each OpenShift domain exists
- destroys each domain if present
- undefines each domain and its NVRAM
- wipes root and additional block devices by default
- this default was adopted after live validation showed that libvirt-domain teardown without disk wiping could leave enough state behind to interfere with repeatable fresh agent/RHCOS boot testing
Why it matters:
- it provides a fast reset of the disposable cluster layer while preserving the expensive support services and host networking
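The per-node teardown sequence might look like the fragment below. It is a sketch under assumptions: it expects an outer loop to supply a hypothetical node dictionary with name and disks keys, and it uses wipefs as one possible wiping method since the tool the role actually uses is not stated here.

```yaml
- name: Read the current domain state
  ansible.builtin.command:
    argv:
      - virsh
      - domstate
      - "{{ node.name }}"
  register: domain_state
  changed_when: false
  failed_when: false            # a missing domain is not an error during cleanup

- name: Destroy the domain if it is running
  ansible.builtin.command:
    argv:
      - virsh
      - destroy
      - "{{ node.name }}"
  when: domain_state.rc == 0 and 'running' in domain_state.stdout

- name: Undefine the domain together with its NVRAM
  ansible.builtin.command:
    argv:
      - virsh
      - undefine
      - --nvram
      - "{{ node.name }}"
  when: domain_state.rc == 0

- name: Wipe signatures from root and additional block devices
  ansible.builtin.command:
    argv:
      - wipefs                  # assumption: stand-in for the role's wipe step
      - --all
      - "{{ item }}"
  loop: "{{ node.disks }}"
```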
Mirror Registry Roles
mirror_registry
Purpose:
- provision mirror-registry.workshop.lan on the hypervisor
Critical tasks:
- validates orchestration and RHSM data
- checks the mirror-registry block device exists
- inspects the raw block device partition table
- auto-detects whether an empty disk must be seeded from the base image
- renders cloud-init metadata/user-data/network-config
- optionally writes the base image to the raw block device
- rereads the partition table
- builds the cloud-init ISO
- defines the guest with virt-install
Important behavior:
- unlike the IDM role, this role auto-detects an empty raw disk and seeds it
- that was added because the first build attempt booted an empty EBS volume
mirror_registry_guest
Purpose:
- make the mirror-registry guest usable for disconnected OpenShift mirroring
This is currently the densest role in the project.
Critical tasks, phase by phase:
1. Guest baseline
- updates the guest and reboots if needed
- installs client and registry packages
- ensures firewalld is enabled
- on RHEL 10, creates /etc/containers/containers.conf.d/ and forces root Podman to use cgroupfs before the Quay appliance install so the registry containers do not stall under the default systemd cgroup manager
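The cgroup-manager override amounts to a small containers.conf drop-in. A minimal sketch, assuming a hypothetical drop-in file name:

```yaml
- name: Create the containers.conf drop-in directory
  ansible.builtin.file:
    path: /etc/containers/containers.conf.d
    state: directory
    owner: root
    group: root
    mode: "0755"

- name: Force root Podman to use the cgroupfs cgroup manager
  ansible.builtin.copy:
    dest: /etc/containers/containers.conf.d/50-cgroupfs.conf   # hypothetical file name
    owner: root
    group: root
    mode: "0644"
    content: |
      [engine]
      cgroup_manager = "cgroupfs"
```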
2. IdM integration
- checks whether the guest is already enrolled
- waits for the IdM HTTPS endpoint
- runs ipa-client-install --force-join
Why it matters:
- the registry host becomes part of the lab identity model
3. Registry filesystem/workspace prep
- creates Quay paths
- creates download and extraction paths
- creates auth/workspace/archive paths used by oc-mirror
4. Certificate workflow
- if use_idm_cert: true
  - ensures certmonger is running
  - ensures the HTTP/<fqdn> service principal exists in IPA
  - requests and tracks an IdM-issued certificate with ipa-getcert
  - stages the certificate chain and key under /var/lib/mirror-registry/install-certs
  - installs a helper that restarts the Quay containers when cert material changes
Important nuance:
- future fresh builds default to IdM-issued certs
- the appliance installer now consumes staged cert material rather than reading from the mutable Quay config directory
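The request-and-track step can be sketched as a single ipa-getcert call. The certificate and key file names and the post-save helper path are hypothetical, and the role may request into a different location before staging; only the staging directory and the HTTP/<fqdn> principal come from the description above.

```yaml
- name: Request and track an IdM-issued certificate for the registry
  ansible.builtin.command:
    argv:
      - ipa-getcert
      - request
      - -f
      - /var/lib/mirror-registry/install-certs/registry.crt   # hypothetical file name
      - -k
      - /var/lib/mirror-registry/install-certs/registry.key   # hypothetical file name
      - -K
      - "HTTP/{{ ansible_facts['fqdn'] }}"
      - -D
      - "{{ ansible_facts['fqdn'] }}"
      - -C
      - /usr/local/sbin/restart-quay-on-cert-change           # hypothetical helper path
    # Skip the request once a certificate file is already in place.
    creates: /var/lib/mirror-registry/install-certs/registry.crt
```

The -C option registers a post-save command, which is how certmonger can trigger a container restart after each renewal.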
5. Tooling install
- downloads and extracts:
  - mirror-registry
  - openshift-client
  - oc-mirror
- stages the full mirror-registry bundle
- installs oc, kubectl, and oc-mirror
6. Quay install and trust
- installs Quay if it is not already healthy
- skips appliance reinstall when the registry health endpoint already returns 200
- removes stale redis_pass secret before appliance reinstall attempts
- trusts either:
  - the mirror-registry self-signed root CA, or
  - the IdM CA
- updates both system trust and container runtime trust
- waits for port 8443 before attempting registry login
- logs in to the local registry with Podman
7. Disconnected content modeling
- derives the effective mirroring workflow from:
  - mirror_mode
  - optionally the low-level workflow
- renders imageset-config.yaml
- validates auth and archive prerequisites based on workflow
- merges the Red Hat pull secret with the local registry auth when needed
- builds the right oc-mirror --v2 command line
- passes the configured mirror parallelism to oc-mirror:
  - mirror_registry_oc_mirror_parallel_images
  - mirror_registry_oc_mirror_parallel_layers
- can run:
  - direct registry mirroring
  - m2d
  - d2m
Current supported mirror_mode wrapper values:
- direct: direct connected mirroring
- portable: m2d
- import: d2m
- advanced: use the explicit low-level workflow value
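The wrapper-to-workflow derivation can be pictured as a small lookup in vars. This is an illustrative sketch: the map name, the mirror_effective_workflow name, and the internal label used for direct mirroring are hypothetical; only the mirror_mode values and the m2d/d2m targets come from the list above.

```yaml
mirror_mode: portable          # one of: direct, portable, import, advanced
workflow: m2d                  # only consulted when mirror_mode is advanced

# Hypothetical lookup from wrapper value to low-level workflow.
mirror_mode_workflows:
  direct: direct               # hypothetical label for direct registry mirroring
  portable: m2d
  import: d2m

# advanced falls through to the explicit low-level workflow value.
mirror_effective_workflow: >-
  {{ workflow if mirror_mode == 'advanced'
     else mirror_mode_workflows[mirror_mode] }}
```

With the values shown, mirror_effective_workflow resolves to m2d.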
Current disconnected content intent:
- OpenShift 4.20 platform
- ODF family
- OpenShift Logging
- Loki
- Network Observability
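An imageset-config.yaml expressing that intent for oc-mirror v2 might look like the sketch below. The channel name, catalog reference, and package names are assumptions chosen to match the listed intent, not the project's rendered file; the 4.20.15 version is the one reported in the live status notes.

```yaml
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    channels:
      - name: stable-4.20            # assumed channel name
        minVersion: 4.20.15
        maxVersion: 4.20.15
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.20   # assumed catalog
      packages:                      # illustrative package names
        - name: odf-operator
        - name: cluster-logging
        - name: loki-operator
        - name: netobserv-operator
```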
Current live status:
- controller-side pull secret copy is implemented
- merged auth generation is implemented
- oc-mirror dry-run has passed
- portable (m2d) completed successfully on the live guest
- the observed archive for 4.20.15 plus the current operator set was about 95 GiB
- same-host practical sizing was revised to 400 GiB for mirror-registry
- mirror guest memory was revised to 8192 MiB with 4 vCPU; the mirror job is normally quiet after build and the oc-mirror parallelism is now explicit and tunable
Operational helper:
- bastion installs /usr/local/bin/track-mirror-progress
- bastion installs /usr/local/bin/track-mirror-progress-tmux
- it reports:
  - current runner state
  - latest Ansible task
  - guest root filesystem usage
  - archive/workspace size
  - active oc-mirror process state
  - simple sizing guidance derived from the observed archive size
- subsequent runs also write guest-side logs such as:
  - /var/log/oc-mirror-m2d.log
  - /var/log/oc-mirror-d2m.log
- imported registry payload growth should be checked in the Quay Podman volume, especially:
  - /var/lib/containers/storage/volumes/quay-storage/_data
- observed imported Quay content footprint after d2m: about 82 GiB
Validation Practices In Use
The project has been routinely checked with:
- ansible-playbook --syntax-check
- ansible-lint
- yamllint
- shellcheck where appropriate
- make validate / scripts/validate-orchestration.sh
The validation lane now also includes scoped play-contract checks so cross-play variable leaks are caught before runtime, especially in the workstation-to-bastion handoff and support-service phases.
The host prep, IdM build, mirror-registry build, and OpenShift DNS populate/cleanup workflows have all been exercised on a real AWS metal host.
The recent orchestration hardening also moved several late failures forward:
- support-service DNS publication is now validated authoritatively from IdM
- disconnected OperatorHub checks node-side resolution of mirror-registry.workshop.lan before mirrored catalogs are applied
- fresh-install control-plane bootstrap recovery is codified in playbooks/cluster/openshift-install-wait.yml
The current cluster/auth proof points are stronger than the original docs used to claim:
- the default cluster auth baseline is now HTPasswd breakglass plus Keycloak OIDC, not direct OpenShift LDAP
- the post-install replay path has completed cleanly on the current cluster
- repo-wide ansible-lint -p is clean
Known Gaps
Note
These are acknowledged design boundaries, not bugs. They are tracked here so contributors know where the automation stops and manual decisions begin.
- In-place migration of an already-running self-signed registry to IdM-issued certs is only partially automated. Clean installs that start on the IdM-cert path are not the gap.
- The final certification bar is still outstanding: one uninterrupted playbooks/site-lab.yml run from a deliberate teardown boundary on the current codebase, without live repair work during the attempt.