Issues Ledger
Executive summary:
- all issues listed below are closed
- the cluster build plus current day-2/auth path have been validated on a live environment
- recent fresh-path rebuilds have continued to expose additional orchestration defects, so the final zero-intervention end-to-end playbooks/site-lab.yml certification run is still pending
- this page is now a historical record of fixes, not an open-triage queue
This ledger records problems that showed up during real rebuild and validation work and were fixed in Git.
How To Use This Page
Use this as the closed-case companion to INVESTIGATING.
When something breaks:
- check whether it is still an open investigation
- check whether it already failed before in a similar phase
- jump to the fixing commit if the symptom matches
Phase Legend
- Bootstrap: outer host preparation and first-hop orchestration
- IdM: identity, DNS, CA, and FreeIPA work
- Bastion: bastion execution model and staging behavior
- Support VM: shared VM lifecycle behavior outside the cluster nodes
- Mirror: disconnected content, Quay, and mirror workflow behavior
- Cluster: day-1 install and nested-cluster shell behavior
- Day-2: post-install platform and operator configuration
- Docs: documentation synchronization and operator guidance
Issues Table
| Phase | Issue | Symptom / Impact | Fix Summary | Commit |
|---|---|---|---|---|
| Bootstrap | Uncommitted rebuild drift and wrong hypervisor target state | Rebuild confidence was low because orchestration changes were not committed, inventory pointed at the wrong virt-01 address, and OVS trunking did not match the intended VLAN set. | Committed the pending orchestration work, updated inventory to the live host, and aligned OpenShift trunk ports with the intended VLAN model. | fe08d77 |
| Bootstrap | Fresh-host bootstrap and bastion handoff fragility | Fresh virt-01 rebuilds exposed RHSM, bastion staging, and handoff problems that made the first reflection checkpoint brittle. | Hardened host bootstrap, fresh-host registration behavior, /dev/ebs handling, and bastion staging/handoff. | cbfa279 |
| Bastion | Bastion-local execution ambiguity | Lab runs looked like they were still using the workstation as the control node, which obscured where failures were actually happening. | Documented and normalized the bastion-local execution model for site-lab.yml. | 91f9e5f |
| IdM | Immediate IdM password expiry for seeded users | IdM users such as sysop could authenticate inconsistently because their passwords expired immediately after provisioning. | Added a group-scoped IdM password-policy model and documented the non-retroactive behavior. | 5bcd4e1 |
| IdM | Missing passwordless sudo policy for IdM admins | The intended admins group privilege model was not present, so lab admins still needed authentication or lacked a consistent host-wide rule. | Added the IdM admins sudo policy and enrolled sysop accordingly. | c321b69 |
| IdM | virt-01 not resolvable on the management SVI from IdM | Bastion-side SOCKS/Cockpit workflows could not reliably resolve virt-01.workshop.lan to the management VLAN interface. | Added base-domain IdM DNS records for virt-01.workshop.lan -> 172.16.0.1. | 2c13974 |
| IdM | IdM DNS policy too narrow for lab networks | Recursive/query behavior from all OVS-routed lab CIDRs was not explicitly allowed. | Extended IdM named policy to allow lab-network query, recursion, and cache access. | 8a4f8a8 |
| Bastion | Bastion missing Cockpit non-virtualization modules and PCP services | Bastion Cockpit lacked the expected files/packagekit/podman/session-recording/image-builder modules and PCP-backed metrics. | Expanded bastion guest bootstrap to install the Cockpit add-ons and PCP services. | ed43ff4 |
| Support VM | Support guests retained cloud-init media too long | Support VMs could carry cidata longer than intended, leaving sensitive bootstrap data attached until later cleanup. | Moved support-guest CD-ROM cleanup before the first update-triggered reboot. | 6693358 |
| Bastion | Bastion staging drift and day-2 NMState fragility | Tar/unarchive staging left the bastion with stale project content across reruns, and the NMState uplink assumption was not stable. | Switched bastion staging to rsync, preserved bastion-side generated content, and hardened the NMState/VLAN flow. | 7eb86ae |
| Cluster | OpenShift install media remained attached after provisioning | Cluster nodes retained agent.x86_64.iso, creating a reboot timebomb that could re-enter day-1 installer mode. | Moved OpenShift install-media cleanup earlier and fixed the validation/cleanup path so agent ISOs are detached and boot order returns to disk. | df873d8 |
| Mirror | Disconnected OperatorHub pivot failed from the bastion | OperatorHub pivot orchestration still assumed the wrong execution context and mirrored-artifact pathing. | Fixed the disconnected OperatorHub role for bastion execution and mirrored-catalog consumption. | 1b5e610 |
| Mirror | Disconnected mirror workflow defaulted to direct mirror-to-registry | The default content path was still geared toward partially disconnected validation instead of strict m2d/d2m. | Changed the documented and orchestrated default to the portable workflow and recorded sizing guidance. | f73d231 |
| Mirror | No practical visibility into long-running oc-mirror phases | Operators could not easily tell whether m2d/d2m was active, stalled, or complete. | Added the mirror progress helper, tmux dashboard, and guest-side oc-mirror log paths. | cc04b7c |
| Mirror | Quay payload footprint was being judged from the wrong path | /opt/quay-install looked tiny even after import because the real content lives in Podman volumes, which obscured sizing conclusions. | Documented the real Quay payload location and the observed imported footprint. | d048e0a |
| IdM | Hand-rolled IdM server, KRA, and identity-management tasks were brittle on rebuild | Fresh idm-01 rebuilds exposed KRA hangs, DNS-forwarder validation surprises, and weak idempotence around user/group/policy/sudo management. | Replaced the server/KRA install path with the FreeIPA server role and moved users, groups, password policies, and sudo rules onto FreeIPA modules. | 18c1733 |
| Bootstrap | Fresh-host login ergonomics and disconnected day-2 defaults were incomplete | Fresh AWS hypervisors could leave ec2-user locked for Cockpit, RHEL guests were not consistently enrolled into Insights, IdM client hosts were missing mkhomedir/SSSD sudo integration, bastion users did not land with usable OpenShift tools or kubeconfig wiring, and the post-install baseline still left the disconnected OperatorHub pivot off by default. | Unlocked ec2-user in first-boot and bootstrap, added Insights registration to all RHEL hosts after RHSM, enabled oddjobd plus authselect with-mkhomedir with-sudo on IdM client hosts, added the bastion helper shell environment and home-directory link layout, and made the disconnected OperatorHub pivot part of the default post-install baseline. | ef28ebf |
| Mirror | Portable mirror workflow did not run d2m after m2d | Running mirror_mode: portable pulled content to the disk archive (m2d) but never pushed it into the local Quay registry (d2m). A second manual invocation with mirror_mode: import was required, but that second pass was not orchestrated. | Added a d2m import phase to the mirror_registry_guest role that runs automatically after m2d when mirror_mode == 'portable'. | 95d7420 |
| Mirror | Mirror registry CD-ROM eject failed on re-runs | Re-running the mirror-registry playbook after the initial build failed because virsh change-media --eject returned an error when the cloud-init CD-ROM device had already been detached on a prior run. | Made the CD-ROM eject and XML removal tasks idempotent by tolerating "No disk found" / "did not find" errors. | 95d7420 |
| Mirror | MetalLB operator missing from mirrored catalog | The MetalLB operator was not included in the mirrored operator set, so it could not be installed from the disconnected OperatorHub. | Added metallb-operator (stable channel) to vars/global/mirror_content.yml. | 95d7420 |
| Day-2 | LDAP smoke test used wrong TLS certificate authority | The oc login in the post-LDAP smoke test passed the IdM CA as --certificate-authority, but the OCP API server cert is issued by the cluster's own CA, not IdM. This caused every background run to fail immediately with "unknown authority". | Replaced --certificate-authority=<idm-ca> with --insecure-skip-tls-verify for the smoke test login. | 2bb11d7 |
| Day-2 | Infra node conversion role only labeled nodes | The openshift_post_install_infra role applied the node-role.kubernetes.io/infra label but did not move platform workloads: ingress, monitoring, and registry remained on worker nodes. | Added IngressController nodePlacement patch, cluster-monitoring-config ConfigMap, and imageregistry nodeSelector patch. Added convergence waits for ingress pods Running and monitoring CO Progressing=False. Fixed ansible.builtin.command stdin limitation by switching the ConfigMap apply to ansible.builtin.shell with a heredoc. | 4419acb |
| Day-2 | LDAP smoke test did not wait for OAuth pod convergence | The smoke test oc login could fire while the authentication cluster operator was still rolling out new OAuth pods with the LDAP IDP config. The CO's Available=True gate (present since role creation) is not sufficient; the CO can be Available via old pods while new pods are still loading. | Added a Progressing=False wait (60 retries × 10 s) immediately before the smoke test so the login only runs once the OAuth pods have fully reloaded the IDP configuration. | 0612a82 |
| Docs | manual-process.md out of sync with automation codebase | Stale project paths, wrong tool symlink paths, duplicate section numbering, missing workload movement in LDAP/infra section, upstream catalog source names instead of disconnected ones, OperatorHub pivot still marked as optional, missing metallb-operator in mirror content. | Comprehensive rewrite: fixed paths, renumbered sections 23-36, added infra workload movement + LDAP smoke test + OAuth convergence wait to section 24, updated all subscription sources to disconnected catalog names, changed OperatorHub pivot to default baseline, added metallb-operator and push_and_run.sh subsection. | 564b052 |
| Day-2 | Nine day-2 roles used the wrong playbook_dir path for oc and kubeconfig | Roles openshift_post_install_virtualization, web_terminal, pipelines, netobserv, aap, idm_certs, idm_certs_cleanup, openshift_windows_server_build, and lab_suspend all derived the local oc binary and kubeconfig paths using {{ playbook_dir }}/generated/.... Since playbook_dir resolves to playbooks/day2/, this produced a non-existent path and every role failed its pre-flight assertion on the first run. | Replaced all nine occurrences with {{ lab_execution_tools_root }} and {{ lab_execution_ocp_generated_root }}, matching the pattern used by the NMState and infra roles. | 07e6cab |
| Day-2 | ODF console plugin wait referenced odf-client-console which was removed in 4.20 | The openshift_post_install_odf_console_plugins default included odf-client-console, but that plugin was dropped from ODF 4.20. The wait task timed out after 5 minutes waiting for a resource that will never appear. | Removed odf-client-console from the defaults list, leaving only odf-console. | 75ed2c2 |
| Day-2 | ODF console plugin wait used .results inside a loop + retries context | The until clause on "Wait for ODF console plugin resources to exist" referenced register_var.results, but .results is only populated after all loop iterations complete; it does not exist during per-item retry evaluation. This caused an Ansible template evaluation error on every retry. | Changed until to rc == 0. | 07e6cab |
| Day-2 | OLM blocked by duplicate OperatorGroup in openshift-local-storage | A second OperatorGroup named openshift-local-storage-9kzfh existed in the namespace alongside the one created by the role. OLM refuses to process subscriptions when MultipleOperatorGroupsFound, so no InstallPlan was ever created and the LSO CSV wait timed out. | Added a cleanup task in openshift_post_install_odf that deletes any OperatorGroups in openshift-local-storage whose name differs from the role-managed one, running before the subscription is applied. | 007c920 |
| Day-2 | Day-2 execution order: ODF must precede Virtualization and NetObserv | OpenShift Virtualization annotates ocs-storagecluster-ceph-rbd-virtualization as the default virt storage class; NetObserv requires a NooBaa S3 bucket for Loki. Both operations fail if ODF has not been deployed. The original openshift-post-install.yml task order put ODF after Virtualization, Web Terminal, and Pipelines. | Moved ODF to run immediately after NMState in openshift-post-install.yml. Updated manual-process.md to match: ODF is now section 26, Virtualization 27, Web Terminal 28. | 446c456 |
| Day-2 | Disconnected OperatorHub pivot applied only the Red Hat catalog; community catalog never created | The pivot role rendered and applied a single catalogsource.yaml.j2 hardcoded to cs-redhat-operator-index-v4-20. The community-operator-index catalog (cc-redhat-operator-index-v4-20) was never applied, so NHC and FAR subscriptions stalled with targeted catalogsource openshift-marketplace/cc-redhat-operator-index-v4-20 missing. The community catalog image was present in the mirror registry; only the CatalogSource object was missing. | Extended the pivot role to drive all per-catalog operations (render, apply, attach pull secret, SA patch, delete pod, wait READY) via a loop over openshift_disconnected_operatorhub_catalog_configs, which now includes both the Red Hat and community catalog entries. Applied the community CatalogSource manually to unblock the current run. | 369d8d8 |
| Cluster | Agent ISO cleanup regressed: all 9 cluster VMs retained install media after provisioning | Commit d82defe refactored site-lab.yml to separate install-wait from day-2, but dropped the validation/cleanup import that df873d8 had placed between them. The cleanup ended up at the end of openshift-post-install.yml (day-2 validation role), which only runs after all operators complete. All 9 cluster VMs kept agent.x86_64.iso on sdc with cdrom first in boot order. Rebooting ocp-infra-01 for a vCPU resize caused it to re-enter the day-1 agent installer instead of booting from disk. | Restored the cleanup by adding import_playbook: maintenance/detach-install-media.yml to site-lab.yml immediately after openshift-install-wait.yml. Added media detach as the first task in openshift-post-install.yml as a safety net. Manually ejected media and fixed boot order on all 9 VMs from the hypervisor. | 007c920 |
| Day-2 | Infra nodes undersized at 12 vCPU for full day-2 workload | With ODF (Ceph MDS, MON, OSD, MGR, RGW), ingress controller, monitoring stack, image registry, and LokiStack all targeting infra nodes, 12 vCPU per infra node (11500m allocatable) was fully saturated at 95-96% CPU requests. LokiStack compactor and ingester pods could not schedule. | Increased infra node vCPU allocation from 12 to 16 in vars/guests/openshift_cluster_vm.yml. Rolling resize of live VMs via virt-xml --edit --vcpus 16 + graceful reboot. | 007c920 |
| Day-2 | FlowCollector openShiftAutoDetect rendered as YAML string instead of boolean | The flowcollector.yaml.j2 template used a >- YAML folded block scalar for the openShiftAutoDetect field. The Jinja2-rendered value (true) was parsed by the YAML loader as a string, causing spec.processor.subnetLabels.openShiftAutoDetect: Invalid value: "string": ... must be of type boolean. | Replaced the multi-line >- block scalar with an inline expression passed through the bool filter so the field renders as a YAML boolean. | |
| Day-2 | AAP LDAP authenticator curl hung with wrong -u value | The Create AAP LDAP authenticator when absent task used argv: [curl, -sku, -k, admin:<password>, ...]. In combined short flags -sku, the -u flag consumed the next arg (-k) as the username:password, while the real credentials became a positional URL argument. Curl hung trying to resolve admin:<password> as a URL scheme. The shell-based curl calls (lookup tasks) worked fine because curl -sku admin:<password> -k ... is parsed correctly in bash. | Removed the redundant -k from the argv (already included in -sku), so the next arg after -sku is the credentials. | 007c920 |
| Day-2 | AAP controller-web race condition: deployment NotFound immediately after CR existence wait | The Wait for AAP controller deployment to exist task polled for the automationcontroller CR becoming available, then immediately ran oc rollout status deployment/workshop-aap-controller-web. The deployment object had not yet been created by the operator, causing NotFound on every attempt. The race hit twice across two playbook runs. | Added an explicit Wait for AAP controller-web deployment object to exist task (90 retries × 10 s) between the CR wait and the rollout status call, matching the pattern already used for the gateway deployment. | 007c920 |
| Day-2 | IdM ingress wildcard cert could not be issued via ipa cert-request | ipa cert-request validates every CSR subject CN and DNS SAN against Kerberos principals. *.apps.ocp.workshop.lan can never be a valid Kerberos principal, so all wildcard cert requests are unconditionally rejected. A preceding issue also caused ipa service-add to silently fail because the apps.ocp.workshop.lan host object did not exist in IdM (it's a VIP, not a registered client); service-add --force alone does not skip the host existence check. | Rewrote request-idm-certs.sh.j2 to bypass ipa cert-request entirely. IPA housekeeping (DNS A record, ipa host-add --force, ipa service-add, cert profile import) is still done via the IPA CLI. The CSR is then signed directly with OpenSSL using the IdM CA signing key extracted from the Dogtag NSS database (pk12util → openssl pkcs12). -copy_extensions copyall preserves the wildcard SAN from the CSR. The extracted key is wiped immediately after signing. The signed cert is SCP'd back to the control node rather than parsed from stdout. Changed openshift_post_install_enable_idm_ingress_certs default to true. | cd64473 |
| Day-2 | NHC and FAR operators sourced from community catalog instead of Red Hat | The node-healthcheck-operator and fence-agents-remediation subscriptions pointed at cc-redhat-operator-index-v4-20 (community-operator-index). Both operators are available in redhat-operator-index with Red Hat support and security updates. Using the community versions meant no Red Hat support coverage. | Moved both operators from community-operator-index to redhat-operator-index in the mirror content list, updated the role defaults to use cs-redhat-operator-index-v4-20, and updated the manual process and orchestration guide. | bc886a3 |
| Day-2 | Infra workloads (ingress, monitoring) unable to schedule on ODF-tainted infra nodes | ODF adds node.ocs.openshift.io/storage: NoSchedule to infra nodes when StorageCluster is deployed. The infra conversion role patched IngressController, cluster-monitoring-config, and imageregistry to use infra node selectors but did not add tolerations for the OCS taint. New router pods, prometheus, alertmanager, kube-state-metrics, and metrics-server all stayed Pending indefinitely, causing ingress, monitoring, and console COs to go degraded. | Added node.ocs.openshift.io/storage: NoSchedule toleration to IngressController nodePlacement, all 7 cluster-monitoring-config components, and imageregistry. | e5ee4bc |
| Day-2 | IdM cert pivot at phase 11 caused extended CO degradation | The IdM ingress cert was applied after all operators were deployed (phase 11). Every CO that had already established trust with the default ingress-operator self-signed cert had to re-adapt mid-flight. Console CO health check failed for 28 minutes. The role's convergence waits only checked Available=True, missing COs that were Available=True but still Degraded=True. | Moved IdM cert pivot to phase 3 (after infra_conversion, before LDAP). Replaced weak Available-only waits with full convergence gates: oc rollout status for router deployment, three-condition checks (Available/Degraded/Progressing) for ingress/auth/console COs, and an openssl s_client TLS assertion that retries until the IdM cert is actually being served and verified. | 44a51e8 |
| Day-2 | OSD prepare jobs rejected disks due to stale Ceph bluestore backup labels | Fresh cluster rebuild failed to provision OSDs on all three infra nodes: ceph-volume raw list returned osd.X: UUID belonging to a different ceph cluster "53c3cf43-..." on each infra device. The first 100 MB and last 100 MB of each NVMe OSD disk had been zeroed (as is standard) but ceph-volume raw list (Ceph Reef / ODF 4.20) still detected the old metadata. Scanning the raw NVMe devices (/dev/nvme11n1, nvme14n1, nvme7n1) on the hypervisor revealed that Ceph Reef stores bluestore label backup copies at 1 GiB and 8 GiB from the start of the device in addition to offset 0 and the tail. These positions are not covered by the conventional 100 MB head+tail wipe. The QEMU disk uses cache='none' io='native', so there is no host-level caching to blame; the data truly persisted on the EBS-backed NVMe volumes. | Added a Wipe Ceph bluestore label positions on OSD backing devices task to openshift_post_install_odf that runs unconditionally at the start of the role (before any OLM work). The task iterates over openshift_cluster_nodes infra entries with odf-data additional disks and zeros 16 MB at offsets 0, 1 GiB, 8 GiB, and 100 GiB (configurable via openshift_post_install_odf_bluestore_wipe_offsets_mb). Safe on fresh EBS volumes; idempotent on reused ones. | 007c920 |
| Cluster | Cluster VM virt-install rendered an invalid agent ISO argument on rebuild | The cluster VM template emitted --disk path=/var/lib/libvirt/images/agent.x86_64.iso,device=cdrom,bus=scsi,target.dev=sdc, which virt-install 5.0.0 rejected as unrecognized arguments. The same template also collapsed adjacent --disk lines during Jinja rendering, so the agent ISO spec and the preceding disk spec could merge into one malformed argument. | Removed the invalid target.dev from the agent ISO attachment and forced explicit line breaks between adjacent --disk entries in virt-install-command.j2. | 37cce56 |
| Cluster | Hardcoded root-disk hint pointed OpenShift to the wrong install disk | The generated agent-config.yaml hardcoded rootDeviceHints.hctl: "0:0:0:3", but the current cluster VMs exposed the install disk at a different guest-visible location. Bootstrap failed on master nodes with Requested installation disk is not part of the host's valid disks. | Added unique libvirt serials for the cluster root disks and rendered rootDeviceHints.serialNumber per node instead of a brittle hctl hint. | 09b7191 |
| Support VM | Reruns recreated support VMs instead of preserving them | Restarting the main orchestration from the top kept reseeding idm-01, bastion-01, and mirror-registry, turning every retry into an expensive support-stack rebuild instead of a resumed rollout. | Switched support-VM defaults to preserve existing disks and documented the supported resume path that skips the mirror-registry rebuild when the support stack is already healthy. | 20d4031 |
| Day-2 | ODF reruns destroyed healthy clusters and re-wiped disks on ordinary reruns | The post-install ODF role still treated every rerun like a fresh build, so a healthy StorageCluster could be torn down and ODF backing disks could be wiped again during a normal day-2 replay. | Added health-first guards across the ODF path, made destructive recovery force-only, and codified the host-side rook/ceph cleanup plus the revised BlueStore wipe positions. | 33363da |
| Day-2 | Shared day-2 health probes assumed role-local defaults that were not loaded in the consolidated play | The centralized rerun-safety probe file referenced NMState, ODF, AAP, and other role-local variables directly. When the shared day-2 play called those probes outside the role context, Ansible failed on undefined variables before any real work started. | Moved the probe inputs onto explicit probe-local defaults and audited the shared day-2 health checks so they no longer depend on role defaults being preloaded. | 5604060 |
| Day-2 | AAP LDAP authenticator create/read-after-write path failed open and then timed out | The AAP LDAP authenticator POST could succeed without surfacing HTTP failures, and the follow-up lookup relied on delayed list consistency. On reruns the authenticator was created, but the lookup never found it in time, so the day-2 flow stopped after LDAP had already been configured. | Hardened the authenticator POST to fail closed with curl --fail-with-body, captured the creation response body, and seeded the authenticator ID directly from the create response while also hardening the adjacent authenticator-map call. | 6c2712b |
| Day-2 | AAP clean-build auth path was still centered on direct LDAP and diverged from the cluster SSO model | AAP had a separate direct-LDAP path, late gateway API failures, and no clean proof that AD-backed users could log in through the same Keycloak/IdM chain used by OpenShift. | Refactored AAP onto Keycloak OIDC, added the Keycloak client plus groups and aap audience protocol mappers, managed the gateway authenticator and superuser map, removed the legacy LDAP object path, and validated AD-backed login on both a repaired run and a clean AAP redeploy. | e520733 |
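The idempotent CD-ROM eject recorded above (95d7420) can be sketched as an Ansible task. This is a minimal illustration, not the role's actual code: the VM-name variable, device name, and registered-variable names are assumptions.

```yaml
# Tolerate a CD-ROM that was already detached on a prior run.
# mirror_registry_vm_name and the device target are hypothetical placeholders.
- name: Eject cloud-init CD-ROM from the mirror registry guest (idempotent)
  ansible.builtin.command: >
    virsh change-media {{ mirror_registry_vm_name }} sda --eject --config
  register: cdrom_eject
  changed_when: cdrom_eject.rc == 0
  failed_when:
    - cdrom_eject.rc != 0
    - "'No disk found' not in cdrom_eject.stderr"
    - "'did not find' not in cdrom_eject.stderr"
```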
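The loop-plus-retries pitfall behind the 07e6cab fix is general Ansible behavior: a registered variable only gains `.results` after the whole loop finishes, so an `until` clause must evaluate the per-item result. A sketch under assumed variable names:

```yaml
# During per-item retries the registered var holds ONLY that item's result,
# so referencing plugin_check.results fails with a template error.
- name: Wait for ODF console plugin resources to exist
  ansible.builtin.command: oc get consoleplugin {{ item }}
  loop: "{{ odf_console_plugins | default(['odf-console']) }}"
  register: plugin_check
  until: plugin_check.rc == 0   # per-item result; correct in a retry context
  retries: 30
  delay: 10
  changed_when: false
```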
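The Progressing=False gate added in 0612a82 can be illustrated as follows; the task name and registered variable are assumptions, but the 60 × 10 s cadence matches the ledger entry:

```yaml
# Available=True can be served by old OAuth pods; wait for the rollout to settle.
- name: Wait for authentication CO to finish rolling out OAuth pods
  ansible.builtin.command: >
    oc get clusteroperator authentication
    -o jsonpath='{.status.conditions[?(@.type=="Progressing")].status}'
  register: auth_progressing
  until: auth_progressing.stdout == "False"
  retries: 60
  delay: 10
  changed_when: false
```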
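The AAP curl argv bug (007c920) comes from how `argv` passes each element verbatim: in `[-sku, -k, admin:<password>]`, the `-u` inside `-sku` consumes the next element (`-k`) as its credentials. A hedged sketch of the corrected task; the variable names, payload, and gateway API path are illustrative assumptions:

```yaml
# -sku combines -s -k -u, so the credentials must be the very next element.
- name: Create AAP LDAP authenticator when absent
  ansible.builtin.command:
    argv:
      - curl
      - -sku
      - "admin:{{ aap_admin_password }}"   # directly follows -sku; no stray -k
      - --fail-with-body                   # fail closed on HTTP errors
      - -X
      - POST
      - -H
      - "Content-Type: application/json"
      - -d
      - "{{ aap_authenticator_body | to_json }}"
      - "https://{{ aap_gateway_host }}/api/gateway/v1/authenticators/"
```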
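The heredoc workaround from 4419acb (with the OCS-taint toleration later added in e5ee4bc) can be sketched like this; the component list is abbreviated to prometheusK8s for brevity, and the exact role wording is an assumption:

```yaml
# ansible.builtin.shell with a heredoc lets oc read the manifest from stdin.
- name: Apply cluster-monitoring-config with an infra nodeSelector
  ansible.builtin.shell: |
    oc apply -f - <<'EOF'
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |
        prometheusK8s:
          nodeSelector:
            node-role.kubernetes.io/infra: ""
          tolerations:
            - key: node.ocs.openshift.io/storage
              value: "true"
              effect: NoSchedule
    EOF
```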
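The FlowCollector boolean-rendering fix can be illustrated with a template fragment; the Jinja variable name is assumed, not taken from the actual flowcollector.yaml.j2:

```yaml
# Broken: a folded block scalar always renders the value as a string.
#   openShiftAutoDetect: >-
#     {{ netobserv_auto_detect }}
# Fixed: an inline expression through the bool filter yields a YAML boolean.
openShiftAutoDetect: {{ netobserv_auto_detect | bool }}
```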
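The bluestore-label wipe from 007c920 can be sketched as a single dd-based task. Offsets mirror the ledger entry (0, 1 GiB, 8 GiB, 100 GiB, expressed in MB so they line up with bs=1M); the variable names and device list are illustrative, not the role's actual defaults:

```yaml
# Zero 16 MB at each Reef bluestore label backup position on every backing disk.
- name: Wipe Ceph bluestore label positions on OSD backing devices
  ansible.builtin.command: >
    dd if=/dev/zero of={{ item.1 }} bs=1M count=16
    seek={{ item.0 }} conv=notrunc,fsync
  loop: "{{ wipe_offsets_mb | product(odf_backing_devices) | list }}"
  vars:
    wipe_offsets_mb: [0, 1024, 8192, 102400]   # 0, 1 GiB, 8 GiB, 100 GiB
    odf_backing_devices: ['/dev/nvme7n1']      # per-node device list in practice
  become: true
```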
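The core of the direct-signing path from cd64473 can be sketched as a shell task. File paths are illustrative; note that `-copy_extensions` on `openssl x509 -req` requires OpenSSL 3.0+, and the CA key extraction (pk12util → openssl pkcs12) is assumed to have happened earlier:

```yaml
# Sign the wildcard CSR with the extracted IdM CA key, preserving the SAN,
# then wipe the key. Paths and validity period are hypothetical.
- name: Sign wildcard ingress CSR directly with the IdM CA key
  ansible.builtin.shell: |
    openssl x509 -req -in /root/wildcard-apps.csr \
      -CA /root/idm-ca.crt -CAkey /root/idm-ca.key -CAcreateserial \
      -days 730 -copy_extensions copyall -out /root/wildcard-apps.crt
    shred -u /root/idm-ca.key   # wipe the extracted signing key immediately
```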
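The serial-based root-disk hint from 09b7191 amounts to a small agent-config.yaml fragment per host; the serial value shown here is an assumed example of what the template renders:

```yaml
# agent-config.yaml per-host fragment: match the install disk by the
# libvirt-assigned serial (set via virt-install --disk ...,serial=...)
# instead of a brittle hctl position hint.
rootDeviceHints:
  serialNumber: "ocp-cp-01-root"   # illustrative serial, one per node
```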