Issues Ledger
Executive summary:
- all issues listed below are closed
- the cluster build plus current day-2/auth path have been validated on a live environment
- recent fresh-path rebuilds have continued to expose additional orchestration defects, so the final zero-intervention end-to-end playbooks/site-lab.yml certification run is still pending
- this page is now a historical record of fixes, not an open-triage queue
This ledger records problems that showed up during real rebuild and validation work and were fixed in Git.
How To Use This Page
Use this as the closed-case companion to INVESTIGATING.
When something breaks:
- check whether it is still an open investigation
- check whether it already failed before in a similar phase
- jump to the fixing commit if the symptom matches
Phase Legend
- Bootstrap: outer host preparation and first-hop orchestration
- IdM: identity, DNS, CA, and FreeIPA work
- Bastion: bastion execution model and staging behavior
- Support VM: shared VM lifecycle behavior outside the cluster nodes
- Mirror: disconnected content, Quay, and mirror workflow behavior
- Cluster: day-1 install and nested-cluster shell behavior
- Day-2: post-install platform and operator configuration
- Docs: documentation synchronization and operator guidance
Issues Table
| Phase | Issue | Symptom / Impact | Fix Summary | Commit |
|---|---|---|---|---|
| Bootstrap | Uncommitted rebuild drift and wrong hypervisor target state | Rebuild confidence was low because orchestration changes were not committed, inventory pointed at the wrong virt-01 address, and OVS trunking did not match the intended VLAN set. | Committed the pending orchestration work, updated inventory to the live host, and aligned OpenShift trunk ports with the intended VLAN model. | fe08d77 |
| Bootstrap | Fresh-host bootstrap and bastion handoff fragility | Fresh virt-01 rebuilds exposed RHSM, bastion staging, and handoff problems that made the first reflection checkpoint brittle. | Hardened host bootstrap, fresh-host registration behavior, /dev/ebs handling, and bastion staging/handoff. | cbfa279 |
| Bastion | Bastion-local execution ambiguity | Lab runs looked like they were still using the workstation as the control node, which obscured where failures were actually happening. | Documented and normalized the bastion-local execution model for site-lab.yml. | 91f9e5f |
| IdM | Immediate IdM password expiry for seeded users | IdM users such as sysop could authenticate inconsistently because their passwords expired immediately after provisioning. | Added a group-scoped IdM password-policy model and documented the non-retroactive behavior. | 5bcd4e1 |
| IdM | Missing passwordless sudo policy for IdM admins | The intended admins group privilege model was not present, so lab admins still needed authentication or lacked a consistent host-wide rule. | Added the IdM admins sudo policy and enrolled sysop accordingly. | c321b69 |
| IdM | virt-01 not resolvable on the management SVI from IdM | Bastion-side SOCKS/Cockpit workflows could not reliably resolve virt-01.workshop.lan to the management VLAN interface. | Added base-domain IdM DNS records for virt-01.workshop.lan -> 172.16.0.1. | 2c13974 |
| IdM | IdM DNS policy too narrow for lab networks | Recursive/query behavior from all OVS-routed lab CIDRs was not explicitly allowed. | Extended IdM named policy to allow lab-network query, recursion, and cache access. | 8a4f8a8 |
| Bastion | Bastion missing Cockpit non-virtualization modules and PCP services | Bastion Cockpit lacked the expected files/packagekit/podman/session-recording/image-builder modules and PCP-backed metrics. | Expanded bastion guest bootstrap to install the Cockpit add-ons and PCP services. | ed43ff4 |
| Support VM | Support guests retained cloud-init media too long | Support VMs could carry cidata longer than intended, leaving sensitive bootstrap data attached until later cleanup. | Moved support-guest CD-ROM cleanup before the first update-triggered reboot. | 6693358 |
| Bastion | Bastion staging drift and day-2 NMState fragility | Tar/unarchive staging left the bastion with stale project content across reruns, and the NMState uplink assumption was not stable. | Switched bastion staging to rsync, preserved bastion-side generated content, and hardened the NMState/VLAN flow. | 7eb86ae |
| Cluster | OpenShift install media remained attached after provisioning | Cluster nodes retained agent.x86_64.iso, creating a reboot timebomb that could re-enter day-1 installer mode. | Moved OpenShift install-media cleanup earlier and fixed the validation/cleanup path so agent ISOs are detached and boot order returns to disk. | df873d8 |
| Mirror | Disconnected OperatorHub pivot failed from the bastion | OperatorHub pivot orchestration still assumed the wrong execution context and mirrored-artifact pathing. | Fixed the disconnected OperatorHub role for bastion execution and mirrored-catalog consumption. | 1b5e610 |
| Mirror | Disconnected mirror workflow defaulted to direct mirror-to-registry | The default content path was still geared toward partially disconnected validation instead of strict m2d/d2m. | Changed the documented and orchestrated default to the portable workflow and recorded sizing guidance. | f73d231 |
| Mirror | No practical visibility into long-running oc-mirror phases | Operators could not easily tell whether m2d/d2m was active, stalled, or complete. | Added the mirror progress helper, tmux dashboard, and guest-side oc-mirror log paths. | cc04b7c |
| Mirror | Quay payload footprint was being judged from the wrong path | /opt/quay-install looked tiny even after import because the real content lives in Podman volumes, which obscured sizing conclusions. | Documented the real Quay payload location and the observed imported footprint. | d048e0a |
| IdM | Hand-rolled IdM server, KRA, and identity-management tasks were brittle on rebuild | Fresh idm-01 rebuilds exposed KRA hangs, DNS-forwarder validation surprises, and weak idempotence around user/group/policy/sudo management. | Replaced the server/KRA install path with the FreeIPA server role and moved users, groups, password policies, and sudo rules onto FreeIPA modules. | 18c1733 |
| Bootstrap | Fresh-host login ergonomics and disconnected day-2 defaults were incomplete | Fresh AWS hypervisors could leave ec2-user locked for Cockpit, RHEL guests were not consistently enrolled into Insights, IdM client hosts were missing mkhomedir/SSSD sudo integration, bastion users did not land with usable OpenShift tools or kubeconfig wiring, and the post-install baseline still left the disconnected OperatorHub pivot off by default. | Unlocked ec2-user in first-boot and bootstrap, added Insights registration to all RHEL hosts after RHSM, enabled oddjobd plus authselect with-mkhomedir with-sudo on IdM client hosts, added the bastion helper shell environment and home-directory link layout, and made the disconnected OperatorHub pivot part of the default post-install baseline. | ef28ebf |
| Mirror | Portable mirror workflow did not run d2m after m2d | Running mirror_mode: portable pulled content to the disk archive (m2d) but never pushed it into the local Quay registry (d2m). A second manual invocation with mirror_mode: import was required, but that second pass was not orchestrated. | Added a d2m import phase to the mirror_registry_guest role that runs automatically after m2d when mirror_mode == 'portable'. | 95d7420 |
| Mirror | Mirror registry CD-ROM eject failed on re-runs | Re-running the mirror-registry playbook after the initial build failed because virsh change-media --eject returned an error when the cloud-init CD-ROM device had already been detached on a prior run. | Made the CD-ROM eject and XML removal tasks idempotent by tolerating "No disk found" / "did not find" errors. | 95d7420 |
| Mirror | MetalLB operator missing from mirrored catalog | The MetalLB operator was not included in the mirrored operator set, so it could not be installed from the disconnected OperatorHub. | Added metallb-operator (stable channel) to vars/global/mirror_content.yml. | 95d7420 |
| Day-2 | LDAP smoke test used wrong TLS certificate authority | The oc login in the post-LDAP smoke test passed the IdM CA as --certificate-authority, but the OCP API server cert is issued by the cluster's own CA, not IdM. This caused every background run to fail immediately with "unknown authority". | Replaced --certificate-authority=<idm-ca> with --insecure-skip-tls-verify for the smoke test login. | 2bb11d7 |
| Day-2 | Infra node conversion role only labeled nodes | The openshift_post_install_infra role applied the node-role.kubernetes.io/infra label but did not move platform workloads: ingress, monitoring, and registry remained on worker nodes. | Added IngressController nodePlacement patch, cluster-monitoring-config ConfigMap, and imageregistry nodeSelector patch. Added convergence waits for ingress pods Running and monitoring CO Progressing=False. Fixed ansible.builtin.command stdin limitation by switching the ConfigMap apply to ansible.builtin.shell with a heredoc. | 4419acb |
| Day-2 | LDAP smoke test did not wait for OAuth pod convergence | The smoke test oc login could fire while the authentication cluster operator was still rolling out new OAuth pods with the LDAP IDP config. The CO's Available=True gate (present since role creation) is not sufficient; the CO can be Available via old pods while new pods are still loading. | Added a Progressing=False wait (60 retries × 10 s) immediately before the smoke test so the login only runs once the OAuth pods have fully reloaded the IDP configuration. | 0612a82 |
| Docs | manual-process.md out of sync with automation codebase | Stale project paths, wrong tool symlink paths, duplicate section numbering, missing workload movement in LDAP/infra section, upstream catalog source names instead of disconnected ones, OperatorHub pivot still marked as optional, missing metallb-operator in mirror content. | Comprehensive rewrite: fixed paths, renumbered sections 23-36, added infra workload movement + LDAP smoke test + OAuth convergence wait to section 24, updated all subscription sources to disconnected catalog names, changed OperatorHub pivot to default baseline, added metallb-operator and push_and_run.sh subsection. | 564b052 |
| Day-2 | Nine day-2 roles used the wrong playbook_dir path for oc and kubeconfig | Roles openshift_post_install_virtualization, web_terminal, pipelines, netobserv, aap, idm_certs, idm_certs_cleanup, openshift_windows_server_build, and lab_suspend all derived the local oc binary and kubeconfig paths using {{ playbook_dir }}/generated/.... Since playbook_dir resolves to playbooks/day2/, this produced a non-existent path and every role failed its pre-flight assertion on the first run. | Replaced all nine occurrences with {{ lab_execution_tools_root }} and {{ lab_execution_ocp_generated_root }}, matching the pattern used by the NMState and infra roles. | 07e6cab |
| Day-2 | ODF console plugin wait referenced odf-client-console which was removed in 4.20 | The openshift_post_install_odf_console_plugins default included odf-client-console, but that plugin was dropped from ODF 4.20. The wait task timed out after 5 minutes waiting for a resource that will never appear. | Removed odf-client-console from the defaults list, leaving only odf-console. | 75ed2c2 |
| Day-2 | ODF console plugin wait used .results inside a loop + retries context | The until clause on "Wait for ODF console plugin resources to exist" referenced register_var.results, but .results is only populated after all loop iterations complete; it does not exist during per-item retry evaluation. This caused an Ansible template evaluation error on every retry. | Changed until to rc == 0. | 07e6cab |
| Day-2 | OLM blocked by duplicate OperatorGroup in openshift-local-storage | A second OperatorGroup named openshift-local-storage-9kzfh existed in the namespace alongside the one created by the role. OLM refuses to process subscriptions when MultipleOperatorGroupsFound, so no InstallPlan was ever created and the LSO CSV wait timed out. | Added a cleanup task in openshift_post_install_odf that deletes any OperatorGroups in openshift-local-storage whose name differs from the role-managed one, running before the subscription is applied. | 007c920 |
| Day-2 | Day-2 execution order: ODF must precede Virtualization and NetObserv | OpenShift Virtualization annotates ocs-storagecluster-ceph-rbd-virtualization as the default virt storage class; NetObserv requires a NooBaa S3 bucket for Loki. Both operations fail if ODF has not been deployed. The original openshift-post-install.yml task order put ODF after Virtualization, Web Terminal, and Pipelines. | Moved ODF to run immediately after NMState in openshift-post-install.yml. Updated manual-process.md to match: ODF is now section 26, Virtualization 27, Web Terminal 28. | 446c456 |
| Day-2 | Disconnected OperatorHub pivot applied only the Red Hat catalog; community catalog never created | The pivot role rendered and applied a single catalogsource.yaml.j2 hardcoded to cs-redhat-operator-index-v4-20. The community-operator-index catalog (cc-redhat-operator-index-v4-20) was never applied, so NHC and FAR subscriptions stalled with targeted catalogsource openshift-marketplace/cc-redhat-operator-index-v4-20 missing. The community catalog image was present in the mirror registry; only the CatalogSource object was missing. | Extended the pivot role to drive all per-catalog operations (render, apply, attach pull secret, SA patch, delete pod, wait READY) via a loop over openshift_disconnected_operatorhub_catalog_configs, which now includes both the Red Hat and community catalog entries. Applied the community CatalogSource manually to unblock the current run. | 369d8d8 |
| Cluster | Agent ISO cleanup regressed: all 9 cluster VMs retained install media after provisioning | Commit d82defe refactored site-lab.yml to separate install-wait from day-2, but dropped the validation/cleanup import that df873d8 had placed between them. The cleanup ended up at the end of openshift-post-install.yml (day-2 validation role), which only runs after all operators complete. All 9 cluster VMs kept agent.x86_64.iso on sdc with cdrom first in boot order. Rebooting ocp-infra-01 for a vCPU resize caused it to re-enter the day-1 agent installer instead of booting from disk. | Restored the cleanup by adding import_playbook: maintenance/detach-install-media.yml to site-lab.yml immediately after openshift-install-wait.yml. Added media detach as the first task in openshift-post-install.yml as a safety net. Manually ejected media and fixed boot order on all 9 VMs from the hypervisor. | 007c920 |
| Day-2 | Infra nodes undersized at 12 vCPU for full day-2 workload | With ODF (Ceph MDS, MON, OSD, MGR, RGW), ingress controller, monitoring stack, image registry, and LokiStack all targeting infra nodes, 12 vCPU per infra node (11500m allocatable) was fully saturated at 95-96% CPU requests. LokiStack compactor and ingester pods could not schedule. | Increased infra node vCPU allocation from 12 to 16 in vars/guests/openshift_cluster_vm.yml. Rolling resize of live VMs via virt-xml --edit --vcpus 16 + graceful reboot. | 007c920 |
| Day-2 | FlowCollector openShiftAutoDetect rendered as YAML string instead of boolean | The flowcollector.yaml.j2 template used a >- YAML folded block scalar for the openShiftAutoDetect field. The Jinja2-rendered value (true) was parsed by the YAML loader as a string, causing spec.processor.subnetLabels.openShiftAutoDetect: Invalid value: "string": ... must be of type boolean. | Replaced the multi-line >- block scalar with an inline expression passed through the bool filter so the field renders as a YAML boolean. | |
| Day-2 | AAP LDAP authenticator curl hung with wrong -u value | The Create AAP LDAP authenticator when absent task used argv: [curl, -sku, -k, admin:<password>, ...]. In combined short flags -sku, the -u flag consumed the next arg (-k) as the username:password, while the real credentials became a positional URL argument. Curl hung trying to resolve admin:<password> as a URL scheme. The shell-based curl calls (lookup tasks) worked fine because curl -sku admin:<password> -k ... is parsed correctly in bash. | Removed the redundant -k from the argv (already included in -sku), so the next arg after -sku is the credentials. | 007c920 |
| Day-2 | AAP controller-web race condition: deployment NotFound immediately after CR existence wait | The Wait for AAP controller deployment to exist task polled for the automationcontroller CR becoming available, then immediately ran oc rollout status deployment/workshop-aap-controller-web. The deployment object had not yet been created by the operator, causing NotFound on every attempt. The race hit twice across two playbook runs. | Added an explicit Wait for AAP controller-web deployment object to exist task (90 retries × 10 s) between the CR wait and the rollout status call, matching the pattern already used for the gateway deployment. | 007c920 |
| Day-2 | IdM ingress wildcard cert could not be issued via ipa cert-request | ipa cert-request validates every CSR subject CN and DNS SAN against Kerberos principals. *.apps.ocp.workshop.lan can never be a valid Kerberos principal, so all wildcard cert requests are unconditionally rejected. A preceding issue also caused ipa service-add to silently fail because the apps.ocp.workshop.lan host object did not exist in IdM (it's a VIP, not a registered client); service-add --force alone does not skip the host existence check. | Rewrote request-idm-certs.sh.j2 to bypass ipa cert-request entirely. IPA housekeeping (DNS A record, ipa host-add --force, ipa service-add, cert profile import) is still done via the IPA CLI. The CSR is then signed directly with OpenSSL using the IdM CA signing key extracted from the Dogtag NSS database (pk12util → openssl pkcs12). -copy_extensions copyall preserves the wildcard SAN from the CSR. The extracted key is wiped immediately after signing. The signed cert is SCP'd back to the control node rather than parsed from stdout. Changed openshift_post_install_enable_idm_ingress_certs default to true. | cd64473 |
| Day-2 | NHC and FAR operators sourced from community catalog instead of Red Hat | The node-healthcheck-operator and fence-agents-remediation subscriptions pointed at cc-redhat-operator-index-v4-20 (community-operator-index). Both operators are available in redhat-operator-index with Red Hat support and security updates. Using the community versions meant no Red Hat support coverage. | Moved both operators from community-operator-index to redhat-operator-index in the mirror content list, updated the role defaults to use cs-redhat-operator-index-v4-20, and updated the manual process and orchestration guide. | bc886a3 |
| Day-2 | Infra workloads (ingress, monitoring) unable to schedule on ODF-tainted infra nodes | ODF adds node.ocs.openshift.io/storage: NoSchedule to infra nodes when StorageCluster is deployed. The infra conversion role patched IngressController, cluster-monitoring-config, and imageregistry to use infra node selectors but did not add tolerations for the OCS taint. New router pods, prometheus, alertmanager, kube-state-metrics, and metrics-server all stayed Pending indefinitely, causing ingress, monitoring, and console COs to go degraded. | Added node.ocs.openshift.io/storage: NoSchedule toleration to IngressController nodePlacement, all 7 cluster-monitoring-config components, and imageregistry. | e5ee4bc |
| Day-2 | IdM cert pivot at phase 11 caused extended CO degradation | The IdM ingress cert was applied after all operators were deployed (phase 11). Every CO that had already established trust with the default ingress-operator self-signed cert had to re-adapt mid-flight. Console CO health check failed for 28 minutes. The role's convergence waits only checked Available=True, missing COs that were Available=True but still Degraded=True. | Moved IdM cert pivot to phase 3 (after infra_conversion, before LDAP). Replaced weak Available-only waits with full convergence gates: oc rollout status for router deployment, three-condition checks (Available/Degraded/Progressing) for ingress/auth/console COs, and an openssl s_client TLS assertion that retries until the IdM cert is actually being served and verified. | 44a51e8 |
| Day-2 | OSD prepare jobs rejected disks due to stale Ceph bluestore backup labels | Fresh cluster rebuild failed to provision OSDs on all three infra nodes: ceph-volume raw list returned osd.X: UUID belonging to a different ceph cluster "53c3cf43-..." on each infra device. The first 100 MB and last 100 MB of each NVMe OSD disk had been zeroed (as is standard) but ceph-volume raw list (Ceph Reef / ODF 4.20) still detected the old metadata. Scanning the raw NVMe devices (/dev/nvme11n1, nvme14n1, nvme7n1) on the hypervisor revealed that Ceph Reef stores bluestore label backup copies at 1 GiB and 8 GiB from the start of the device in addition to offset 0 and the tail. These positions are not covered by the conventional 100 MB head+tail wipe. The QEMU disk uses cache='none' io='native', so there is no host-level caching to blame; the data truly persisted on the EBS-backed NVMe volumes. | Added a Wipe Ceph bluestore label positions on OSD backing devices task to openshift_post_install_odf that runs unconditionally at the start of the role (before any OLM work). The task iterates over openshift_cluster_nodes infra entries with odf-data additional disks and zeros 16 MB at offsets 0, 1 GiB, 8 GiB, and 100 GiB (configurable via openshift_post_install_odf_bluestore_wipe_offsets_mb). Safe on fresh EBS volumes; idempotent on reused ones. | 007c920 |
| Cluster | Cluster VM virt-install rendered an invalid agent ISO argument on rebuild | The cluster VM template emitted --disk path=/var/lib/libvirt/images/agent.x86_64.iso,device=cdrom,bus=scsi,target.dev=sdc, which virt-install 5.0.0 rejected as unrecognized arguments. The same template also collapsed adjacent --disk lines during Jinja rendering, so the agent ISO spec and the preceding disk spec could merge into one malformed argument. | Removed the invalid target.dev from the agent ISO attachment and forced explicit line breaks between adjacent --disk entries in virt-install-command.j2. | 37cce56 |
| Cluster | Hardcoded root-disk hint pointed OpenShift to the wrong install disk | The generated agent-config.yaml hardcoded rootDeviceHints.hctl: "0:0:0:3", but the current cluster VMs exposed the install disk at a different guest-visible location. Bootstrap failed on master nodes with Requested installation disk is not part of the host's valid disks. | Added unique libvirt serials for the cluster root disks and rendered rootDeviceHints.serialNumber per node instead of a brittle hctl hint. | 09b7191 |
| Support VM | Reruns recreated support VMs instead of preserving them | Restarting the main orchestration from the top kept reseeding idm-01, bastion-01, and mirror-registry, turning every retry into an expensive support-stack rebuild instead of a resumed rollout. | Switched support-VM defaults to preserve existing disks and documented the supported resume path that skips the mirror-registry rebuild when the support stack is already healthy. | 20d4031 |
| Day-2 | ODF reruns destroyed healthy clusters and re-wiped disks on ordinary reruns | The post-install ODF role still treated every rerun like a fresh build, so a healthy StorageCluster could be torn down and ODF backing disks could be wiped again during a normal day-2 replay. | Added health-first guards across the ODF path, made destructive recovery force-only, and codified the host-side rook/ceph cleanup plus the revised BlueStore wipe positions. | 33363da |
| Day-2 | Shared day-2 health probes assumed role-local defaults that were not loaded in the consolidated play | The centralized rerun-safety probe file referenced NMState, ODF, AAP, and other role-local variables directly. When the shared day-2 play called those probes outside the role context, Ansible failed on undefined variables before any real work started. | Moved the probe inputs onto explicit probe-local defaults and audited the shared day-2 health checks so they no longer depend on role defaults being preloaded. | 5604060 |
| Day-2 | AAP LDAP authenticator create/read-after-write path failed open and then timed out | The AAP LDAP authenticator POST could succeed without surfacing HTTP failures, and the follow-up lookup relied on delayed list consistency. On reruns the authenticator was created, but the lookup never found it in time, so the day-2 flow stopped after LDAP had already been configured. | Hardened the authenticator POST to fail closed with curl --fail-with-body, captured the creation response body, and seeded the authenticator ID directly from the create response while also hardening the adjacent authenticator-map call. | 6c2712b |
| Day-2 | AAP clean-build auth path was still centered on direct LDAP and diverged from the cluster SSO model | AAP had a separate direct-LDAP path, late gateway API failures, and no clean proof that AD-backed users could log in through the same Keycloak/IdM chain used by OpenShift. | Refactored AAP onto Keycloak OIDC, added the Keycloak client plus groups and aap audience protocol mappers, managed the gateway authenticator and superuser map, removed the legacy LDAP object path, and validated AD-backed login on both a repaired run and a clean AAP redeploy. | e520733 |
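The idempotent CD-ROM eject recorded above (95d7420) can be sketched as an Ansible task. This is a minimal illustration, not the role's actual code: the VM-name variable, device name, and registered-variable names are assumptions.

```yaml
# Tolerate a CD-ROM that was already detached on a prior run.
# mirror_registry_vm_name and the device target are hypothetical placeholders.
- name: Eject cloud-init CD-ROM from the mirror registry guest (idempotent)
  ansible.builtin.command: >
    virsh change-media {{ mirror_registry_vm_name }} sda --eject --config
  register: cdrom_eject
  changed_when: cdrom_eject.rc == 0
  failed_when:
    - cdrom_eject.rc != 0
    - "'No disk found' not in cdrom_eject.stderr"
    - "'did not find' not in cdrom_eject.stderr"
```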
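The loop-plus-retries pitfall behind the 07e6cab fix is general Ansible behavior: a registered variable only gains `.results` after the whole loop finishes, so an `until` clause must evaluate the per-item result. A sketch under assumed variable names:

```yaml
# During per-item retries the registered var holds ONLY that item's result,
# so referencing plugin_check.results fails with a template error.
- name: Wait for ODF console plugin resources to exist
  ansible.builtin.command: oc get consoleplugin {{ item }}
  loop: "{{ odf_console_plugins | default(['odf-console']) }}"
  register: plugin_check
  until: plugin_check.rc == 0   # per-item result; correct in a retry context
  retries: 30
  delay: 10
  changed_when: false
```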
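The Progressing=False gate added in 0612a82 can be illustrated as follows; the task name and registered variable are assumptions, but the 60 × 10 s cadence matches the ledger entry:

```yaml
# Available=True can be served by old OAuth pods; wait for the rollout to settle.
- name: Wait for authentication CO to finish rolling out OAuth pods
  ansible.builtin.command: >
    oc get clusteroperator authentication
    -o jsonpath='{.status.conditions[?(@.type=="Progressing")].status}'
  register: auth_progressing
  until: auth_progressing.stdout == "False"
  retries: 60
  delay: 10
  changed_when: false
```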
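The AAP curl argv bug (007c920) comes from how `argv` passes each element verbatim: in `[-sku, -k, admin:<password>]`, the `-u` inside `-sku` consumes the next element (`-k`) as its credentials. A hedged sketch of the corrected task; the variable names, payload, and gateway API path are illustrative assumptions:

```yaml
# -sku combines -s -k -u, so the credentials must be the very next element.
- name: Create AAP LDAP authenticator when absent
  ansible.builtin.command:
    argv:
      - curl
      - -sku
      - "admin:{{ aap_admin_password }}"   # directly follows -sku; no stray -k
      - --fail-with-body                   # fail closed on HTTP errors
      - -X
      - POST
      - -H
      - "Content-Type: application/json"
      - -d
      - "{{ aap_authenticator_body | to_json }}"
      - "https://{{ aap_gateway_host }}/api/gateway/v1/authenticators/"
```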
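The heredoc workaround from 4419acb (with the OCS-taint toleration later added in e5ee4bc) can be sketched like this; the component list is abbreviated to prometheusK8s for brevity, and the exact role wording is an assumption:

```yaml
# ansible.builtin.shell with a heredoc lets oc read the manifest from stdin.
- name: Apply cluster-monitoring-config with an infra nodeSelector
  ansible.builtin.shell: |
    oc apply -f - <<'EOF'
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |
        prometheusK8s:
          nodeSelector:
            node-role.kubernetes.io/infra: ""
          tolerations:
            - key: node.ocs.openshift.io/storage
              value: "true"
              effect: NoSchedule
    EOF
```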
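The FlowCollector boolean-rendering fix can be illustrated with a template fragment; the Jinja variable name is assumed, not taken from the actual flowcollector.yaml.j2:

```yaml
# Broken: a folded block scalar always renders the value as a string.
#   openShiftAutoDetect: >-
#     {{ netobserv_auto_detect }}
# Fixed: an inline expression through the bool filter yields a YAML boolean.
openShiftAutoDetect: {{ netobserv_auto_detect | bool }}
```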
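The bluestore-label wipe from 007c920 can be sketched as a single dd-based task. Offsets mirror the ledger entry (0, 1 GiB, 8 GiB, 100 GiB, expressed in MB so they line up with bs=1M); the variable names and device list are illustrative, not the role's actual defaults:

```yaml
# Zero 16 MB at each Reef bluestore label backup position on every backing disk.
- name: Wipe Ceph bluestore label positions on OSD backing devices
  ansible.builtin.command: >
    dd if=/dev/zero of={{ item.1 }} bs=1M count=16
    seek={{ item.0 }} conv=notrunc,fsync
  loop: "{{ wipe_offsets_mb | product(odf_backing_devices) | list }}"
  vars:
    wipe_offsets_mb: [0, 1024, 8192, 102400]   # 0, 1 GiB, 8 GiB, 100 GiB
    odf_backing_devices: ['/dev/nvme7n1']      # per-node device list in practice
  become: true
```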
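The core of the direct-signing path from cd64473 can be sketched as a shell task. File paths are illustrative; note that `-copy_extensions` on `openssl x509 -req` requires OpenSSL 3.0+, and the CA key extraction (pk12util → openssl pkcs12) is assumed to have happened earlier:

```yaml
# Sign the wildcard CSR with the extracted IdM CA key, preserving the SAN,
# then wipe the key. Paths and validity period are hypothetical.
- name: Sign wildcard ingress CSR directly with the IdM CA key
  ansible.builtin.shell: |
    openssl x509 -req -in /root/wildcard-apps.csr \
      -CA /root/idm-ca.crt -CAkey /root/idm-ca.key -CAcreateserial \
      -days 730 -copy_extensions copyall -out /root/wildcard-apps.crt
    shred -u /root/idm-ca.key   # wipe the extracted signing key immediately
```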
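The serial-based root-disk hint from 09b7191 amounts to a small agent-config.yaml fragment per host; the serial value shown here is an assumed example of what the template renders:

```yaml
# agent-config.yaml per-host fragment: match the install disk by the
# libvirt-assigned serial (set via virt-install --disk ...,serial=...)
# instead of a brittle hctl position hint.
rootDeviceHints:
  serialNumber: "ocp-cp-01-root"   # illustrative serial, one per node
```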