Architecture & Pipeline
How a golden image goes from ci/matrix.json to a promoted vSphere template.
Components
| Piece | Where | Role |
|---|---|---|
| `ci/matrix.json` | this repo | Single source of truth: one entry per OS line. |
| `build-templates.yml` | `.github/workflows` | The build pipeline (weekly cron + manual dispatch). |
| `build.sh` | this repo | Wraps `packer init`/`validate`/`build` per OS line. |
| `packer-runner` image | `ghcr.io/codelooks-com/packer-runner` | Pinned toolchain (Packer, Ansible, govc, mise). |
| ARC scale set `packer-vsphere` | `LukeEvansTech/talos-cluster` | Ephemeral runner pods (min 0 / max 3). |
| `validate.yml` / `upload-isos.yml` | `.github/workflows` | `packer validate` every entry (PRs + pushes to main); staging ISOs to the datastore. |
The matrix
Each line in `ci/matrix.json` carries: `key`, `enabled`, `os`/`dist`/`version`,
`build_dir`, `config`, `base_name`, `iso_datastore_path`, and either ISO
discovery fields (`iso_url`/`sums_url`/`discover` for Linux) or manual-media
fields (Windows). Windows entries also carry `edition`, `only`
(restrict to one Packer source), and `timeout_minutes`.
The plan job filters the matrix:
```shell
jq -c --arg sel "$SELECTED" \
  '[ .[] | select((($sel == "all") and .enabled) or ($sel == .key)) ]' \
  ci/matrix.json
```
- A named dispatch key (`-f os=windows-server-2025`) builds exactly that line, even one shipped `enabled: false`.
- `all` / the weekly cron build every line with `enabled: true`.
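
To make the two selection rules concrete, here is a minimal sketch that runs the same filter against a small stand-in matrix. The three entries are hypothetical and trimmed to `key` and `enabled` (only `windows-server-2025` appears in this page), and a trailing `| .key` is added purely to keep the output short; the real plan job emits whole entries.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical three-line matrix; the real ci/matrix.json carries more fields.
matrix='[
  {"key": "ubuntu-24.04",        "enabled": true},
  {"key": "rocky-9",             "enabled": true},
  {"key": "windows-server-2025", "enabled": false}
]'

# Same select() as the plan job, reading the sample from stdin instead of
# the file, and projecting each surviving entry down to its key.
plan() {
  jq -c --arg sel "$1" \
    '[ .[] | select((($sel == "all") and .enabled) or ($sel == .key)) | .key ]' \
    <<<"$matrix"
}

plan all                    # enabled lines only
plan windows-server-2025    # named key, selected even though enabled is false
```

With this sample, `all` yields the two enabled keys and the named dispatch yields the disabled Windows line. A key that matches nothing produces an empty array.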
Scheduling
- Build: weekly, Saturday 02:00 UTC (`build-templates.yml`), `all` enabled lines, `max-parallel: 2` (at most two build VMs hold a DHCP lease at once). Runs serialize via the `build-templates` concurrency group.
- ISO currency: `check-iso-updates.yml`, Monday 06:00 UTC. When a newer Linux point release is published (discovery via the matrix `discover` block), it commits the `iso_url`/`sums_url` + `iso_file` bump straight to `main` (no PR, no approval gate) and dispatches `upload-isos` for each bumped line, so the datastore is staged before Saturday's rebuild. `validate.yml` re-runs `packer validate` on the push to `main`, so the bump is still checked.
GitHub scheduler caveats
Neither is an approval gate, but both affect unattended runs:
- Timing is best-effort. Cron is frequently delayed under load — runs have started hours after the nominal time. Don't rely on the exact minute.
- 60-day auto-disable. GitHub disables scheduled workflows after ~60 days with no repository activity; re-enable from the Actions tab if it happens. The Monday ISO commit to `main` normally keeps the repo active enough to prevent this.
Build flow
- Runner spins up: ARC scales the `packer-vsphere` set from 0; the pod runs the `packer-runner` image (digest-pinned in talos-cluster).
- `build.sh` derives the per-OS var-files and runs `packer build` for the matrix entry (Windows uses `--edition` + `--only` to build a single source).
- Install: media is mounted from the datastore as a CD (`common_data_source=disk`); the guest installs unattended (cloud-init / kickstart / preseed for Linux, `autounattend.xml` on a cidata CD for Windows). The pod needs only egress (vCenter 443, SSH 22 / WinRM 5985), no inbound.
- Provision: the Ansible provisioner connects (SSH for Linux, WinRM for Windows), applies updates and base config.
- Convert + promote: Packer converts the VM to a template named `<base>-build`; the promote step renames it into the stable `<base>` that Terraform clones, rolling the previous generation to `<base>-prev`:
```mermaid
flowchart TB
    S1["destroy <base>-prev<br>(old rollback)"]
    S2["rename <base> → <base>-prev<br>(rollback kept)"]
    S3["rename <base>-build → <base><br>(new template live)"]
    S1 --> S2 --> S3
```
Success-only: a failed build never touches the stable <base>.
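
The three promote steps could be expressed with govc roughly as follows. This is a dry-run sketch only: it prints the commands rather than running them, the `FOLDER` inventory path and the `ubuntu-2404` base name are hypothetical, and the repository's actual promote step may use different tooling and will need to handle a missing `<base>-prev` or `<base>` on a first build.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the promote steps for one template base name, in execution order.
# Assumption: all three generations live in one inventory folder (FOLDER).
promote_plan() {
  local base="$1" folder="${FOLDER:-/dc/vm/templates}"
  echo "govc vm.destroy ${folder}/${base}-prev"             # drop old rollback
  echo "govc object.rename ${folder}/${base} ${base}-prev"  # keep current as rollback
  echo "govc object.rename ${folder}/${base}-build ${base}" # new template goes live
}

promote_plan ubuntu-2404
```

The order matters: the old rollback has to go first to free the `-prev` name, and the stable name is only vacant for the moment between the second and third step.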
Triage tooling
- Console screenshots (the decisive tool): `govc vm.console -capture out.png <vm>` as administrator (the `svc-packer` service account lacks console-interact). Run while the build is in flight.
- Guest state: `govc vm.info -json <vm>` → `runtime.powerState` / `guest.ipAddress` / `guest.toolsRunningStatus`.
- Packer log: `PACKER_LOG=1` writes `packer-build.log`, uploaded as a (secret-scrubbed) artifact on failure only.
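
For the guest-state check, a small jq helper can reduce the JSON to the three fields named above. This is a sketch under assumptions: it expects a govc release that emits lowerCamelCase keys under a top-level `virtualMachines` array (older releases capitalize these names), and the sample document below is a hand-written stand-in with hypothetical values, not captured output.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Summarize `govc vm.info -json <vm>` output read from stdin as
# "<powerState> <ipAddress or -> <toolsRunningStatus>".
# Assumption: lowerCamelCase keys under a top-level "virtualMachines" array.
guest_state() {
  jq -r '.virtualMachines[0]
         | [.runtime.powerState, .guest.ipAddress // "-", .guest.toolsRunningStatus]
         | join(" ")'
}

# Stand-in for real output, trimmed to the fields of interest.
sample='{"virtualMachines":[{"runtime":{"powerState":"poweredOn"},
  "guest":{"ipAddress":null,"toolsRunningStatus":"guestToolsNotRunning"}}]}'

guest_state <<<"$sample"
```

Against a live build the same helper would be fed as `govc vm.info -json <vm> | guest_state`. A powered-on VM with no IP and tools not running usually means the installer has not finished, which is the point to take a console screenshot.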
See Windows Templates for the Windows-specific path and its gotchas.