
Control Panel CA rotation

Every OS node receives two things at registration: a client certificate signed by the Control Panel’s CA, and a copy of the CA’s public cert, which it pins for verifying the CP server. After a CA rotation, those pinned copies no longer match the new CA, and the CP no longer trusts client certificates signed by the old one. Every registered node then fails the mTLS handshake with:

tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "Quazzar Control Center CA")

The CP refuses to auto-regenerate the CA on every boot for exactly this reason — the CA lives on a PersistentVolume at /var/lib/quazzar-cc/certs/{ca.key,ca.crt} and is meant to last 10 years. But rotations do happen (key compromise, vendor switch, accidental PVC recreation with CC_ALLOW_AUTO_CA_GEN=true). This page covers the zero-downtime procedure.

Symptoms — how to know you’ve drifted

  • The CP log fills with TLS handshake error … x509: ECDSA verification failure ….
  • Every OS node shows “needs re-registration” in its CC agent status, with the local log line:

    CC no longer trusts this node’s client certificate — the CA was likely rotated on the Control Panel.

  • The CA fingerprints differ. kubectl exec into the CP pod and run:
    openssl x509 -in /var/lib/quazzar-cc/certs/ca.crt -noout -fingerprint -sha256
    then compare the result with a fleet node’s:
    ssh <node> sudo openssl x509 -in /var/lib/quazzar/certs/ca.crt -noout -fingerprint -sha256
    If the fingerprints differ, you have CA drift.
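
The fingerprint comparison above can be scripted. The following is a minimal sketch, not part of the product: the function names normalize_fp and ca_drift_status are hypothetical, and CP_POD and NODE in the commented wiring are placeholders. It normalizes the label because OpenSSL versions differ in how they capitalize it.

```shell
#!/bin/sh
# Normalize one line of `openssl x509 -noout -fingerprint -sha256` output.
# The label's case varies across OpenSSL versions ("SHA256 Fingerprint=" vs
# "sha256 Fingerprint="), so drop everything up to "=" and upper-case the hex.
normalize_fp() {
    printf '%s\n' "$1" | sed 's/^[^=]*=//' | tr 'a-f' 'A-F'
}

# Print "match" or "drift" for two raw fingerprint lines.
ca_drift_status() {
    if [ "$(normalize_fp "$1")" = "$(normalize_fp "$2")" ]; then
        echo match
    else
        echo drift
    fi
}

# Example wiring (CP_POD and NODE are placeholders you must supply):
#   cp_fp=$(kubectl exec "$CP_POD" -- openssl x509 \
#       -in /var/lib/quazzar-cc/certs/ca.crt -noout -fingerprint -sha256)
#   node_fp=$(ssh "$NODE" sudo openssl x509 \
#       -in /var/lib/quazzar/certs/ca.crt -noout -fingerprint -sha256)
#   ca_drift_status "$cp_fp" "$node_fp"
```

Running the comparison against a few nodes from different registration batches gives a quick picture of how much of the fleet is still pinned to the old CA.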

Recovery — the rotation grace pool

Don’t try to roll back to the old CA: that would cut off any nodes that have already picked up the new one. Instead, keep the previous CA trusted on the CP for a grace window while nodes naturally re-register.

  1. Preserve the old CA cert. Use whatever copy you have from before the rotation: a backup of /var/lib/quazzar-cc/certs/ca.crt, a copy exported from any still-running OS node at /var/lib/quazzar/certs/ca.crt, or a cert pulled out of a registered instance’s heartbeat headers. Place it on the CP host as, e.g., /var/lib/quazzar-cc/certs/ca.crt.old.

  2. Add it to the trust pool. Set the env var on the CP deployment:

    CC_TRUST_ADDITIONAL_CA_CERTS=/var/lib/quazzar-cc/certs/ca.crt.old

    (Comma-separate if you have multiple historical CAs.) Roll the deployment. On boot the CP logs:

    loaded additional trusted CA (rotation grace)

    The active CA is still the only one that signs new certs. The legacy CAs only extend the trust pool for verifying existing client certs.

  3. Re-register affected nodes at your own pace. Each node, once re-registered, picks up the new CA and stops depending on the old one. Use the same flow you used to bootstrap the node originally (generate a registration token in the CP UI, POST it to the OS’s /api/cc/register).

  4. Decommission the legacy CA. Wait until every active fleet node has re-registered and any nodes still pinned to the old CA are either gone or hold expired client certs (client certs are valid for 90 days). Then remove the entry from CC_TRUST_ADDITIONAL_CA_CERTS and roll the deployment again. The pool returns to just the active CA.
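
Steps 2 and 4 might be scripted along these lines. This is a sketch under stated assumptions: the namespace and deployment names (quazzar, quazzar-cc) and the function names are hypothetical and will differ in your cluster. Only the variable name CC_TRUST_ADDITIONAL_CA_CERTS and its comma-separated form come from this guide. The script only defines functions; nothing runs until you call one.

```shell
#!/bin/sh
# Hypothetical names, not from the guide: adjust to your cluster.
CP_NAMESPACE="${CP_NAMESPACE:-quazzar}"
CP_DEPLOYMENT="${CP_DEPLOYMENT:-quazzar-cc}"

# Join one or more CA cert paths into the comma-separated form used by
# CC_TRUST_ADDITIONAL_CA_CERTS.
join_ca_paths() {
    joined=""
    for p in "$@"; do
        joined="${joined:+$joined,}$p"
    done
    printf '%s\n' "$joined"
}

# Step 2: set the variable and wait for the rollout to finish.
# `kubectl set env` edits the pod template, which triggers a rolling update.
trust_legacy_cas() {
    kubectl -n "$CP_NAMESPACE" set env "deployment/$CP_DEPLOYMENT" \
        "CC_TRUST_ADDITIONAL_CA_CERTS=$(join_ca_paths "$@")"
    kubectl -n "$CP_NAMESPACE" rollout status "deployment/$CP_DEPLOYMENT"
}

# Step 4: unset the variable (a trailing "-" removes it) and roll again.
drop_legacy_cas() {
    kubectl -n "$CP_NAMESPACE" set env "deployment/$CP_DEPLOYMENT" \
        CC_TRUST_ADDITIONAL_CA_CERTS-
    kubectl -n "$CP_NAMESPACE" rollout status "deployment/$CP_DEPLOYMENT"
}

# Usage:
#   trust_legacy_cas /var/lib/quazzar-cc/certs/ca.crt.old
#   ... re-register nodes ...
#   drop_legacy_cas
```

If the deployment is managed declaratively (Helm, GitOps), make the equivalent change in the manifest instead, since an imperative `kubectl set env` may be reverted by the next sync.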

Preventing the next drift

  • Always mount /var/lib/quazzar-cc/certs as a PersistentVolume in production. The CP refuses to auto-generate a CA when the directory is missing unless CC_ALLOW_AUTO_CA_GEN=true. Keep that env var unset in prod manifests so a PVC mishap can’t silently mint a fresh CA.

  • Back up the CA pair out-of-band. The 10-year validity makes backups easy to forget, but losing the key forces a full fleet re-registration with no graceful path.

  • Monitor for the noise pattern. Genuine drift produces a steady stream of TLS handshake error … ECDSA verification failure from public source IPs. Load-balancer internal CIDRs don’t count: their TCP health checks probe the mTLS port without certs and are demoted to DEBUG automatically. If you see this error from a public IP, suspect drift.
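
Two of the checks above lend themselves to a small script. The sketch below is illustrative only: the function names are hypothetical, and because this guide does not specify the full log line layout, the counter matches only the two substrings quoted above and does not attempt to filter by source IP.

```shell
#!/bin/sh
# Exit 0 if the manifest text on stdin mentions CC_ALLOW_AUTO_CA_GEN at all.
# The guide recommends leaving it unset in production, so any mention is
# worth flagging in CI.
manifest_mentions_auto_ca_gen() {
    grep -q 'CC_ALLOW_AUTO_CA_GEN'
}

# Count log lines on stdin that contain both substrings the guide
# associates with CA drift.
count_drift_errors() {
    grep 'TLS handshake error' | grep -c 'ECDSA verification failure'
}

# Usage (file and pod names are placeholders):
#   manifest_mentions_auto_ca_gen < prod/cp-deployment.yaml && echo "flag set: review"
#   kubectl logs "$CP_POD" | count_drift_errors
```

A count that keeps growing between samples is the steady stream described above. A handful of isolated hits is more likely a single misconfigured client.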

Why not just rotate the active CA back?

Because any node that registered after the rotation has already pinned the new CA. Reverting the active CA cuts those nodes off instead. The grace-pool approach keeps both populations working until each node transitions on its own schedule.