Add ZVMA pre/post script recipe + env-dump examples

Adds a Kubernetes-ZVMA companion to the existing Windows-ZVM recipe:

- scripts/examples/zerto-zvma-send.ps1 - Zerto-side sender for both pre and post phases; packages the Zerto* env vars into a structured JSON body and POSTs to a {phase}-templated webhook URL.
- scripts/examples/zerto-receiver-notify.ps1 - server-side receiver that posts a Slack/Teams notification, with phase-aware formatting and ZertoForce highlighted on pre.
- scripts/examples/zerto-receiver-vm-healthcheck.ps1 - server-side receiver that pings + port-probes each VM in VmDisplayNames after failover and writes a per-run JSON report.
- scripts/examples/send-env-vars.ps1 + save-env-vars.ps1 - generic env-dump client/receiver pair (the diagnostic that surfaced what the ZVMA scripts-service container exposes).
- docs/recipes/zerto-zvma-pre-post.md - full walkthrough mirroring the existing Windows-ZVM recipe's structure.
- README.md and docs/README.md - link the new recipe and examples.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 14:16:07 -04:00
parent 4954e94d08
commit 821ff9b9ef
8 changed files with 701 additions and 4 deletions
@@ -0,0 +1,277 @@
+# Recipe: Zerto ZVMA (Kubernetes) pre/post scripts → notify + VM health check
+
+> Companion to [Zerto failover post-script → DNS + service checks](zerto-pre-post-scripts.md).
+> That recipe targets the **Windows ZVM** (the older deployment, where the
+> Zerto-side script is a `.ps1` calling `curl.exe`). **This** recipe targets
+> the **ZVMA on Kubernetes** — the newer deployment, where pre/post scripts
+> run inside the in-cluster `scripts-service` container (Linux + pwsh 7).
+> The webhook-server side is the same Windows service in both cases; only
+> the Zerto-side runtime differs.
+
+## What we're building
+
+ZVMA's `scripts-service` pod runs your VPG pre/post scripts inside a Linux
+container. It exposes a small set of `Zerto*` environment variables, and we
+want to:
+
+1. POST those variables to a Webhook Server endpoint at the start (pre) and
+   end (post) of every VPG operation, and
+2. On the receiving Windows host, do something useful with them — at minimum
+   a chat notification, and on `post` a quick health check of the VMs that
+   just powered on.
+
+The endpoints are **Async**, so the Zerto VPG sequence is never blocked by
+slow downstream actions (notifications, port probes, etc.).
+
+```
+Zerto VPG operation starts
+   |
+   +-- ZVMA scripts-service container runs:
+   |     /app/scripts-files/zerto-zvma-send.ps1 -Phase pre
+   |       -> POST http://webhook.dr/hook/zerto-pre   (async, returns 202)
+   |
+   +-- VMs come up at recovery site
+   |
+   +-- ZVMA scripts-service container runs:
+         /app/scripts-files/zerto-zvma-send.ps1 -Phase post
+           -> POST http://webhook.dr/hook/zerto-post  (async, returns 202)
+
+(meanwhile, on the webhook server)
+   /hook/zerto-pre  -> Slack/Teams notification ("Test failover starting...")
+   /hook/zerto-post -> Slack/Teams notification + ping/port probe each VM,
+                       write a JSON report to disk, exit non-zero on failure.
+```
+
+## What ZVMA exposes
+
+Captured from a real Test failover; the same set of variables is present in both pre and post:
+
+| Variable | Example | Notes |
+|---|---|---|
+| `ZertoVPGName` | `ubuntu-2404-local` | The VPG that fired the script |
+| `ZertoInternalVpgName` | `ubuntu-2404-local` | Usually identical to `ZertoVPGName` |
+| `ZertoOperation` | `Test` | `Test` / `Failover` / `Move` / `FailoverBeforeCommit` / `FailoverDuringCommit` |
+| `ZertoForce` | `Yes` (pre) / `No` (post) | Set to `Yes` only during the pre phase when force mode is on; reset to `No` by post |
+| `VmDisplayNames` | `ubuntu-2404(1)(1)(1)` | Comma-separated for multi-VM VPGs; Test failovers add `(N)` suffixes |
+| `ZertoHypervisorManagerIP` | `192.168.50.20` | The vCenter / Hyper-V manager ZVMA is talking to |
+| `ZertoHypervisorManagerPort` | `443` | |
+| `ZertoOutputDir` | `/app/scripts-output` | Container-side output dir (written back to ZVMA via PVC) |
+| `ZertoWorkingDir` | `/app/scripts-files` | Where script files live in-container |
+
+Branch on `ZertoOperation` to differentiate Test runs from real failovers.
+**`ZertoForce` is only meaningful during the pre phase** — capture it there
+if you need it later, because by post it's been reset.
+
+## 1. The Zerto-side script (sender)
+
+A ready-to-use script ships in this repo at
+[`scripts/examples/zerto-zvma-send.ps1`](../../scripts/examples/zerto-zvma-send.ps1).
+Place it where the `scripts-service` pod can read it — typically the
+`scripts-service-scripts-files-pvc`, mounted at `/app/scripts-files/` — and
+wire it into the VPG twice:
+
+> **VPG settings → Recovery → Scripts → Pre-Recovery Script**
+> Path: `/app/scripts-files/zerto-zvma-send.ps1`
+> Parameters: `-Phase pre`
+>
+> **VPG settings → Recovery → Scripts → Post-Recovery Script**
+> Path: `/app/scripts-files/zerto-zvma-send.ps1`
+> Parameters: `-Phase post`
+
+The default `$WebhookUrl` includes `{phase}` so one script + one URL config
+serves both phases — `http://webhook.dr/hook/zerto-{phase}` becomes
+`/hook/zerto-pre` and `/hook/zerto-post` automatically. Override with
+`-WebhookUrl` and `-Bearer` if you'd rather pass them per-VPG.
+
+The script POSTs a single JSON object:
+
+```json
+{
+  "phase": "pre",
+  "capturedAt": "2026-05-08T17:45:54Z",
+  "host": "scripts-service-f9b6cb7-4xbxq",
+  "zerto": {
+    "vpgName":               "ubuntu-2404-local",
+    "internalVpgName":       "ubuntu-2404-local",
+    "operation":             "Test",
+    "force":                 "Yes",
+    "vmDisplayNames":        "ubuntu-2404(1)(1)(1)",
+    "hypervisorManagerIP":   "192.168.50.20",
+    "hypervisorManagerPort": "443",
+    "outputDir":             "/app/scripts-output",
+    "workingDir":            "/app/scripts-files"
+  }
+}
+```
+
+A webhook outage **does not fail the VPG**: the script catches the error and
+exits 0. A comment in the file shows how to switch to strict mode if you'd
+rather a webhook outage abort the failover.
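+
+As a rough illustration of that flow (not the shipped script), a sender
+could be sketched as below. The helper names `Resolve-WebhookUrl`,
+`New-ZertoPayload` and `Send-ZertoWebhook`, the 10-second timeout, and the
+use of `[Environment]::MachineName` for `host` are assumptions made for
+this example.
+
+```powershell
+# Sketch only: helper names, timeout, and the 'host' source are assumptions.
+function Resolve-WebhookUrl {
+    param([string] $Template, [string] $Phase)
+    # 'http://webhook.dr/hook/zerto-{phase}' becomes '.../zerto-pre' or '.../zerto-post'
+    $Template.Replace('{phase}', $Phase)
+}
+
+function New-ZertoPayload {
+    param([string] $Phase)
+    [ordered]@{
+        phase      = $Phase
+        capturedAt = [DateTime]::UtcNow.ToString('s') + 'Z'
+        host       = [Environment]::MachineName
+        zerto      = [ordered]@{
+            vpgName               = $env:ZertoVPGName
+            internalVpgName       = $env:ZertoInternalVpgName
+            operation             = $env:ZertoOperation
+            force                 = $env:ZertoForce
+            vmDisplayNames        = $env:VmDisplayNames
+            hypervisorManagerIP   = $env:ZertoHypervisorManagerIP
+            hypervisorManagerPort = $env:ZertoHypervisorManagerPort
+            outputDir             = $env:ZertoOutputDir
+            workingDir            = $env:ZertoWorkingDir
+        }
+    }
+}
+
+function Send-ZertoWebhook {
+    param([string] $Phase, [string] $WebhookUrl, [string] $Bearer)
+    $headers = @{}
+    if ($Bearer) { $headers['Authorization'] = "Bearer $Bearer" }
+    $json = New-ZertoPayload -Phase $Phase | ConvertTo-Json -Depth 4
+    try {
+        $url = Resolve-WebhookUrl -Template $WebhookUrl -Phase $Phase
+        Invoke-RestMethod -Uri $url -Method Post -Body $json `
+            -ContentType 'application/json' -Headers $headers -TimeoutSec 10 | Out-Null
+        return $true
+    } catch {
+        # Lenient mode: log and carry on so the VPG operation is not failed.
+        Write-Warning "Webhook POST failed: $($_.Exception.Message)"
+        return $false
+    }
+}
+```
+
+Calling `Send-ZertoWebhook` and then ending with `exit 0` regardless of the
+result gives the lenient behaviour described above; exiting non-zero when
+it returns `$false` would be the strict variant.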
+
+## 2. The webhook-server-side scripts (receivers)
+
+Two examples ship in the repo. Both read the JSON body from stdin (the
+webhook server delivers the body to the script's stdin when **JSON body to
+stdin** is ticked on the endpoint).
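+
+A receiver's opening lines can be quite small. The sketch below assumes
+the body shape shown in section 1; the function name
+`ConvertFrom-ZertoBody` and the variable `$p` are illustrative and not
+taken from the shipped scripts.
+
+```powershell
+# Sketch only: parse and sanity-check the JSON body delivered on stdin.
+function ConvertFrom-ZertoBody {
+    param([string] $Json)
+    if ([string]::IsNullOrWhiteSpace($Json)) { throw 'Empty request body on stdin.' }
+    $p = $Json | ConvertFrom-Json
+    if (-not $p.zerto) { throw "Body has no 'zerto' object." }
+    $p
+}
+
+# In a receiver script:
+#   $p = ConvertFrom-ZertoBody ([Console]::In.ReadToEnd())
+#   "$($p.phase): $($p.zerto.operation) on $($p.zerto.vpgName)"
+```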
+
+### a. Slack/Teams notification — both phases
+
+[`scripts/examples/zerto-receiver-notify.ps1`](../../scripts/examples/zerto-receiver-notify.ps1)
+posts a single-line summary to a Slack or Teams Incoming Webhook URL. It
+picks an icon based on `ZertoOperation`:
+
+- `Test` → 🧪 — benign, expected
+- `Failover` → 🚨 — real production event
+- `Move` → 🚚 — planned migration
+
+…and highlights `ZertoForce=Yes` on the **pre** message so you can see at
+a glance whether the operation was force-flagged.
+
+Set the destination via the `NOTIFY_URL` environment variable on the webhook
+host, or hardcode it at the top of the script.
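+
+The icon choice and force highlight could be sketched roughly as follows.
+This is not the shipped notifier: the function names, the exact message
+layout, and the fallback icon are assumptions. The `{ "text": ... }` body
+is the simplest form a Slack Incoming Webhook accepts; a Teams destination
+may expect a different body shape and Unicode emoji rather than Slack
+shortcodes.
+
+```powershell
+# Sketch only: message layout and fallback icon are assumptions.
+function Format-ZertoMessage {
+    param([string] $Phase, [string] $Operation, [string] $VpgName, [string] $Force)
+    $icon = switch ($Operation) {
+        'Test'     { ':test_tube:' }
+        'Failover' { ':rotating_light:' }
+        'Move'     { ':truck:' }
+        default    { ':information_source:' }
+    }
+    $message = "$icon Zerto $Operation - phase: $Phase - VPG: $VpgName"
+    # ZertoForce is only meaningful on pre, so only flag it there.
+    if ($Phase -eq 'pre' -and $Force -eq 'Yes') { $message += ' - *FORCE*' }
+    $message
+}
+
+function Send-ChatNotification {
+    param([string] $Text, [string] $Url = $env:NOTIFY_URL)
+    if (-not $Url) { throw 'NOTIFY_URL is not set.' }
+    $body = @{ text = $Text } | ConvertTo-Json
+    Invoke-RestMethod -Uri $Url -Method Post -Body $body -ContentType 'application/json' | Out-Null
+}
+```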
+
+### b. Post-recovery VM health check — post phase only
+
+[`scripts/examples/zerto-receiver-vm-healthcheck.ps1`](../../scripts/examples/zerto-receiver-vm-healthcheck.ps1)
+runs only on `phase=post` for operations that bring VMs up
+(`Test`/`Failover`/`Move`/`FailoverBeforeCommit`/`FailoverDuringCommit`).
+For each name in `VmDisplayNames` it:
+
+1. Strips the trailing `(N)` suffixes (e.g. `(1)(1)(1)`) that Zerto adds on
+   Test failovers, so DNS resolution targets the actual hostname.
+2. Pings (`Test-Connection`).
+3. Probes a configurable TCP port (`-ProbePort`, default `3389` for RDP;
+   use `22` for SSH or `443` for the web tier).
+4. Writes a JSON report to
+   `C:\ProgramData\WebhookServer\zerto-healthchecks\<vpg>-<op>-<utcstamp>.json`.
+5. Exits non-zero if any VM failed either probe — which surfaces in the
+   webhook server's run history (and outbound callback, if configured).
+
+Bump the endpoint's **Timeout (sec)** to `120` when wiring this in, since
+network probes can take a while.
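+
+The suffix stripping and the two probes can be sketched as below. This is
+an illustration under assumptions, not the shipped script: the function
+names, the regex, and the 3-second connect timeout are made up for this
+example.
+
+```powershell
+# Sketch only: names, regex, and timeout are assumptions.
+function Get-ZertoVmHostName {
+    param([string] $DisplayName)
+    # 'ubuntu-2404(1)(1)(1)' -> 'ubuntu-2404'; only trailing (N) groups are removed.
+    $DisplayName.Trim() -replace '(\s*\(\d+\))+$', ''
+}
+
+function Test-TcpPort {
+    param([string] $HostName, [int] $Port, [int] $TimeoutMs = 3000)
+    $client = [System.Net.Sockets.TcpClient]::new()
+    try {
+        $task = $client.ConnectAsync($HostName, $Port)
+        # Wait() throws if the connection is refused; that counts as a failed probe.
+        return ($task.Wait($TimeoutMs) -and $client.Connected)
+    } catch {
+        return $false
+    } finally {
+        $client.Close()
+    }
+}
+
+function Test-ZertoVm {
+    param([string] $DisplayName, [int] $ProbePort = 3389)
+    $name = Get-ZertoVmHostName $DisplayName
+    [pscustomobject]@{
+        vm   = $name
+        ping = [bool](Test-Connection -ComputerName $name -Count 1 -Quiet)
+        port = Test-TcpPort -HostName $name -Port $ProbePort
+    }
+}
+```
+
+A caller would split `vmDisplayNames` on commas, run `Test-ZertoVm` for
+each entry, serialise the results with `ConvertTo-Json`, and exit non-zero
+if any `ping` or `port` value is `$false`.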
+
+## 3. Configure the endpoints in the GUI
+
+Two endpoints, identical except for the slug, the description, the script,
+and the timeout (raised for the healthcheck).
+
+### `zerto-pre`
+
+| Section | Setting | Value |
+|---|---|---|
+| Identity | Slug | `zerto-pre` |
+| Identity | Description | "Zerto pre-recovery: chat notification" |
+| Auth | Mode | **Bearer** |
+| Auth | Bearer secret | generate a 32-byte random string; reuse for `zerto-post` |
+| Allowed clients | (one per line) | the IP of the K8s node running `scripts-service` (e.g. `192.168.50.30`) |
+| Executor | Type | **Windows PowerShell** (or PowerShell 7) |
+| Executor | Script path | `C:\scripts\zerto-receiver-notify.ps1` |
+| Data passing | JSON body to stdin | ✓ |
+| Run as | Identity | **Service** |
+| Response | Mode | **Async** |
+| Response | Timeout (sec) | `30` |
+| Response | Fail on non-zero exit | unticked *(async hooks have no caller to receive a 502)* |
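+
+One way to produce the bearer secret is sketched below. The function name
+`New-BearerSecret` is hypothetical, and hex encoding (64 characters for 32
+random bytes) is an assumption chosen so the value pastes cleanly into a
+header or a script parameter.
+
+```powershell
+# Sketch only: generates 32 cryptographically random bytes, hex-encoded.
+function New-BearerSecret {
+    param([int] $ByteCount = 32)
+    $bytes = [byte[]]::new($ByteCount)
+    $rng = [System.Security.Cryptography.RandomNumberGenerator]::Create()
+    try { $rng.GetBytes($bytes) } finally { $rng.Dispose() }
+    -join ($bytes | ForEach-Object { $_.ToString('x2') })
+}
+
+New-BearerSecret
+```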
+
+### `zerto-post`
+
+Same as above, except:
+
+| Setting | Value |
+|---|---|
+| Slug | `zerto-post` |
+| Description | "Zerto post-recovery: notify + VM health check" |
+| Script path | a **wrapper** that calls both receiver scripts in turn (see below) |
+| Timeout (sec) | `120` |
+
+Two receivers on one endpoint is easiest with a tiny wrapper that fans
+stdin out to both scripts:
+
+```powershell
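+# C:\scripts\zerto-post-fanout.ps1
+$body = [Console]::In.ReadToEnd()
+# Piping to & feeds each script's $input, not a fresh process stdin.
+$body | & 'C:\scripts\zerto-receiver-notify.ps1'
+$notifyExit = $LASTEXITCODE
+$body | & 'C:\scripts\zerto-receiver-vm-healthcheck.ps1'
+# Propagate a receiver failure so it shows up in the run history.
+if ($LASTEXITCODE) { exit $LASTEXITCODE } else { exit $notifyExit }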
+```
+
+Or run the two as separate endpoints (`zerto-post-notify` and
+`zerto-post-healthcheck`) and have the Zerto-side script POST to both —
+either pattern is fine. The fanout wrapper keeps the Zerto config simpler.
+
+## 4. Wire up the bearer token
+
+On the ZVMA / scripts-service side, the safest place to put the token is
+a Kubernetes Secret mounted into the pod, but the simplest approach for
+testing is to pass it as a parameter to the Zerto-side script:
+
+> VPG settings → Pre-Recovery Script → Parameters:
+> `-Phase pre -Bearer <paste-token>`
+>
+> VPG settings → Post-Recovery Script → Parameters:
+> `-Phase post -Bearer <paste-token>`
+
+For production, mount a Secret at a known path in the pod and have the
+sender script read from it (`Get-Content /run/secrets/webhook-token`).
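+
+A token lookup along those lines might look like the sketch below. The
+function name `Get-WebhookToken` is hypothetical, and preferring an
+explicit `-Bearer` value over the mounted file is an assumption about the
+desired precedence.
+
+```powershell
+# Sketch only: explicit -Bearer wins, otherwise read the mounted Secret file.
+function Get-WebhookToken {
+    param([string] $Bearer, [string] $SecretPath = '/run/secrets/webhook-token')
+    if ($Bearer) { return $Bearer }
+    if (Test-Path -LiteralPath $SecretPath) {
+        $raw = Get-Content -LiteralPath $SecretPath -Raw
+        # Mounted Secret files often end with a newline; trim it off.
+        if ($raw) { return $raw.Trim() }
+    }
+    return ''
+}
+```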
+
+## 5. Test before going live
+
+Run a Test failover on a non-critical VPG. Watch:
+
+- **Slack/Teams**: a `:test_tube: Zerto Test - phase: pre` message arrives,
+  followed anywhere from 30 seconds to several minutes later by a
+  `:test_tube: Zerto Test - phase: post` message.
+- **Webhook Server GUI** → run history: one run each for `zerto-pre` and
+  `zerto-post`, both green.
+- **`C:\ProgramData\WebhookServer\zerto-healthchecks\`**: a fresh JSON
+  report named `<vpg>-Test-<utcstamp>.json` containing per-VM ping and port
+  probe results.
+- **ZVMA**: the VPG operation completes successfully; nothing in the
+  pre/post logs blocked on the webhook.
+
+## Variations
+
+### Branch on Test vs. real failover in the receivers
+
+The notifier already styles the message differently. To do something only
+on a real failover (e.g. update DNS), guard with:
+
+```powershell
+if ($p.zerto.operation -ne 'Test') {
+    # do the destructive thing
+}
+```
+
+A `ZertoOperation` of `Test` means "exercise — don't touch production
+dependencies." Always check it before doing anything that mutates real
+state.
+
+### Capture `ZertoForce` from pre for use in post
+
+`ZertoForce` is `Yes` only during the **pre** phase when force mode is on
+and is reset to `No` by the **post** phase. If your post-side logic needs
+to know the operation was force-flagged, save it during pre (e.g. write a
+small marker to the shared `ZertoOutputDir`) and read it back during post.
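+
+A marker-file approach could be sketched as follows, for use in the
+Zerto-side script. It assumes `ZertoOutputDir` persists between the pre and
+post runs of the same operation; the function names and the
+`zerto-force-<vpg>.txt` file name are made up for this example.
+
+```powershell
+# Sketch only: file naming and helper names are assumptions.
+function Get-ForceMarkerPath {
+    param([string] $OutputDir, [string] $VpgName)
+    $safe = $VpgName -replace '[^A-Za-z0-9._-]', '_'
+    Join-Path $OutputDir "zerto-force-$safe.txt"
+}
+
+function Save-ZertoForce {   # call during pre
+    param(
+        [string] $OutputDir = $env:ZertoOutputDir,
+        [string] $VpgName   = $env:ZertoVPGName,
+        [string] $Force     = $env:ZertoForce
+    )
+    Set-Content -LiteralPath (Get-ForceMarkerPath $OutputDir $VpgName) -Value $Force
+}
+
+function Read-ZertoForce {   # call during post
+    param(
+        [string] $OutputDir = $env:ZertoOutputDir,
+        [string] $VpgName   = $env:ZertoVPGName
+    )
+    $path = Get-ForceMarkerPath $OutputDir $VpgName
+    if (-not (Test-Path -LiteralPath $path)) { return 'No' }
+    $value = Get-Content -LiteralPath $path -TotalCount 1
+    # Remove the marker so it cannot leak into the next operation.
+    Remove-Item -LiteralPath $path
+    if ($value) { "$value".Trim() } else { 'No' }
+}
+```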
+
+### Per-VPG endpoints
+
+For fine-grained access control or different actions per VPG, create one
+endpoint per VPG (`zerto-pre-app01`, `zerto-post-app01`, …) with its own
+bearer token. Override `-WebhookUrl` and `-Bearer` on the Zerto side per
+VPG.
+
+### Audit trail
+
+Every endpoint can have an outbound **Callback** URL. Configure it with your
+SIEM's HTTP collector and an HMAC secret, and every run produces a JSON
+record with runId, exit code, duration, stdout, and stderr, which is
+convenient for compliance.
+
+## Security note
+
+The ZVMA `scripts-service` pod runs your scripts inside a Linux container
+with broad reach into the management cluster — anything your script does
+runs with whatever ServiceAccount that pod uses. Treat the script content
+as privileged and make sure pre/post script edit rights are restricted to
+trusted operators. If you're unfamiliar with the pod's RBAC posture, run
+`Get-ChildItem Env:` from inside the container and look at
+`/var/run/secrets/kubernetes.io/serviceaccount/`: that token is what your
+scripts (or a malicious script) can use to talk to the K8s API.