Files
webhook-server/docs/recipes/zerto-pre-post-scripts.md
justin b17d832842
Sync Wiki / sync (push) Has been cancelled
CI / build (push) Has been cancelled
Sync from GitHub main: v0.1.1 + v0.1.2 + wiki sync (#3)
2026-05-08 11:14:17 -04:00

244 lines
11 KiB
Markdown

# Recipe: Zerto failover post-script → DNS update + service checks
This is the canonical reason Webhook Server exists.
When Zerto fails a VM over from production to DR, the VM boots fine — but **the things around it** often need attention: DNS records still point at the production IP, dependent services need to be checked, on-call needs a heads-up. Zerto pre/post scripts run on the **Zerto Virtual Manager**, not on a domain controller and not necessarily with admin rights to the things that need fixing. So you want a single webhook URL that the post-script hits, and a Windows host on the DR side that does the actual work with the right identity.
## What we're building
Zerto's post-recovery script (a one-shot PowerShell file pointing at curl) calls `http://webhook.dr.contoso.local:8080/hook/post-failover` with a JSON body identifying the VPG and operation. The Webhook Server, running on a DR-side Windows host as a gMSA with delegated AD/DNS rights, runs PowerShell that:
1. Updates DNS A records to point the failed-over hostnames at their DR IPs
2. Waits for the failed-over VM to come up (ping + WinRM probe)
3. Connects to the VM via PowerShell remoting and starts/checks critical services
4. Sends a Teams notification with the result
The endpoint is **Async** so the Zerto script returns in milliseconds — no risk of timing out Zerto's failover sequence even if the actions take minutes. The script's full output ends up in the webhook log and (optionally) in an outbound callback.
## Why curl and not Invoke-WebRequest?
Zerto's PowerShell runner is intentionally minimal — many environments run an older Windows on the ZVM and don't have full PowerShell modules installed. `curl.exe` ships with Windows 10 1803+ and Server 2019+ and works without any modules. Plus, calling an HTTP endpoint with `curl.exe` doesn't depend on the version of `Invoke-WebRequest` shipped with the host's PowerShell.
## 1. The Zerto post-script (client side)
A ready-to-use script ships in this repo at [`scripts/examples/zerto-post-failover.ps1`](../../scripts/examples/zerto-post-failover.ps1). Copy it to the ZVM, edit `$WebhookUrl` and the bearer-token path at the top, and wire it into the VPG:
> **VPG settings → Recovery → Scripts → Post-Recovery Script**
> Path: `C:\Scripts\zerto-post-failover.ps1`
> Parameters: *(leave empty)*
The script is ~50 lines and only depends on `curl.exe` + a token file readable by the ZVM service account.
The flow:
```
Zerto VPG failover starts
|
+-- VM is brought up at DR site
|
+-- Zerto post-script fires:
| curl POST http://webhook.dr/hook/post-failover (async, returns 202 in ~50ms)
|
+-- Zerto sees success, finishes the failover and reports done
|
(meanwhile, on the webhook server)
|
running PowerShell for several minutes:
- update DNS
- wait for VM ready
- check services on VM
- notify Teams
```
## 2. The server-side script (does the actual work)
Save this on the webhook host as `C:\Scripts\post-failover-handler.ps1`:
```powershell
[CmdletBinding()]
param()
$ErrorActionPreference = 'Stop'
$body = $input | ConvertFrom-Json
# ---------- environment specifics; edit for your site ----------
$dnsServer = 'dc01.contoso.local'
$forwardZone = 'contoso.local'
$teamsWebhook = 'https://contoso.webhook.office.com/...'
$drIpMap = @{
'app01' = '10.42.10.11'
'app02' = '10.42.10.12'
'db01' = '10.42.10.21'
}
$serviceMap = @{
'app01' = @('W3SVC','MyAppSvc')
'app02' = @('W3SVC','MyAppSvc')
'db01' = @('MSSQLSERVER','SQLAgent')
}
# ---------------------------------------------------------------
# Default the VM list to "all VMs we know about" if the post-script didn't
# tell us, so the same handler works without having to embed the VM list in
# every Zerto post-script.
$vms = if ($body.vms) { $body.vms } else { $drIpMap.Keys }
$summary = @()
foreach ($vm in $vms) {
if (-not $drIpMap.ContainsKey($vm)) {
$summary += "skip $vm (no DR IP mapping in handler)"
continue
}
$ip = $drIpMap[$vm]
# 1. DNS - delete + re-add the A record
try {
$existing = Get-DnsServerResourceRecord -ZoneName $forwardZone -Name $vm `
-RRType A -ComputerName $dnsServer -ErrorAction SilentlyContinue
if ($existing) {
Remove-DnsServerResourceRecord -ZoneName $forwardZone -Name $vm `
-RRType A -RecordData $existing.RecordData.IPv4Address `
-ComputerName $dnsServer -Force
}
Add-DnsServerResourceRecordA -ZoneName $forwardZone -Name $vm `
-IPv4Address $ip -ComputerName $dnsServer -TimeToLive 00:05:00
$summary += "dns $vm -> $ip"
} catch {
$summary += "DNS! $vm $($_.Exception.Message)"
continue
}
# 2. Wait for the VM to be reachable (up to 5 minutes)
$deadline = (Get-Date).AddMinutes(5)
$reachable = $false
while ((Get-Date) -lt $deadline) {
if (Test-Connection -ComputerName $ip -Count 1 -Quiet -ErrorAction SilentlyContinue) {
try {
# Quick WinRM probe; succeeds when the VM has finished booting
Invoke-Command -ComputerName $ip -ScriptBlock { $true } -ErrorAction Stop | Out-Null
$reachable = $true
break
} catch { Start-Sleep -Seconds 10 }
} else {
Start-Sleep -Seconds 10
}
}
if (-not $reachable) {
$summary += "wait! $vm not reachable after 5 minutes"
continue
}
# 3. Check + start critical services on the VM
if ($serviceMap.ContainsKey($vm)) {
$svcReport = Invoke-Command -ComputerName $ip -ArgumentList @(,$serviceMap[$vm]) -ScriptBlock {
param($services)
$report = @()
foreach ($s in $services) {
$svc = Get-Service -Name $s -ErrorAction SilentlyContinue
if (-not $svc) { $report += "$s : missing"; continue }
if ($svc.Status -ne 'Running') {
Start-Service $s
Start-Sleep -Seconds 2
$svc.Refresh()
}
$report += "$s : $($svc.Status)"
}
return $report
}
$summary += "svc $vm : $($svcReport -join ', ')"
} else {
$summary += "svc $vm (no services configured)"
}
}
# 4. Notify Teams
$teamsBody = @{
text = "Webhook post-failover for VPG **$($body.vpg)**:`n" + ($summary -join "`n")
} | ConvertTo-Json
try {
Invoke-RestMethod -Uri $teamsWebhook -Method POST -ContentType 'application/json' -Body $teamsBody | Out-Null
} catch {
$summary += "teams! notification failed: $($_.Exception.Message)"
}
# Return the summary so it shows up in the webhook log + outbound callback
$summary -join "`n"
```
Two things to call out:
- **PowerShell remoting to the VM** uses the gMSA's network identity (or whoever the service runs as). Make sure the gMSA / service account can `Invoke-Command` to the failed-over hosts — usually that means the account is a local admin on the target VMs, or you've configured constrained delegation.
- **WinRM** must be enabled on the failed-over VMs for the remoting calls to work. `Enable-PSRemoting` is the simplest, but most prod environments configure WinRM via Group Policy.
## 3. Configure the endpoint in the GUI
**File → New endpoint:**
| Section | Setting | Value |
|---|---|---|
| Identity | Slug | `post-failover` |
| Identity | Description | "Zerto post-recovery: DNS + service checks" |
| Auth | Mode | **Bearer** |
| Auth | Bearer secret | generate a 32-byte random string; copy it for the Zerto script's token file |
| Allowed clients | (one per line) | `10.0.0.0/8` *(your ZVM's network)* |
| Executor | Type | **Windows PowerShell** |
| Executor | Script path | `C:\Scripts\post-failover-handler.ps1` |
| Data passing | JSON body to stdin | ✓ |
| Run as | Identity | **Service** if the service runs under a gMSA with the right rights, otherwise **SpecificUser** with a delegated account |
| Response | Mode | **Async** ← critical: this is what makes the Zerto script non-blocking |
| Response | Timeout (sec) | `600` *(this is the cap on the long-running handler script, not the Zerto-facing response)* |
| Response | Fail on non-zero exit | unticked *(async hooks have no caller to receive a 502)* |
Save. Right-click the row → **Copy URL** to grab `http://webhook.dr.contoso.local:8080/hook/post-failover` and paste it into `$WebhookUrl` at the top of the Zerto-side script.
> **Why Bearer instead of HMAC?** Both work. Bearer is simpler — drop the token in a file on the ZVM that's readable by the ZVM service account and you're done. HMAC requires the Zerto-side script to compute a signature, which is doable but adds a few lines of code. Pick what fits your environment.
## 4. Wire up the bearer token
Place the bearer token in a file the ZVM service account can read (and nobody else):
```powershell
# on the ZVM, from elevated PowerShell
$token = (New-Guid).ToString('N') # or paste the value from the GUI
$tokenPath = 'C:\ProgramData\Zerto\webhook-token.txt'
$token | Out-File -LiteralPath $tokenPath -Encoding utf8 -NoNewline
icacls $tokenPath /inheritance:r /grant 'NT SERVICE\Zerto Online Services:R' 'BUILTIN\Administrators:F' /T
```
Adjust the service principal name to whatever Zerto runs as on your version. The script reads from this path automatically; no change needed in the script itself.
## 5. Test before going live
In a maintenance window, fire the webhook by hand:
```powershell
# from any machine that can reach the webhook server
$body = @{
operation = 'test'
vpg = 'SmokeTest'
timestamp = (Get-Date).ToUniversalTime().ToString('o')
} | ConvertTo-Json -Compress
curl.exe --silent --show-error --max-time 10 -X POST `
-H "Authorization: Bearer paste-the-token" `
-H "Content-Type: application/json" `
-d $body `
http://webhook.dr.contoso.local:8080/hook/post-failover
```
You'll get back `{"runId":"…","accepted":true}` immediately. Open the Webhook Server GUI and watch the log panel — within 30 seconds or so you'll see lines for the run. Confirm DNS records updated, services on each VM ended in `Running`, and the Teams notification arrived.
## Variations
### Different actions for failover vs. failback
Pass an `operation` field in the body and branch on it. The Zerto-side script already sends `operation = 'failover'`. Add a separate post-failback script (or detect from `$env:ZertoOperationType`) that sends `operation = 'failback'` and have the handler revert DNS to production IPs.
### Per-VPG endpoints
If you want fine-grained access control or different actions per VPG, create one endpoint per VPG (`post-failover-app`, `post-failover-db`, …) and give each its own bearer token. The GUI handles dozens of endpoints fine.
### Audit trail to a SIEM
Each endpoint can have an outbound **Callback** URL. Configure it with your SIEM's HTTP collector + an HMAC secret, and every run produces a JSON record with runId, exit code, duration, stdout, and stderr — perfect for compliance.