feat: add zroc-planner Python vCenter collector

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
Justin
2026-04-12 19:35:31 -04:00
parent 74c05e5a58
commit 796bafac63
6 changed files with 964 additions and 0 deletions
+187
View File
@@ -0,0 +1,187 @@
# zROC Planner — vCenter Metrics Collector
A Python-based Prometheus exporter that replaces zPlanner's PowerCLI scripts.
It queries vCenter for per-VM virtual-disk I/O statistics using the pyvmomi SDK
and exposes them on a `/metrics` endpoint for Prometheus to scrape.
## Metrics exposed
| Prometheus metric | Unit | Description |
|---|---|---|
| `vcenter_vm_disk_write_iops` | IOPS | Write IOPS (sum across all disk instances) |
| `vcenter_vm_disk_write_throughput_mbps` | MB/s | Write throughput (sum across all disk instances) |
| `vcenter_vm_disk_write_latency_ms` | ms | Write latency (mean across all disk instances) |
Every metric carries these labels:
| Label | Example | Notes |
|---|---|---|
| `vm_name` | `web-prod-01` | VM display name |
| `vm_moref` | `vm-1234` | vCenter Managed Object Reference (stable ID) |
| `cluster` | `Cluster-01` | Compute cluster name |
| `host` | `esxi-01.corp` | ESXi host the VM is running on |
| `datacenter` | `DC-East` | vCenter datacenter name |
### Self-monitoring metrics
| Metric | Description |
|---|---|
| `vcenter_collector_last_collection_timestamp_seconds` | Unix timestamp of the last successful poll |
| `vcenter_collector_last_collection_duration_seconds` | How long the last poll took |
| `vcenter_collector_last_vm_count` | VMs collected in the last cycle |
| `vcenter_collector_cycles_total` | Running count of completed cycles |
## Configuration
All settings are environment variables.
| Variable | Default | Description |
|---|---|---|
| `VCENTER_HOST` | `vcenter.local` | vCenter hostname or IP |
| `VCENTER_USER` | `administrator@vsphere.local` | vCenter username (read-only is sufficient) |
| `VCENTER_PASSWORD` | _(required)_ | vCenter password |
| `VCENTER_PORT` | `443` | vCenter HTTPS port |
| `VCENTER_SSL_VERIFY` | `false` | Set `true` to enforce TLS certificate validation |
| `POLL_INTERVAL` | `300` | Seconds between collection cycles |
| `BATCH_SIZE` | `100` | VMs per QueryPerf call (VMware recommends ≤ 200) |
| `BATCH_DELAY` | `0.5` | Seconds to sleep between batches |
| `VM_INVENTORY_TTL` | `600` | Seconds between VM inventory refreshes |
| `PERF_INTERVAL_ID` | `300` | vCenter rollup interval (300 = 5-minute stats) |
| `HTTP_HOST` | `0.0.0.0` | IP the HTTP server binds to |
| `HTTP_PORT` | `9272` | Port for `/metrics` and `/health` |
| `LOG_LEVEL` | `INFO` | Python log level (`DEBUG`, `INFO`, `WARNING`, `ERROR`) |
## Running
### Docker (recommended)
```bash
docker build -t zroc-planner:latest .
docker run -d \
--name zroc-planner \
-p 9272:9272 \
-e VCENTER_HOST=vcenter.corp.example \
-e VCENTER_USER=svc-readonly@vsphere.local \
-e VCENTER_PASSWORD=supersecret \
-e POLL_INTERVAL=300 \
-e BATCH_SIZE=100 \
zroc-planner:latest
```
### Docker Compose
Add to your existing `docker-compose.yaml`:
```yaml
zroc-planner:
build: ./zroc-planner
container_name: zroc-planner
hostname: zroc-planner
ports:
- "9272:9272"
environment:
- VCENTER_HOST=vcenter.corp.example
- VCENTER_USER=svc-readonly@vsphere.local
- VCENTER_PASSWORD=supersecret
- POLL_INTERVAL=300
- BATCH_SIZE=100
- BATCH_DELAY=0.5
- VM_INVENTORY_TTL=600
- LOG_LEVEL=INFO
networks:
- back-tier
restart: always
```
Then add a scrape job to `prometheus/prometheus.yml`:
```yaml
scrape_configs:
- job_name: vcenter_planner
scrape_interval: 300s
scrape_timeout: 30s
static_configs:
- targets: ['zroc-planner:9272']
```
### Local (dev)
```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
export VCENTER_HOST=vcenter.local
export VCENTER_USER=administrator@vsphere.local
export VCENTER_PASSWORD=password
python server.py
```
## Endpoints
| Path | Description |
|---|---|
| `GET /metrics` | Prometheus text exposition |
| `GET /health` | JSON health check (200 = healthy, 503 = degraded) |
### Health check response
```json
{
"status": "ok",
"last_collection_time": 1712345678.0,
"last_collection_duration_seconds": 4.23,
"last_vm_count": 3412,
"last_error": null,
"collection_cycles": 42,
"stale": false
}
```
## Scaling notes
| Environment size | Recommended settings |
|---|---|
| ≤ 500 VMs | `BATCH_SIZE=100`, `BATCH_DELAY=0.5` |
| 5005 000 VMs | `BATCH_SIZE=150`, `BATCH_DELAY=0.5` |
| 5 00015 000 VMs | `BATCH_SIZE=200`, `BATCH_DELAY=1.0` |
Collection of 15 000 VMs at `BATCH_SIZE=200` with `BATCH_DELAY=1.0` takes
roughly 75 batches × ~1 s each = **~75 seconds**, comfortably within the
300-second poll window.
## vCenter permissions
The service account only needs the read-only role on the vCenter root:
- `System.View`
- `Performance.ModifyIntervals` (read-only — needed to query counter definitions)
A standard **Read-Only** vCenter role is sufficient.
## Architecture
```
┌────────────────────────────────────────────────────────────┐
│ server.py (main thread) │
│ HTTPServer /metrics /health │
└────────────────────────┬───────────────────────────────────┘
│ reads from
┌────────────────────────▼───────────────────────────────────┐
│ collector.MetricStore (thread-safe dict) │
└────────────────────────▲───────────────────────────────────┘
│ writes to
┌────────────────────────┴───────────────────────────────────┐
│ collector.VCenterCollector (daemon thread) │
│ ┌──────────────────────────────────────────────────┐ │
│ │ every POLL_INTERVAL seconds: │ │
│ │ 1. ensure vCenter session alive │ │
│ │ 2. refresh VM inventory (if TTL expired) │ │
│ │ 3. for each batch of BATCH_SIZE VMs: │ │
│ │ QueryPerf → parse → MetricStore.update │ │
│ │ sleep(BATCH_DELAY) │ │
│ └──────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────┘
```