# zROC Planner — vCenter Metrics Collector
A Python-based Prometheus exporter that replaces zPlanner's PowerCLI scripts.
It queries vCenter for per-VM virtual-disk I/O statistics using the pyvmomi SDK
and exposes them on a `/metrics` endpoint for Prometheus to scrape.
## Metrics exposed
| Prometheus metric | Unit | Description |
|---|---|---|
| `vcenter_vm_disk_write_iops` | IOPS | Write IOPS (sum across all disk instances) |
| `vcenter_vm_disk_write_throughput_mbps` | MB/s | Write throughput (sum across all disk instances) |
| `vcenter_vm_disk_write_latency_ms` | ms | Write latency (mean across all disk instances) |
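The aggregation rules in the table (sum for IOPS and throughput, mean for latency) can be illustrated with a minimal sketch. The function name and the per-disk dictionary keys below are hypothetical and are not taken from the collector's source; the sketch also assumes the values have already been converted to the units shown in the table.

```python
from statistics import fmean


def aggregate_write_stats(per_disk):
    """Collapse per-disk-instance samples into one value per metric.

    `per_disk` is a list of dicts with the hypothetical keys
    'iops', 'mbps' and 'latency_ms', one dict per virtual disk.
    """
    if not per_disk:
        return None  # no disk instances reported for this VM
    return {
        # IOPS and throughput are additive across disks.
        "vcenter_vm_disk_write_iops": sum(d["iops"] for d in per_disk),
        "vcenter_vm_disk_write_throughput_mbps": sum(d["mbps"] for d in per_disk),
        # Latency is averaged, not summed.
        "vcenter_vm_disk_write_latency_ms": fmean([d["latency_ms"] for d in per_disk]),
    }


disks = [
    {"iops": 100, "mbps": 4.0, "latency_ms": 2.0},
    {"iops": 50, "mbps": 1.0, "latency_ms": 4.0},
]
result = aggregate_write_stats(disks)
```

Note that an unweighted mean treats a mostly idle disk the same as a busy one; whether the collector weights latency by IOPS is not stated here.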
Every metric carries these labels:
| Label | Example | Notes |
|---|---|---|
| `vm_name` | `web-prod-01` | VM display name |
| `vm_moref` | `vm-1234` | vCenter Managed Object Reference (stable ID) |
| `cluster` | `Cluster-01` | Compute cluster name |
| `host` | `esxi-01.corp` | ESXi host the VM is running on |
| `datacenter` | `DC-East` | vCenter datacenter name |
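To show what a scraped sample looks like with these labels attached, here is a small sketch that renders one line of Prometheus text exposition. The helper names and the label ordering are assumptions made for illustration; the collector may build its output differently (for example through a client library).

```python
# Label order is an illustrative choice; Prometheus does not require one.
LABEL_ORDER = ("vm_name", "vm_moref", "cluster", "host", "datacenter")


def escape_label_value(value):
    """Escape backslash, newline and double quote per the text format."""
    return value.replace("\\", "\\\\").replace("\n", "\\n").replace('"', '\\"')


def render_sample(metric, labels, value):
    """Return one exposition line: metric{label="value",...} value"""
    pairs = ",".join(
        f'{name}="{escape_label_value(labels[name])}"' for name in LABEL_ORDER
    )
    return f"{metric}{{{pairs}}} {value}"


line = render_sample(
    "vcenter_vm_disk_write_iops",
    {
        "vm_name": "web-prod-01",
        "vm_moref": "vm-1234",
        "cluster": "Cluster-01",
        "host": "esxi-01.corp",
        "datacenter": "DC-East",
    },
    152.0,
)
print(line)
```

Because `vm_moref` is the stable identifier, it is the safer label to join or group on when a VM is renamed.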
### Self-monitoring metrics
| Metric | Description |
|---|---|
| `vcenter_collector_last_collection_timestamp_seconds` | Unix timestamp of the last successful poll |
| `vcenter_collector_last_collection_duration_seconds` | How long the last poll took |
| `vcenter_collector_last_vm_count` | VMs collected in the last cycle |
| `vcenter_collector_cycles_total` | Running count of completed cycles |
## Configuration
All settings are environment variables.
| Variable | Default | Description |
|---|---|---|
| `VCENTER_HOST` | `vcenter.local` | vCenter hostname or IP |
| `VCENTER_USER` | `administrator@vsphere.local` | vCenter username (read-only is sufficient) |
| `VCENTER_PASSWORD` | _(required)_ | vCenter password |
| `VCENTER_PORT` | `443` | vCenter HTTPS port |
| `VCENTER_SSL_VERIFY` | `false` | Set `true` to enforce TLS certificate validation |
| `POLL_INTERVAL` | `300` | Seconds between collection cycles |
| `BATCH_SIZE` | `100` | VMs per QueryPerf call (VMware recommends ≤ 200) |
| `BATCH_DELAY` | `0.5` | Seconds to sleep between batches |
| `VM_INVENTORY_TTL` | `600` | Seconds between VM inventory refreshes |
| `PERF_INTERVAL_ID` | `300` | vCenter rollup interval (300 = 5-minute stats) |
| `HTTP_HOST` | `0.0.0.0` | IP the HTTP server binds to |
| `HTTP_PORT` | `9272` | Port for `/metrics` and `/health` |
| `LOG_LEVEL` | `INFO` | Python log level (`DEBUG`, `INFO`, `WARNING`, `ERROR`) |
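As an illustration of how these variables and defaults fit together, the sketch below loads them into a typed settings object. The `Settings` class and `load_settings` function are hypothetical names, and the set of strings accepted as "true" is an assumption; only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    vcenter_host: str
    vcenter_user: str
    vcenter_password: str
    vcenter_port: int
    vcenter_ssl_verify: bool
    poll_interval: int
    batch_size: int
    batch_delay: float
    vm_inventory_ttl: int
    perf_interval_id: int
    http_host: str
    http_port: int
    log_level: str


def _as_bool(raw):
    # Assumption: these spellings count as true; the table only documents `true`.
    return raw.strip().lower() in ("1", "true", "yes", "on")


def load_settings(env=None):
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    password = env.get("VCENTER_PASSWORD")
    if not password:
        raise ValueError("VCENTER_PASSWORD is required")
    return Settings(
        vcenter_host=env.get("VCENTER_HOST", "vcenter.local"),
        vcenter_user=env.get("VCENTER_USER", "administrator@vsphere.local"),
        vcenter_password=password,
        vcenter_port=int(env.get("VCENTER_PORT", "443")),
        vcenter_ssl_verify=_as_bool(env.get("VCENTER_SSL_VERIFY", "false")),
        poll_interval=int(env.get("POLL_INTERVAL", "300")),
        batch_size=int(env.get("BATCH_SIZE", "100")),
        batch_delay=float(env.get("BATCH_DELAY", "0.5")),
        vm_inventory_ttl=int(env.get("VM_INVENTORY_TTL", "600")),
        perf_interval_id=int(env.get("PERF_INTERVAL_ID", "300")),
        http_host=env.get("HTTP_HOST", "0.0.0.0"),
        http_port=int(env.get("HTTP_PORT", "9272")),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )


settings = load_settings({"VCENTER_PASSWORD": "example", "BATCH_SIZE": "150"})
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test without touching the real environment.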
## Running
### Docker (recommended)
```bash
docker build -t zroc-planner:latest .
docker run -d \
  --name zroc-planner \
  -p 9272:9272 \
  -e VCENTER_HOST=vcenter.corp.example \
  -e VCENTER_USER=svc-readonly@vsphere.local \
  -e VCENTER_PASSWORD=supersecret \
  -e POLL_INTERVAL=300 \
  -e BATCH_SIZE=100 \
  zroc-planner:latest
```
### Docker Compose
Add to your existing `docker-compose.yaml`:
```yaml
zroc-planner:
  build: ./zroc-planner
  container_name: zroc-planner
  hostname: zroc-planner
  ports:
    - "9272:9272"
  environment:
    - VCENTER_HOST=vcenter.corp.example
    - VCENTER_USER=svc-readonly@vsphere.local
    - VCENTER_PASSWORD=supersecret
    - POLL_INTERVAL=300
    - BATCH_SIZE=100
    - BATCH_DELAY=0.5
    - VM_INVENTORY_TTL=600
    - LOG_LEVEL=INFO
  networks:
    - back-tier
  restart: always
```
Then add a scrape job to `prometheus/prometheus.yml`:
```yaml
scrape_configs:
  - job_name: vcenter_planner
    scrape_interval: 300s
    scrape_timeout: 30s
    static_configs:
      - targets: ['zroc-planner:9272']
```
### Local (dev)
```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
export VCENTER_HOST=vcenter.local
export VCENTER_USER=administrator@vsphere.local
export VCENTER_PASSWORD=password
python server.py
```
## Endpoints
| Path | Description |
|---|---|
| `GET /metrics` | Prometheus text exposition |
| `GET /health` | JSON health check (200 = healthy, 503 = degraded) |
### Health check response
```json
{
  "status": "ok",
  "last_collection_time": 1712345678.0,
  "last_collection_duration_seconds": 4.23,
  "last_vm_count": 3412,
  "last_error": null,
  "collection_cycles": 42,
  "stale": false
}
```
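The mapping from collector state to the 200/503 status code could look roughly like the sketch below. Everything beyond the field names is an assumption: the function name is hypothetical, the staleness threshold of two poll intervals is a guess, and so are the `"degraded"` status string and the rule that a recorded `last_error` makes the service unhealthy.

```python
import json
import time


def build_health(state, poll_interval, now=None):
    """Return (http_status, json_body) for a /health request."""
    now = time.time() if now is None else now
    last = state.get("last_collection_time")
    # Assumption: no successful poll within two intervals counts as stale.
    stale = last is None or (now - last) > 2 * poll_interval
    # Assumption: any recorded error also marks the service as degraded.
    healthy = not stale and state.get("last_error") is None
    body = dict(state, status="ok" if healthy else "degraded", stale=stale)
    return (200 if healthy else 503), json.dumps(body)


state = {
    "last_collection_time": 1712345678.0,
    "last_collection_duration_seconds": 4.23,
    "last_vm_count": 3412,
    "last_error": None,
    "collection_cycles": 42,
}
fresh_code, fresh_body = build_health(state, 300, now=1712345700.0)
stale_code, stale_body = build_health(state, 300, now=1712345678.0 + 601)
```

Accepting `now` as a parameter makes the staleness logic testable without waiting for real time to pass.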
## Scaling notes
| Environment size | Recommended settings |
|---|---|
| ≤ 500 VMs | `BATCH_SIZE=100`, `BATCH_DELAY=0.5` |
| 500 to 5 000 VMs | `BATCH_SIZE=150`, `BATCH_DELAY=0.5` |
| 5 000 to 15 000 VMs | `BATCH_SIZE=200`, `BATCH_DELAY=1.0` |
Collecting 15 000 VMs at `BATCH_SIZE=200` takes 15 000 / 200 = 75 batches.
With `BATCH_DELAY=1.0`, that is roughly 75 batches × ~1 s each = **~75 seconds**,
comfortably within the 300-second poll window.
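The same arithmetic can be applied to other environment sizes with a small helper. This is a rough estimator only: the function name is hypothetical, and the `query_seconds` parameter is an assumed per-batch QueryPerf round-trip time that has to be measured in the target environment.

```python
import math


def estimate_cycle_seconds(vm_count, batch_size, batch_delay, query_seconds=0.0):
    """Return (batches, estimated seconds) for one collection cycle."""
    batches = math.ceil(vm_count / batch_size)
    return batches, batches * (batch_delay + query_seconds)


# The README's example: 15 000 VMs, BATCH_SIZE=200, BATCH_DELAY=1.0
batches, seconds = estimate_cycle_seconds(15000, 200, 1.0)

# Same environment if each QueryPerf call took an assumed extra 0.5 s
_, slower = estimate_cycle_seconds(15000, 200, 1.0, query_seconds=0.5)
```

If the estimate approaches `POLL_INTERVAL`, raising `BATCH_SIZE` (up to the limit noted under Configuration) or lowering `BATCH_DELAY` brings it back down.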
## vCenter permissions
The service account needs only read-only privileges, assigned at the vCenter root:
- `System.View`
- `Performance.ModifyIntervals` (read-only — needed to query counter definitions)
A standard **Read-Only** vCenter role is sufficient.
## Architecture
```
┌────────────────────────────────────────────────────────────┐
│  server.py (main thread)                                   │
│  HTTPServer  /metrics  /health                             │
└────────────────────────┬───────────────────────────────────┘
                         │ reads from
┌────────────────────────▼───────────────────────────────────┐
│  collector.MetricStore (thread-safe dict)                  │
└────────────────────────▲───────────────────────────────────┘
                         │ writes to
┌────────────────────────┴───────────────────────────────────┐
│  collector.VCenterCollector (daemon thread)                │
│  ┌──────────────────────────────────────────────────┐      │
│  │  every POLL_INTERVAL seconds:                    │      │
│  │    1. ensure vCenter session alive               │      │
│  │    2. refresh VM inventory (if TTL expired)      │      │
│  │    3. for each batch of BATCH_SIZE VMs:          │      │
│  │       QueryPerf → parse → MetricStore.update     │      │
│  │       sleep(BATCH_DELAY)                         │      │
│  └──────────────────────────────────────────────────┘      │
└────────────────────────────────────────────────────────────┘
```
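The writer/reader split in the diagram can be sketched in a few lines. This is not the project's actual code: the class and function bodies are illustrative, and `query_batch` is a hypothetical stand-in for the QueryPerf call plus parsing, so the sketch runs without a vCenter.

```python
import threading
import time


class MetricStore:
    """Dict guarded by a lock: the collector writes, the HTTP handler reads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def update(self, samples):
        with self._lock:
            self._data.update(samples)

    def snapshot(self):
        # Return a copy so readers never iterate a dict that is being mutated.
        with self._lock:
            return dict(self._data)


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def collect_once(vms, query_batch, store, batch_size, batch_delay, sleep=time.sleep):
    """Run step 3 of the loop once and return the number of batches processed."""
    count = 0
    for batch in batched(vms, batch_size):
        store.update(query_batch(batch))  # QueryPerf -> parse -> MetricStore.update
        count += 1
        sleep(batch_delay)
    return count


def fake_query(batch):
    # Stand-in for QueryPerf: one fixed value per VM.
    return {vm: 1.0 for vm in batch}


store = MetricStore()
vms = [f"vm-{i}" for i in range(5)]
batches_run = collect_once(vms, fake_query, store, batch_size=2, batch_delay=0,
                           sleep=lambda seconds: None)
```

In a running service, a loop around `collect_once` would sit in a `threading.Thread(..., daemon=True)`, matching the daemon thread in the diagram, while the HTTP handler only ever calls `snapshot()`.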