mirror of
https://github.com/recklessop/zroc.git
synced 2026-07-03 13:23:15 -04:00
796bafac63
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
188 lines
6.9 KiB
Markdown
188 lines
6.9 KiB
Markdown
# zROC Planner — vCenter Metrics Collector
|
||
|
||
A Python-based Prometheus exporter that replaces zPlanner's PowerCLI scripts.
|
||
It queries vCenter for per-VM virtual-disk I/O statistics using the pyvmomi SDK
|
||
and exposes them on a `/metrics` endpoint for Prometheus to scrape.
|
||
|
||
## Metrics exposed
|
||
|
||
| Prometheus metric | Unit | Description |
|
||
|---|---|---|
|
||
| `vcenter_vm_disk_write_iops` | IOPS | Write IOPS (sum across all disk instances) |
|
||
| `vcenter_vm_disk_write_throughput_mbps` | MB/s | Write throughput (sum across all disk instances) |
|
||
| `vcenter_vm_disk_write_latency_ms` | ms | Write latency (mean across all disk instances) |
|
||
|
||
Every metric carries these labels:
|
||
|
||
| Label | Example | Notes |
|
||
|---|---|---|
|
||
| `vm_name` | `web-prod-01` | VM display name |
|
||
| `vm_moref` | `vm-1234` | vCenter Managed Object Reference (stable ID) |
|
||
| `cluster` | `Cluster-01` | Compute cluster name |
|
||
| `host` | `esxi-01.corp` | ESXi host the VM is running on |
|
||
| `datacenter` | `DC-East` | vCenter datacenter name |
|
||
|
||
### Self-monitoring metrics
|
||
|
||
| Metric | Description |
|
||
|---|---|
|
||
| `vcenter_collector_last_collection_timestamp_seconds` | Unix timestamp of the last successful poll |
|
||
| `vcenter_collector_last_collection_duration_seconds` | How long the last poll took |
|
||
| `vcenter_collector_last_vm_count` | VMs collected in the last cycle |
|
||
| `vcenter_collector_cycles_total` | Running count of completed cycles |
|
||
|
||
## Configuration
|
||
|
||
All settings are environment variables.
|
||
|
||
| Variable | Default | Description |
|
||
|---|---|---|
|
||
| `VCENTER_HOST` | `vcenter.local` | vCenter hostname or IP |
|
||
| `VCENTER_USER` | `administrator@vsphere.local` | vCenter username (read-only is sufficient) |
|
||
| `VCENTER_PASSWORD` | _(required)_ | vCenter password |
|
||
| `VCENTER_PORT` | `443` | vCenter HTTPS port |
|
||
| `VCENTER_SSL_VERIFY` | `false` | Set `true` to enforce TLS certificate validation |
|
||
| `POLL_INTERVAL` | `300` | Seconds between collection cycles |
|
||
| `BATCH_SIZE` | `100` | VMs per QueryPerf call (VMware recommends ≤ 200) |
|
||
| `BATCH_DELAY` | `0.5` | Seconds to sleep between batches |
|
||
| `VM_INVENTORY_TTL` | `600` | Seconds between VM inventory refreshes |
|
||
| `PERF_INTERVAL_ID` | `300` | vCenter rollup interval (300 = 5-minute stats) |
|
||
| `HTTP_HOST` | `0.0.0.0` | IP the HTTP server binds to |
|
||
| `HTTP_PORT` | `9272` | Port for `/metrics` and `/health` |
|
||
| `LOG_LEVEL` | `INFO` | Python log level (`DEBUG`, `INFO`, `WARNING`, `ERROR`) |
|
||
|
||
## Running
|
||
|
||
### Docker (recommended)
|
||
|
||
```bash
|
||
docker build -t zroc-planner:latest .
|
||
|
||
docker run -d \
|
||
--name zroc-planner \
|
||
-p 9272:9272 \
|
||
-e VCENTER_HOST=vcenter.corp.example \
|
||
-e VCENTER_USER=svc-readonly@vsphere.local \
|
||
-e VCENTER_PASSWORD=supersecret \
|
||
-e POLL_INTERVAL=300 \
|
||
-e BATCH_SIZE=100 \
|
||
zroc-planner:latest
|
||
```
|
||
|
||
### Docker Compose
|
||
|
||
Add to your existing `docker-compose.yaml`:
|
||
|
||
```yaml
|
||
zroc-planner:
|
||
build: ./zroc-planner
|
||
container_name: zroc-planner
|
||
hostname: zroc-planner
|
||
ports:
|
||
- "9272:9272"
|
||
environment:
|
||
- VCENTER_HOST=vcenter.corp.example
|
||
- VCENTER_USER=svc-readonly@vsphere.local
|
||
- VCENTER_PASSWORD=supersecret
|
||
- POLL_INTERVAL=300
|
||
- BATCH_SIZE=100
|
||
- BATCH_DELAY=0.5
|
||
- VM_INVENTORY_TTL=600
|
||
- LOG_LEVEL=INFO
|
||
networks:
|
||
- back-tier
|
||
restart: always
|
||
```
|
||
|
||
Then add a scrape job to `prometheus/prometheus.yml`:
|
||
|
||
```yaml
|
||
scrape_configs:
|
||
- job_name: vcenter_planner
|
||
scrape_interval: 300s
|
||
scrape_timeout: 30s
|
||
static_configs:
|
||
- targets: ['zroc-planner:9272']
|
||
```
|
||
|
||
### Local (dev)
|
||
|
||
```bash
|
||
python -m venv venv
|
||
source venv/bin/activate
|
||
pip install -r requirements.txt
|
||
|
||
export VCENTER_HOST=vcenter.local
|
||
export VCENTER_USER=administrator@vsphere.local
|
||
export VCENTER_PASSWORD=password
|
||
|
||
python server.py
|
||
```
|
||
|
||
## Endpoints
|
||
|
||
| Path | Description |
|
||
|---|---|
|
||
| `GET /metrics` | Prometheus text exposition |
|
||
| `GET /health` | JSON health check (200 = healthy, 503 = degraded) |
|
||
|
||
### Health check response
|
||
|
||
```json
|
||
{
|
||
"status": "ok",
|
||
"last_collection_time": 1712345678.0,
|
||
"last_collection_duration_seconds": 4.23,
|
||
"last_vm_count": 3412,
|
||
"last_error": null,
|
||
"collection_cycles": 42,
|
||
"stale": false
|
||
}
|
||
```
|
||
|
||
## Scaling notes
|
||
|
||
| Environment size | Recommended settings |
|
||
|---|---|
|
||
| ≤ 500 VMs | `BATCH_SIZE=100`, `BATCH_DELAY=0.5` |
|
||
| 500–5 000 VMs | `BATCH_SIZE=150`, `BATCH_DELAY=0.5` |
|
||
| 5 000–15 000 VMs | `BATCH_SIZE=200`, `BATCH_DELAY=1.0` |
|
||
|
||
Collection of 15 000 VMs at `BATCH_SIZE=200` with `BATCH_DELAY=1.0` takes
|
||
roughly 75 batches × ~1 s each = **~75 seconds**, comfortably within the
|
||
300-second poll window.
|
||
|
||
## vCenter permissions
|
||
|
||
The service account only needs the read-only role on the vCenter root:
|
||
|
||
- `System.View`
|
||
- `Performance.ModifyIntervals` (read-only — needed to query counter definitions)
|
||
|
||
A standard **Read-Only** vCenter role is sufficient.
|
||
|
||
## Architecture
|
||
|
||
```
|
||
┌────────────────────────────────────────────────────────────┐
|
||
│ server.py (main thread) │
|
||
│ HTTPServer /metrics /health │
|
||
└────────────────────────┬───────────────────────────────────┘
|
||
│ reads from
|
||
┌────────────────────────▼───────────────────────────────────┐
|
||
│ collector.MetricStore (thread-safe dict) │
|
||
└────────────────────────▲───────────────────────────────────┘
|
||
│ writes to
|
||
┌────────────────────────┴───────────────────────────────────┐
|
||
│ collector.VCenterCollector (daemon thread) │
|
||
│ ┌──────────────────────────────────────────────────┐ │
|
||
│ │ every POLL_INTERVAL seconds: │ │
|
||
│ │ 1. ensure vCenter session alive │ │
|
||
│ │ 2. refresh VM inventory (if TTL expired) │ │
|
||
│ │ 3. for each batch of BATCH_SIZE VMs: │ │
|
||
│ │ QueryPerf → parse → MetricStore.update │ │
|
||
│ │ sleep(BATCH_DELAY) │ │
|
||
│ └──────────────────────────────────────────────────┘ │
|
||
└────────────────────────────────────────────────────────────┘
|
||
```
|