zroc/zroc-planner/README.md

# zROC Planner — vCenter Metrics Collector

A Python-based Prometheus exporter that replaces zPlanner's PowerCLI scripts.
It queries vCenter for per-VM virtual-disk I/O statistics using the pyvmomi SDK
and exposes them on a `/metrics` endpoint for Prometheus to scrape.

## Metrics exposed

| Prometheus metric | Unit | Description |
|---|---|---|
| `vcenter_vm_disk_write_iops` | IOPS | Write IOPS (sum across all disk instances) |
| `vcenter_vm_disk_write_throughput_mbps` | MB/s | Write throughput (sum across all disk instances) |
| `vcenter_vm_disk_write_latency_ms` | ms | Write latency (mean across all disk instances) |

Every metric carries these labels:

| Label | Example | Notes |
|---|---|---|
| `vm_name` | `web-prod-01` | VM display name |
| `vm_moref` | `vm-1234` | vCenter Managed Object Reference (stable ID) |
| `cluster` | `Cluster-01` | Compute cluster name |
| `host` | `esxi-01.corp` | ESXi host the VM is running on |
| `datacenter` | `DC-East` | vCenter datacenter name |

### Self-monitoring metrics

| Metric | Description |
|---|---|
| `vcenter_collector_last_collection_timestamp_seconds` | Unix timestamp of the last successful poll |
| `vcenter_collector_last_collection_duration_seconds` | How long the last poll took |
| `vcenter_collector_last_vm_count` | VMs collected in the last cycle |
| `vcenter_collector_cycles_total` | Running count of completed cycles |

## Configuration

All settings are environment variables.

| Variable | Default | Description |
|---|---|---|
| `VCENTER_HOST` | `vcenter.local` | vCenter hostname or IP |
| `VCENTER_USER` | `administrator@vsphere.local` | vCenter username (read-only is sufficient) |
| `VCENTER_PASSWORD` | _(required)_ | vCenter password |
| `VCENTER_PORT` | `443` | vCenter HTTPS port |
| `VCENTER_SSL_VERIFY` | `false` | Set `true` to enforce TLS certificate validation |
| `POLL_INTERVAL` | `300` | Seconds between collection cycles |
| `BATCH_SIZE` | `100` | VMs per QueryPerf call (VMware recommends ≤ 200) |
| `BATCH_DELAY` | `0.5` | Seconds to sleep between batches |
| `VM_INVENTORY_TTL` | `600` | Seconds between VM inventory refreshes |
| `PERF_INTERVAL_ID` | `300` | vCenter rollup interval (300 = 5-minute stats) |
| `HTTP_HOST` | `0.0.0.0` | IP the HTTP server binds to |
| `HTTP_PORT` | `9272` | Port for `/metrics` and `/health` |
| `LOG_LEVEL` | `INFO` | Python log level (`DEBUG`, `INFO`, `WARNING`, `ERROR`) |

## Running

### Docker (recommended)

```bash
docker build -t zroc-planner:latest .

docker run -d \
  --name zroc-planner \
  -p 9272:9272 \
  -e VCENTER_HOST=vcenter.corp.example \
  -e VCENTER_USER=svc-readonly@vsphere.local \
  -e VCENTER_PASSWORD=supersecret \
  -e POLL_INTERVAL=300 \
  -e BATCH_SIZE=100 \
  zroc-planner:latest
```

### Docker Compose

Add to your existing `docker-compose.yaml`:

```yaml
  zroc-planner:
    build: ./zroc-planner
    container_name: zroc-planner
    hostname: zroc-planner
    ports:
      - "9272:9272"
    environment:
      - VCENTER_HOST=vcenter.corp.example
      - VCENTER_USER=svc-readonly@vsphere.local
      - VCENTER_PASSWORD=supersecret
      - POLL_INTERVAL=300
      - BATCH_SIZE=100
      - BATCH_DELAY=0.5
      - VM_INVENTORY_TTL=600
      - LOG_LEVEL=INFO
    networks:
      - back-tier
    restart: always
```

Then add a scrape job to `prometheus/prometheus.yml`:

```yaml
scrape_configs:
  - job_name: vcenter_planner
    scrape_interval: 300s
    scrape_timeout: 30s
    static_configs:
      - targets: ['zroc-planner:9272']
```

### Local (dev)

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

export VCENTER_HOST=vcenter.local
export VCENTER_USER=administrator@vsphere.local
export VCENTER_PASSWORD=password

python server.py
```

## Endpoints

| Path | Description |
|---|---|
| `GET /metrics` | Prometheus text exposition |
| `GET /health` | JSON health check (200 = healthy, 503 = degraded) |

### Health check response

```json
{
  "status": "ok",
  "last_collection_time": 1712345678.0,
  "last_collection_duration_seconds": 4.23,
  "last_vm_count": 3412,
  "last_error": null,
  "collection_cycles": 42,
  "stale": false
}
```

## Scaling notes

| Environment size | Recommended settings |
|---|---|
| ≤ 500 VMs | `BATCH_SIZE=100`, `BATCH_DELAY=0.5` |
| 500–5 000 VMs | `BATCH_SIZE=150`, `BATCH_DELAY=0.5` |
| 5 000–15 000 VMs | `BATCH_SIZE=200`, `BATCH_DELAY=1.0` |

Collection of 15 000 VMs at `BATCH_SIZE=200` with `BATCH_DELAY=1.0` takes
roughly 75 batches × ~1 s each = **~75 seconds**, comfortably within the
300-second poll window.

## vCenter permissions

The service account only needs the read-only role on the vCenter root:

- `System.View`
- `Performance.ModifyIntervals` (read-only — needed to query counter definitions)

A standard **Read-Only** vCenter role is sufficient.

## Architecture

```
┌────────────────────────────────────────────────────────────┐
│  server.py (main thread)                                   │
│    HTTPServer  /metrics  /health                           │
└────────────────────────┬───────────────────────────────────┘
                         │  reads from
┌────────────────────────▼───────────────────────────────────┐
│  collector.MetricStore  (thread-safe dict)                 │
└────────────────────────▲───────────────────────────────────┘
                         │  writes to
┌────────────────────────┴───────────────────────────────────┐
│  collector.VCenterCollector  (daemon thread)               │
│    ┌──────────────────────────────────────────────────┐   │
│    │  every POLL_INTERVAL seconds:                    │   │
│    │    1. ensure vCenter session alive               │   │
│    │    2. refresh VM inventory (if TTL expired)      │   │
│    │    3. for each batch of BATCH_SIZE VMs:          │   │
│    │         QueryPerf → parse → MetricStore.update   │   │
│    │         sleep(BATCH_DELAY)                       │   │
│    └──────────────────────────────────────────────────┘   │
└────────────────────────────────────────────────────────────┘
```