zROC Planner — vCenter Metrics Collector

A Python-based Prometheus exporter that replaces zPlanner's PowerCLI scripts. It queries vCenter for per-VM virtual-disk I/O statistics using the pyvmomi SDK and exposes them on a /metrics endpoint for Prometheus to scrape.

Metrics exposed

Prometheus metric                      Unit  Description
vcenter_vm_disk_write_iops             IOPS  Write IOPS (sum across all disk instances)
vcenter_vm_disk_write_throughput_mbps  MB/s  Write throughput (sum across all disk instances)
vcenter_vm_disk_write_latency_ms       ms    Write latency (mean across all disk instances)

Every metric carries these labels:

Label       Example       Notes
vm_name     web-prod-01   VM display name
vm_moref    vm-1234       vCenter Managed Object Reference (stable ID)
cluster     Cluster-01    Compute cluster name
host        esxi-01.corp  ESXi host the VM is running on
datacenter  DC-East       vCenter datacenter name
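
For illustration, one sample carrying these labels could be rendered in the Prometheus text exposition format roughly as follows. This is a minimal sketch, not the exporter's actual code: `format_sample` and `escape_label_value` are hypothetical helpers, and the value 125.0 is made up for the example.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name: str, labels: dict, value: float) -> str:
    """Render one sample line: name{label="value",...} value."""
    rendered = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in labels.items()
    )
    return f"{name}{{{rendered}}} {value}"


# Label values taken from the examples in the table above.
labels = {
    "vm_name": "web-prod-01",
    "vm_moref": "vm-1234",
    "cluster": "Cluster-01",
    "host": "esxi-01.corp",
    "datacenter": "DC-East",
}
line = format_sample("vcenter_vm_disk_write_iops", labels, 125.0)  # 125.0 is illustrative
print(line)
```

Because vm_moref is the stable identifier, it is usually the safer label to join or group on when a VM may be renamed.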

Self-monitoring metrics

Metric                                               Description
vcenter_collector_last_collection_timestamp_seconds  Unix timestamp of the last successful poll
vcenter_collector_last_collection_duration_seconds   How long the last poll took
vcenter_collector_last_vm_count                      VMs collected in the last cycle
vcenter_collector_cycles_total                       Running count of completed cycles

Configuration

All settings are environment variables.

Variable            Default                      Description
VCENTER_HOST        vcenter.local                vCenter hostname or IP
VCENTER_USER        administrator@vsphere.local  vCenter username (read-only is sufficient)
VCENTER_PASSWORD    (required)                   vCenter password
VCENTER_PORT        443                          vCenter HTTPS port
VCENTER_SSL_VERIFY  false                        Set true to enforce TLS certificate validation
POLL_INTERVAL       300                          Seconds between collection cycles
BATCH_SIZE          100                          VMs per QueryPerf call (VMware recommends ≤ 200)
BATCH_DELAY         0.5                          Seconds to sleep between batches
VM_INVENTORY_TTL    600                          Seconds between VM inventory refreshes
PERF_INTERVAL_ID    300                          vCenter rollup interval (300 = 5-minute stats)
HTTP_HOST           0.0.0.0                      IP the HTTP server binds to
HTTP_PORT           9272                         Port for /metrics and /health
LOG_LEVEL           INFO                         Python log level (DEBUG, INFO, WARNING, ERROR)
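
As a rough sketch of how these variables might be read, the snippet below loads a subset of them with the documented defaults and fails fast when the password is missing. The `Settings` class and `load_settings` function are hypothetical names for illustration; the real exporter may parse its configuration differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    vcenter_host: str
    vcenter_user: str
    vcenter_password: str
    vcenter_port: int
    ssl_verify: bool
    poll_interval: int
    batch_size: int
    batch_delay: float
    http_port: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (os.environ by default) using the documented defaults."""
    env = os.environ if env is None else env
    password = env.get("VCENTER_PASSWORD")
    if not password:
        # VCENTER_PASSWORD has no default, so refuse to start without it.
        raise ValueError("VCENTER_PASSWORD is required")
    return Settings(
        vcenter_host=env.get("VCENTER_HOST", "vcenter.local"),
        vcenter_user=env.get("VCENTER_USER", "administrator@vsphere.local"),
        vcenter_password=password,
        vcenter_port=int(env.get("VCENTER_PORT", "443")),
        # Assumption: only the literal string "true" (any case) enables verification.
        ssl_verify=env.get("VCENTER_SSL_VERIFY", "false").strip().lower() == "true",
        poll_interval=int(env.get("POLL_INTERVAL", "300")),
        batch_size=int(env.get("BATCH_SIZE", "100")),
        batch_delay=float(env.get("BATCH_DELAY", "0.5")),
        http_port=int(env.get("HTTP_PORT", "9272")),
    )


# Passing a plain dict keeps the example independent of the real environment.
settings = load_settings({"VCENTER_PASSWORD": "example", "BATCH_SIZE": "150"})
print(settings.batch_size, settings.poll_interval, settings.ssl_verify)
```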

Running

docker build -t zroc-planner:latest .

docker run -d \
  --name zroc-planner \
  -p 9272:9272 \
  -e VCENTER_HOST=vcenter.corp.example \
  -e VCENTER_USER=svc-readonly@vsphere.local \
  -e VCENTER_PASSWORD=supersecret \
  -e POLL_INTERVAL=300 \
  -e BATCH_SIZE=100 \
  zroc-planner:latest

Docker Compose

Add to your existing docker-compose.yaml:

  zroc-planner:
    build: ./zroc-planner
    container_name: zroc-planner
    hostname: zroc-planner
    ports:
      - "9272:9272"
    environment:
      - VCENTER_HOST=vcenter.corp.example
      - VCENTER_USER=svc-readonly@vsphere.local
      - VCENTER_PASSWORD=supersecret
      - POLL_INTERVAL=300
      - BATCH_SIZE=100
      - BATCH_DELAY=0.5
      - VM_INVENTORY_TTL=600
      - LOG_LEVEL=INFO
    networks:
      - back-tier
    restart: always

Then add a scrape job to prometheus/prometheus.yml:

scrape_configs:
  - job_name: vcenter_planner
    scrape_interval: 300s
    scrape_timeout: 30s
    static_configs:
      - targets: ['zroc-planner:9272']

Local (dev)

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

export VCENTER_HOST=vcenter.local
export VCENTER_USER=administrator@vsphere.local
export VCENTER_PASSWORD=password

python server.py

Endpoints

Path          Description
GET /metrics  Prometheus text exposition
GET /health   JSON health check (200 = healthy, 503 = degraded)

Health check response

{
  "status": "ok",
  "last_collection_time": 1712345678.0,
  "last_collection_duration_seconds": 4.23,
  "last_vm_count": 3412,
  "last_error": null,
  "collection_cycles": 42,
  "stale": false
}
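
The rule that decides between 200 and 503 is not spelled out here. The sketch below shows one plausible way to derive it, assuming the data counts as stale once no collection has succeeded within twice POLL_INTERVAL. The `evaluate_health` function, the `stale_factor` parameter, and the "degraded" status string are assumptions for illustration, not the exporter's confirmed behavior.

```python
import json
import time


def evaluate_health(last_collection_time, last_error, poll_interval,
                    now=None, stale_factor=2.0):
    """Return (http_status, body) for a /health response.

    Assumption: data is stale when the last successful collection is older
    than stale_factor * poll_interval seconds, or has never happened.
    """
    now = time.time() if now is None else now
    stale = (
        last_collection_time is None
        or now - last_collection_time > stale_factor * poll_interval
    )
    healthy = not stale and last_error is None
    body = {
        "status": "ok" if healthy else "degraded",
        "last_collection_time": last_collection_time,
        "last_error": last_error,
        "stale": stale,
    }
    return (200 if healthy else 503), body


# 22 seconds after the last poll, with a 300-second interval: still healthy.
status, body = evaluate_health(1712345678.0, None, poll_interval=300, now=1712345700.0)
print(status, json.dumps(body))
```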

Scaling notes

Environment size   Recommended settings
≤ 500 VMs          BATCH_SIZE=100, BATCH_DELAY=0.5
500–5 000 VMs      BATCH_SIZE=150, BATCH_DELAY=0.5
5 000–15 000 VMs   BATCH_SIZE=200, BATCH_DELAY=1.0

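Collecting 15 000 VMs at BATCH_SIZE=200 takes 15 000 / 200 = 75 batches. With BATCH_DELAY=1.0, that is roughly 75 batches × ~1 s each = ~75 seconds of delay, plus the QueryPerf time for each batch. The cycle stays within the 300-second poll window as long as a batch query takes less than about 3 seconds on average.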
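
The arithmetic can be checked for other environment sizes with a small helper. This is a back-of-the-envelope sketch: `estimate_cycle_seconds` is a hypothetical function, it assumes a delay follows every batch, and the 1-second `query_seconds` figure in the second call is an assumed value rather than a measurement.

```python
import math


def estimate_cycle_seconds(vm_count, batch_size, batch_delay, query_seconds=0.0):
    """Estimate one collection cycle as batches * (delay + per-batch query time)."""
    batches = math.ceil(vm_count / batch_size)
    return batches, batches * (batch_delay + query_seconds)


# Delay only: 75 batches with a 1-second pause each.
batches, delay_only = estimate_cycle_seconds(15_000, 200, 1.0)

# Same cycle if each QueryPerf call also took an assumed 1 second.
_, with_queries = estimate_cycle_seconds(15_000, 200, 1.0, query_seconds=1.0)

print(batches, delay_only, with_queries)
```

If the estimate approaches POLL_INTERVAL, lowering BATCH_DELAY or raising POLL_INTERVAL are the obvious knobs to adjust.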

vCenter permissions

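The service account needs only read access on the vCenter root. Querying performance data requires:

  • System.View

A standard Read-Only vCenter role, which includes System.View, is sufficient. Performance.ModifyIntervals controls changes to statistics collection intervals; it is not part of the Read-Only role and should not be needed to query counter definitions.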

Architecture

┌────────────────────────────────────────────────────────────┐
│  server.py (main thread)                                   │
│    HTTPServer  /metrics  /health                           │
└────────────────────────┬───────────────────────────────────┘
                         │  reads from
┌────────────────────────▼───────────────────────────────────┐
│  collector.MetricStore  (thread-safe dict)                 │
└────────────────────────▲───────────────────────────────────┘
                         │  writes to
┌────────────────────────┴───────────────────────────────────┐
│  collector.VCenterCollector  (daemon thread)               │
│    ┌──────────────────────────────────────────────────┐   │
│    │  every POLL_INTERVAL seconds:                    │   │
│    │    1. ensure vCenter session alive               │   │
│    │    2. refresh VM inventory (if TTL expired)      │   │
│    │    3. for each batch of BATCH_SIZE VMs:          │   │
│    │         QueryPerf → parse → MetricStore.update   │   │
│    │         sleep(BATCH_DELAY)                       │   │
│    └──────────────────────────────────────────────────┘   │
└────────────────────────────────────────────────────────────┘
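
The flow in the diagram can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the contents of collector.py: the `MetricStore` shown here is a minimal stand-in, `batched`, `collect_once`, and `fake_query` are hypothetical names, and `fake_query` replaces the real QueryPerf call so the example runs without a vCenter.

```python
import threading
import time
from typing import Dict, Iterator, List


class MetricStore:
    """Thread-safe mapping of VM identifier to its latest metrics."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: Dict[str, dict] = {}

    def update(self, samples: Dict[str, dict]) -> None:
        with self._lock:
            self._data.update(samples)

    def snapshot(self) -> Dict[str, dict]:
        # Copy under the lock so a reader never iterates a dict being written.
        with self._lock:
            return dict(self._data)


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def collect_once(vms, query_batch, store, batch_size, batch_delay, sleep=time.sleep):
    """Run one cycle: query each batch, store the results, pause between batches."""
    for batch in batched(vms, batch_size):
        store.update(query_batch(batch))
        sleep(batch_delay)


def fake_query(batch):
    # Stand-in for QueryPerf: returns a fixed, made-up value per VM.
    return {vm: {"write_iops": 1.0} for vm in batch}


store = MetricStore()
vms = [f"vm-{i}" for i in range(250)]
collect_once(vms, fake_query, store, batch_size=100, batch_delay=0.0)
print(len(store.snapshot()))
```

In the layout shown above, a loop like this would run in the daemon thread once per POLL_INTERVAL, while the HTTP handlers on the main thread only ever read from the store.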