Metrics¶
Observability follows RED (rate, errors, duration) and USE (utilization, saturation, errors) patterns. Metrics are Prometheus-style. Dashboard JSON and alert rule files are internal to the Astra repo (PRD §17).
Metric catalog¶
| Metric | Type | Description |
|---|---|---|
astra_task_latency_seconds |
Histogram | End-to-end task execution duration |
astra_task_success_total |
Counter | Tasks completed successfully |
astra_task_failure_total |
Counter | Tasks that failed or timed out |
astra_events_processed_total |
Counter | Events processed by the event loop |
astra_actor_count |
Gauge | Active actor / agent count |
astra_worker_heartbeat_total |
Counter | Worker heartbeat signals received |
astra_llm_token_usage_total |
Counter | Total LLM tokens consumed (label: model) |
astra_llm_cost_dollars |
Counter | Cumulative LLM cost in USD (label: model) |
astra_scheduler_ready_queue_depth |
Gauge | Tasks waiting in the ready queue (per shard) |
Alert thresholds¶
| Alert | Condition |
|---|---|
| High task failure rate | astra_task_failure_total / (astra_task_success_total + astra_task_failure_total) > 5% over 5 min |
| High queue depth | astra_scheduler_ready_queue_depth > 10000 pending tasks |
| Low worker availability | Registered workers < 50% of expected pool |
| LLM cost spike | Daily astra_llm_cost_dollars rate > 2× rolling daily average |
Tracing¶
Distributed traces link requests, task runs, and tool calls via shared correlation IDs. Stack details: PRD §17.
Dashboards¶
Platform dashboard (gateway-hosted) and Grafana views summarise health, cost, and throughput — see PRD for surfaces.