Service Health JSON Schema
Overview
AGI.serve(..., action="health") returns a machine-readable snapshot with schema
agi.service.health.v1 and writes the same payload to health.json.
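
As a minimal sketch of consuming the exported file, the snippet below loads a `health.json` payload and rejects it when the schema identifier is not the expected one. The helper names and the `schema` key name are assumptions made for illustration; pass a different `schema_key` if the payload stores the identifier elsewhere.

```python
import json
from pathlib import Path

EXPECTED_SCHEMA = "agi.service.health.v1"


def check_schema(payload, schema_key="schema"):
    """Raise ValueError unless the payload carries the expected schema id.

    Assumption: the identifier lives under a top-level key named by
    ``schema_key``; the default "schema" is a guess, not confirmed.
    """
    found = payload.get(schema_key)
    if found != EXPECTED_SCHEMA:
        raise ValueError(
            f"unexpected schema {found!r}, expected {EXPECTED_SCHEMA!r}"
        )
    return payload


def load_health_snapshot(path, schema_key="schema"):
    """Read a health.json file from ``path`` and validate its schema id."""
    text = Path(path).read_text(encoding="utf-8")
    return check_schema(json.loads(text), schema_key=schema_key)
```

Checking the identifier first lets a monitor fail loudly on a future schema version instead of silently misreading fields.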
Core fields
| Field | Type | Description |
|---|---|---|
| | string | Payload schema identifier. Current value: `agi.service.health.v1`. |
| | number | Unix timestamp (seconds) when the snapshot was generated. |
| | string | App name used to build |
| | string | App target identifier used in AGILab paths/state. |
| `status` | string | Service status, for example `idle`, `degraded`, or `error`. |
| | array[string] | Workers currently considered running. |
| | array[string] | Workers pending/not fully stopped during transition phases. |
| | array[string] | Workers auto-restarted during the latest status check. |
| | array[string] | Workers evaluated as healthy. |
| | array[string] | Workers evaluated as unhealthy. |
| | integer | Count of |
| | integer | Count of |
| | integer | Count of |
| | integer | Count of |
| | integer | Count of |
| | object | Pending/running/done/failed queue counts for service artifacts. |
Optional fields
- `client_status`: Dask client status when available.
- `heartbeat_timeout_sec`: effective timeout used by health evaluation.
- `queue_dir`: service queue root path.
- `cleanup`: cleanup counters (done/failed/heartbeats removed).
- `restart_reasons`: map `worker -> reason` for auto-restarts.
- `worker_health`: per-worker detailed rows used in the ORCHESTRATE health table.
- `path`: present on direct `action="health"` responses when JSON export succeeds.
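
Because these fields may be absent, a consumer should read them defensively. The following is a rough sketch, assuming the payload is already parsed into a Python dict; the function name `summarize_optional` and the worker names in the usage example are hypothetical.

```python
def summarize_optional(payload):
    """Build human-readable lines from optional fields that are present."""
    lines = []

    timeout = payload.get("heartbeat_timeout_sec")
    if timeout is not None:
        lines.append(f"heartbeat timeout: {timeout}s")

    # restart_reasons maps worker -> reason; treat a missing or null
    # value as "no restarts reported".
    reasons = payload.get("restart_reasons") or {}
    for worker, reason in sorted(reasons.items()):
        lines.append(f"restarted {worker}: {reason}")

    # path appears only when the JSON export succeeded.
    if "path" in payload:
        lines.append(f"exported to {payload['path']}")

    return lines


if __name__ == "__main__":
    # Hypothetical payload fragment for illustration only.
    sample = {
        "heartbeat_timeout_sec": 30,
        "restart_reasons": {"worker-b": "stale", "worker-a": "missing"},
    }
    for line in summarize_optional(sample):
        print(line)
```

A payload with none of the optional fields yields an empty list, so callers do not need a separate presence check.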
Minimal monitoring rule
Fail the check when one of these conditions is true:
- `status` is `error` or `degraded`.
- `workers_unhealthy_count` is greater than `0`.
- `status` is `idle` and your operational policy requires an active service.
- `workers_restarted_count / workers_running_count` exceeds your restart-rate SLA.
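
The rule above can be sketched as a small function over the parsed payload. This is an illustrative sketch, not the logic of the bundled checker: the function name, the `require_active` and `max_restart_rate` parameters, and the handling of zero running workers (restarts with no running workers count as an unbounded rate) are all assumptions.

```python
def evaluate_health(payload, require_active=False, max_restart_rate=None):
    """Return a list of failure reasons; an empty list means the check passes."""
    failures = []
    status = payload.get("status")

    if status in ("error", "degraded"):
        failures.append(f"status is {status}")

    if payload.get("workers_unhealthy_count", 0) > 0:
        failures.append("unhealthy workers present")

    if status == "idle" and require_active:
        failures.append("service is idle but policy requires it to be active")

    if max_restart_rate is not None:
        running = payload.get("workers_running_count", 0)
        restarted = payload.get("workers_restarted_count", 0)
        if running > 0:
            rate = restarted / running
        else:
            # Assumption: avoid dividing by zero; any restart with no
            # running workers is treated as exceeding every threshold.
            rate = float("inf") if restarted > 0 else 0.0
        if rate > max_restart_rate:
            failures.append(
                f"restart rate {rate:.2f} exceeds {max_restart_rate:.2f}"
            )

    return failures
```

Returning reasons rather than a boolean makes it easier to surface in an alert why the check failed.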
CLI example
From the AGILab repository, run:
uv run python tools/service_health_check.py \
--app mycode_project \
--apps-path src/agilab/apps/builtin
Optional flags:
- `--format prometheus` to emit Prometheus-friendly metrics.
- `--allow-idle` / `--no-allow-idle` to override app defaults.
- `--max-unhealthy <N>` and `--max-restart-rate <R>` to override app defaults.
By default, the checker reads thresholds from `[cluster.service_health]` in
the target app's `app_settings.toml`.