Distributed Workers
===================

AGILAB supports distributed execution across remote worker machines, but the
usual workflow is UI-driven rather than handwritten.

.. important::

   You usually do not write ``AGI.install(...)`` or ``AGI.run(...)`` by hand.
   Configure the cluster in **ORCHESTRATE**, let AGILAB generate the snippet,
   then import or regenerate that generated step in **PIPELINE**.

For most users, the recommended sequence is:

1. Configure scheduler, workers, and execution flags in :doc:`execute-help`.
2. Let **ORCHESTRATE** generate the ``AGI.install(...)``,
   ``AGI.get_distrib(...)``, or ``AGI.run(...)`` snippet for the current
   setup.
3. Reuse that generated snippet in :doc:`experiment-help` when you want the
   distributed run to become a reproducible pipeline step.

You normally do not start by writing cluster orchestration code from scratch.

.. figure:: diagrams/distributed_orchestrate_pipeline_handoff.svg
   :alt: Diagram showing the UI-driven workflow from ORCHESTRATE to PIPELINE for distributed workers.
   :align: center
   :class: diagram-panel diagram-wide

   The supported workflow is: configure in ORCHESTRATE, generate the snippet
   there, then reuse that generated step in PIPELINE.

Prerequisites
-------------

Before configuring distributed workers, make sure the environment is ready:

- The machine running AGILAB can reach every worker over the network.
- SSH access works non-interactively from the manager to every worker.
- A shared writable cluster path is mounted on every node at the same
  effective location. In cluster mode, do not rely on ``AGI_LOCAL_SHARE`` as
  a fallback.
- ``uv`` and the required Python runtime are available on the manager and the
  remote workers.
- The target app installs cleanly before you scale it to more nodes.

Use :doc:`key-generation`, :doc:`environment`, and :doc:`troubleshooting` if
any of those assumptions are not already true.
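The reachability and shared-path prerequisites can be sanity-checked from the manager with a short script. This is an illustrative sketch, not part of AGILAB: ``WORKERS`` and ``CLUSTER_SHARE`` are hypothetical placeholder values to replace with your own cluster definition, and only standard-library calls are used.

.. code-block:: python

   """Sanity-check distributed-worker prerequisites from the manager node."""
   import shlex
   import subprocess

   # Hypothetical values -- replace with your worker hosts and cluster share path.
   WORKERS = ["192.168.1.21", "192.168.1.22"]
   CLUSTER_SHARE = "/mnt/agi_cluster_share"


   def ssh_check_cmd(host: str, remote_cmd: str) -> list[str]:
       # BatchMode makes SSH fail immediately instead of prompting for a
       # password, which is exactly the non-interactive behaviour needed here.
       return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5",
               host, remote_cmd]


   def check_worker(host: str, share: str = CLUSTER_SHARE) -> bool:
       """True if `host` answers non-interactively and `share` is writable there."""
       probe = f"test -w {shlex.quote(share)}"
       result = subprocess.run(ssh_check_cmd(host, probe), capture_output=True)
       return result.returncode == 0


   # Example usage: report = {host: check_worker(host) for host in WORKERS}

If any host reports ``False``, fix SSH keys or the share mount (see :doc:`key-generation` and :doc:`troubleshooting`) before configuring the cluster in ORCHESTRATE.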
Step 1: Configure Distributed Execution in ORCHESTRATE
------------------------------------------------------

Open :doc:`execute-help` and use **System settings** as the source of truth
for cluster execution. Typical distributed settings include:

- enabling the Dask / cluster execution path
- choosing the scheduler host
- defining the worker host map (for example
  ``{"192.168.1.21": 1, "192.168.1.22": 1}``)
- enabling or disabling ``pool``, ``cython``, and ``rapids`` according to the
  worker capabilities

These values are persisted in the per-user workspace copy of
``app_settings.toml``, so future snippet generations stay aligned with the
same cluster definition.

.. figure:: _static/page-shots/orchestrate-page.png
   :alt: Screenshot of the ORCHESTRATE page showing system settings, install, distribute, and run areas.
   :align: center
   :class: diagram-panel diagram-wide

   ORCHESTRATE is where you define distributed worker settings, run INSTALL,
   inspect distribution, and generate the final run snippet.

Step 2: Let ORCHESTRATE Generate the Snippet
--------------------------------------------

Once the distributed settings are configured, ORCHESTRATE generates the
deployment and execution code for you. Use the generated sections in this
order:

- **Install** to generate and run the ``AGI.install(...)`` snippet that
  stages the worker runtime on the selected nodes
- **Distribute** to generate ``AGI.get_distrib(...)`` and inspect how the
  work plan is partitioned before running it
- **Run** to generate the final ``AGI.run(...)`` snippet for the configured
  distributed setup

Treat these snippets as generated operational artifacts, not as examples you
must manually reconstruct first.

Reading ``mode`` and ``modes_enabled``
--------------------------------------

The generated install and run snippets usually contain one of these fields:

.. list-table::
   :header-rows: 1
   :widths: 18 18 64

   * - Field
     - Typical snippet
     - Meaning
   * - ``modes_enabled``
     - ``AGI.install(...)``
     - Bitmask of the execution capabilities that should be staged during
       install.
   * - ``mode``
     - ``AGI.run(...)``
     - One concrete execution mode selected for the current run.

Both use the same bit values derived from the ORCHESTRATE toggles:

.. list-table::
   :header-rows: 1
   :widths: 20 14 66

   * - Toggle
     - Bit value
     - Meaning
   * - ``pool``
     - ``1``
     - Multiprocessing / worker-pool execution path.
   * - ``cython``
     - ``2``
     - Compiled worker path when a Cython build exists.
   * - ``cluster_enabled``
     - ``4``
     - Distributed Dask scheduler / remote worker execution.
   * - ``rapids``
     - ``8``
     - RAPIDS / GPU execution path when supported by the target workers.

So a value such as ``13`` means ``4 + 8 + 1``: distributed cluster execution
with ``rapids`` and ``pool`` enabled. A value of ``15`` means all four flags
are enabled.

You normally do not enter these integers yourself. ORCHESTRATE computes them
from the current cluster toggles and inserts the decoded value into the
generated snippet.

Quick UI Walkthrough
--------------------

Use this short checklist the first time:

1. In **ORCHESTRATE**, open **System settings** and enter the scheduler host
   and worker map.
2. Use **INSTALL** to stage the worker runtime on the selected machines.
3. Use **CHECK DISTRIBUTE** to inspect the generated ``AGI.get_distrib(...)``
   plan and confirm the partitions land on the intended workers.
4. Use **RUN** to generate the current ``AGI.run(...)`` snippet for that
   setup.
5. In **PIPELINE**, open **Add step** or **New step**, then import or
   regenerate that generated run step instead of rewriting it manually.

Equivalent Generated Snippets
-----------------------------

If you want the simplest mental model first, start with a local-only run:

.. figure:: diagrams/same_api_local_vs_distributed.svg
   :alt: Compact comparison showing that local and distributed AGI.run snippets share the same API shape.
   :align: center
   :class: diagram-panel diagram-wide

   Read distributed snippets as the same ``AGI.run(...)`` call with a few
   extra cluster fields.

.. code-block:: python

   import asyncio

   from agi_cluster.agi_distributor import AGI
   from agi_env import AgiEnv


   async def main():
       app_env = AgiEnv(app="mycode_project", verbose=1)
       result = await AGI.run(
           app_env,
           mode=0,  # plain local Python execution
       )
       print(result)

   asyncio.run(main())

That is the same API, but without scheduler, workers, or distributed flags.
Once that mental model is clear, the distributed variant is the same call
with the cluster fields filled in: ORCHESTRATE emits a snippet equivalent to
the current UI configuration. A distributed ``AGI.run(...)`` snippet
typically looks like this:

.. code-block:: python

   import asyncio

   from agi_cluster.agi_distributor import AGI
   from agi_env import AgiEnv


   async def main():
       app_env = AgiEnv(app="mycode_project", verbose=1)
       workers = {
           "192.168.1.21": 1,  # one worker slot on host 1
           "192.168.1.22": 1,  # one worker slot on host 2
       }
       result = await AGI.run(
           app_env,
           scheduler="192.168.1.10",
           workers=workers,
           mode=4,  # distributed cluster execution, no pool/cython/rapids extras
       )
       print(result)

   asyncio.run(main())

In normal usage, you get this from ORCHESTRATE after setting the scheduler
and worker hosts in the UI.

Step 3: Validate the Distribution Before Running
------------------------------------------------

Before launching a large distributed run, use **CHECK DISTRIBUTE** in
ORCHESTRATE.
This gives you:

- the generated ``AGI.get_distrib(...)`` snippet
- a **Distribution tree** view of the current work plan
- the **Workplan** editor so you can reassign partitions to different workers

This step is the fastest way to catch obvious mismatches such as:

- too many partitions for the selected workers
- all partitions being assigned to one host
- cluster settings changed in the UI while an old run snippet is still reused

Step 4: Reuse the Generated Snippet in PIPELINE
-----------------------------------------------

When the distributed run should become part of a repeatable workflow, move to
:doc:`experiment-help`. The normal reuse path is:

1. Generate the install / distribute / run snippet in ORCHESTRATE.
2. On **PIPELINE**, open **Add step** (or **New step** on a fresh lab).
3. Import the generated snippet as the step source, or regenerate it from the
   latest settings.
4. Run the imported step from PIPELINE so the distributed orchestration
   becomes part of ``lab_steps.toml`` and the tracked experiment history.

.. important::

   Imported snippets are snapshots. If you change worker hosts, execution
   flags, or app arguments in ORCHESTRATE, regenerate or re-import the
   snippet before running it again in PIPELINE.

.. figure:: _static/page-shots/pipeline-page.png
   :alt: Screenshot of the PIPELINE page showing the lab-step workspace where generated snippets are imported and rerun.
   :align: center
   :class: diagram-panel diagram-wide

   PIPELINE is where the generated distributed snippet becomes a tracked,
   reusable step in ``lab_steps.toml``.

Best Practices
--------------

Use these habits to keep distributed runs predictable:

- Start with one local scheduler and one remote worker before scaling to many
  nodes.
- Keep ``AGI_CLUSTER_SHARE`` mounted and writable on every node at the same
  effective path.
- Keep the cluster share and the local share conceptually separate. In
  cluster mode, outputs should land on the shared cluster path, not silently
  on local-only storage.
- Re-run **INSTALL** after dependency changes, worker-environment changes, or
  app updates that affect imports.
- Use **CHECK DISTRIBUTE** before expensive runs so the partitioning matches
  the intended worker layout.
- Size worker counts to the actual workload. More workers do not
  automatically mean better performance if the work plan is small or heavily
  serialized.
- Keep generated snippets in sync with the current UI state. Do not assume an
  older exported script still matches the latest app configuration.

Troubleshooting
---------------

Common distributed setup failures usually fall into one of these categories:

- **INSTALL hangs or never starts remotely**: verify SSH reachability, keys,
  and host trust.
- **Workers do not join the scheduler**: verify the scheduler host is
  reachable from the workers and that the worker host definitions are
  correct.
- **Outputs go to the wrong place**: verify cluster mode is enabled and the
  shared cluster path is mounted consistently across nodes.
- **Remote import errors after a successful install**: verify the worker
  environment was rebuilt from the current app and that dependencies are
  declared in the correct ``pyproject.toml`` scope.
- **PIPELINE runs stale cluster code**: regenerate or re-import the snippet
  from ORCHESTRATE after changing worker or app settings.

See also:

- :doc:`execute-help`
- :doc:`experiment-help`
- :doc:`cluster`
- :doc:`key-generation`
- :doc:`troubleshooting`