Cluster
Overview
A multi-node cluster is optional. It lets you run AGILab workloads at scale by spawning workers on remote machines (typically via Dask Distributed).
Principle
A cluster is a set of machines (nodes) reachable over the network.
One node acts as the scheduler and others act as workers (Dask terminology).
AGILab can deploy and run your project on the nodes, but it requires SSH access to each machine (a running SSH server on the workers, and an SSH client on the machine where you launch AGILab).
Normal AGILAB Workflow
Most users should not begin by hand-writing AGI.install(...) or
AGI.run(...) code for cluster execution.
The normal workflow is:
Configure distributed settings in ORCHESTRATE.
Let ORCHESTRATE generate the install, distribution, and run snippets for the current cluster definition.
Reuse the generated run snippet in WORKFLOW when you want the distributed orchestration to become a Pipeline stage.
See Distributed Workers for the practical guide.
Repeatable cluster proof
Use this after the local Quick-Start proof works. The goal is a small, repeatable two-node validation before broader distributed experiments.
The Flight cluster doctor creates tiny synthetic Flight CSVs, mirrors them to
the remote worker under the same $HOME/localshare/... path, then validates
the cluster-share contract before running compute:
the scheduler writes a sentinel under --cluster-share
each remote worker must read it through --remote-cluster-share
after AGI.install plus AGI.run in Dask mode, the scheduler must see worker outputs through the local cluster-share path
A remote-only output directory is reported as a failure because it does not prove a shared cluster filesystem.
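If you want to spot-check that contract by hand before running the doctor, a minimal sketch follows; the paths are the same placeholders used by the --cluster-share and --remote-cluster-share flags, not fixed AGILAB locations:
# Write a sentinel on the scheduler side of the share.
echo "sentinel-$(date +%s)" > /path/to/scheduler/clustershare/agilab-two-node/sentinel.txt
# Read it back through the worker-side path; matching content indicates a shared filesystem.
ssh <worker-user>@<worker-host> 'cat /path/to/worker/clustershare/agilab-two-node/sentinel.txt'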
Discover candidate workers
If you do not know which LAN machines are ready to use, run discovery first:
agilab doctor --discover-lan --remote-user <worker-user>
The discovery pass combines passive sources such as known_hosts, SSH config,
the ARP table, and the local AGILAB LAN cache with a bounded SSH-port scan of
local private /24 networks. It does not guess passwords. Each reachable node
is classified by SSH BatchMode auth, operating system, python3, uv,
sshfs, and reverse SSH back to the scheduler when --scheduler is
provided.
Windows managers can run discovery when the OpenSSH client is installed; AGILAB
parses Windows ipconfig and arp -a output to find local LAN candidates.
Windows remote workers are not covered by this cluster proof yet. Worker
probing, SSHFS setup, and generated install/run commands currently assume a
POSIX shell on Linux or macOS workers.
Use --json or --summary-json when automation needs the machine-readable
report:
agilab doctor --discover-lan \
--remote-user <worker-user> \
--scheduler <scheduler-host> \
--cidr <lan-cidr> \
--json
Use the reported ready nodes as explicit --workers values. If discovery
reports ssh-auth-needed, python-missing, uv-missing,
sshfs-missing, or reverse-ssh-needed, fix that prerequisite before
running the cluster-share setup.
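When scripting against the machine-readable report, something like the following sketch can extract ready nodes; the JSON field names here (nodes, status, host) are assumptions, so confirm them against the report your AGILAB version actually emits:
agilab doctor --discover-lan --remote-user <worker-user> --json > discovery.json
# Print hosts classified as ready (field names are hypothetical).
jq -r '.nodes[] | select(.status == "ready") | .host' discovery.json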
SSHFS prerequisites by operating system
AGILAB’s automatic SSHFS setup runs on each remote worker over SSH. The worker
must therefore provide a POSIX shell, sshfs, a FUSE implementation, reverse
SSH back to the scheduler, and a trusted scheduler host key.
Linux worker
On Debian or Ubuntu workers:
sudo apt-get update
sudo apt-get install -y sshfs
command -v sshfs
Then verify reverse SSH from the worker to the scheduler account:
scheduler_user="<scheduler-user>"
scheduler_host="<scheduler-host>"
ssh -o BatchMode=yes "$scheduler_user@$scheduler_host" hostname
If host-key trust is missing, verify the scheduler fingerprint out of band, then
seed the worker known_hosts file:
scheduler_host="<scheduler-host>"
mkdir -p ~/.ssh
chmod 700 ~/.ssh
ssh-keyscan -H -t ed25519,rsa,ecdsa "$scheduler_host" >> ~/.ssh/known_hosts
For Fedora/RHEL-family workers, install the distribution SSHFS/FUSE package
(sshfs or fuse-sshfs depending on the release), then rerun the same
command -v sshfs and reverse-SSH checks.
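For example, a minimal sketch on a Fedora/RHEL-family worker; which package name succeeds depends on the release, as noted above:
sudo dnf install -y sshfs || sudo dnf install -y fuse-sshfs
command -v sshfs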
macOS worker
On macOS workers, make the SSHFS prerequisite explicit before running
--setup-share:
install a FUSE-backed SSHFS implementation such as FUSE-T SSHFS or macFUSE plus SSHFS
ensure sshfs is visible to non-interactive SSH commands, for example ssh <worker> 'command -v sshfs'
ensure the worker can SSH back to the scheduler user referenced by --scheduler, because the worker-side mount command reads <scheduler-user>@<scheduler>:/...
ensure the worker trusts the scheduler host key before running --setup-share or ORCHESTRATE INSTALL. For a new scheduler host, verify the fingerprint out of band, then seed known_hosts on the worker with ssh-keyscan -H <scheduler-host> >> ~/.ssh/known_hosts.
On older macOS hosts, Homebrew may exist at /usr/local/Homebrew/bin/brew
without being on the SSH PATH. If command -v brew is empty, check that
location before assuming no package manager exists. If sshfs lands under
/usr/local/bin, add that directory to the remote user’s non-interactive shell
startup, then re-check with ssh <worker> 'command -v sshfs'.
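A quick hedged check for that situation, run from the scheduler against the worker:
# Look for brew on the non-interactive PATH, then at the legacy location.
ssh <worker-host> 'command -v brew || ls /usr/local/Homebrew/bin/brew'
# Confirm whether sshfs is already visible or sitting under /usr/local/bin.
ssh <worker-host> 'command -v sshfs || ls /usr/local/bin/sshfs'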
Typical FUSE-T SSHFS setup:
HOMEBREW_NO_AUTO_UPDATE=1 brew install macos-fuse-t/homebrew-cask/fuse-t-sshfs
command -v sshfs
If non-interactive SSH cannot see Homebrew binaries, add the Homebrew binary directory to the remote user’s shell startup:
case ":$PATH:" in
*:/opt/homebrew/bin:*) ;;
*) export PATH="/opt/homebrew/bin:$PATH" ;;
esac
case ":$PATH:" in
*:/usr/local/bin:*) ;;
*) export PATH="/usr/local/bin:$PATH" ;;
esac
Then verify reverse SSH and host-key trust exactly as for Linux workers.
Windows manager or scheduler
Windows can be used as an AGILAB UI/manager or scheduler for a cluster whose remote workers are Linux/macOS. Install and enable the OpenSSH client, then use the same AGILAB doctor commands from PowerShell:
Get-WindowsCapability -Online -Name OpenSSH.Client*
Add-WindowsCapability -Online -Name OpenSSH.Client~~~~0.0.1.0
ssh -V
Keep the cluster-share setting portable when workers are not Windows:
$env:AGI_CLUSTER_SHARE = "clustershare/agilab-two-node"
Seed SSH host trust from PowerShell when the Windows scheduler must SSH to a worker:
$workerHost = "<worker-host>"
New-Item -ItemType Directory -Force "$HOME\.ssh" | Out-Null
ssh-keyscan -H -t ed25519,rsa,ecdsa $workerHost | Out-File -Append -Encoding ascii "$HOME\.ssh\known_hosts"
Windows as a remote cluster worker is a separate support target and is not
covered by the automatic SSHFS setup today. The generated remote setup commands
expect POSIX tools such as mount, sshfs, fusermount3/fusermount,
or umount. If you need a Windows worker, treat it as a manual validation
target first, for example with WinFsp/SSHFS-Win, and do not rely on
--setup-share sshfs --apply until AGILAB has explicit Windows-worker
support.
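If you experiment with a Windows worker anyway, a purely illustrative manual mount from PowerShell, assuming WinFsp and SSHFS-Win are installed; the drive letter and share path are placeholders, and nothing in AGILAB automates or validates this yet:
# Mount the scheduler's cluster share on a spare drive letter via SSHFS-Win.
net use X: \\sshfs\<scheduler-user>@<scheduler-host>\clustershare\agilab-two-node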
Cleanup or scheduler switch
If a worker is moved to another scheduler, or an SSHFS session is left stale after a crash, unmount the old target on the worker before rerunning setup:
ssh <worker-user>@<worker-host> '
REMOTE_SHARE="$HOME/clustershare/agilab-two-node"
fusermount3 -u "$REMOTE_SHARE" 2>/dev/null ||
fusermount -u "$REMOTE_SHARE" 2>/dev/null ||
umount "$REMOTE_SHARE" 2>/dev/null ||
true
'
Then rerun agilab doctor --cluster --setup-share sshfs --apply or
ORCHESTRATE INSTALL. The automatic installer already attempts this cleanup
for stale, unexpected-source, or unwritable mounts, but doing it manually makes
scheduler switches explicit and easier to audit.
Run the cluster proof
From a source checkout:
uv --preview-features extra-build-dependencies run python tools/cluster_flight_validation.py \
--cluster \
--scheduler <scheduler-host> \
--workers <worker-user>@<worker-host> \
--cluster-share /path/to/scheduler/clustershare/agilab-two-node \
--remote-cluster-share /path/to/worker/clustershare/agilab-two-node
From an installed package, use the same doctor through the public CLI:
agilab doctor --cluster \
--scheduler <scheduler-host> \
--workers <worker-user>@<worker-host> \
--cluster-share /path/to/scheduler/clustershare/agilab-two-node \
--remote-cluster-share /path/to/worker/clustershare/agilab-two-node
After any share repair, the narrow release gate is a standalone --share-check-only doctor run (sketched below). Rerun the full Flight cluster validation only when you need fresh install, compute, and output-visibility evidence.
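A hedged sketch of that narrow gate; the flag composition mirrors the full doctor invocation above and is an assumption, so confirm the exact options against agilab doctor --help for your version:
agilab doctor --cluster --share-check-only \
--scheduler <scheduler-host> \
--workers <worker-user>@<worker-host> \
--cluster-share /path/to/scheduler/clustershare/agilab-two-node \
--remote-cluster-share /path/to/worker/clustershare/agilab-two-node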
For a stricter two-node proof, run with only the remote worker in
--workers. The install log should show AGILAB adding
dask[distributed] to the generated wenv/<app>_worker environment before
launching the remote dask worker process, and the run log should show the
remote worker executing the Flight batches. The scheduler must then see the
remote outputs through --cluster-share.
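One way to spot-check that evidence, assuming you captured the install and run output to files (the log paths are assumptions, not AGILAB defaults):
# Expect a line showing dask[distributed] added to the generated worker environment.
grep -n 'dask\[distributed\]' install.log
# Expect the remote worker, not the scheduler, executing the Flight batches.
grep -ni 'flight' run.log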
SSH key setup
You typically generate a key pair on the machine running AGILab and copy the public key to each worker node so the deploy/run steps can connect without interactive prompts.
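A minimal sketch of that setup with OpenSSH, using the same placeholders as the rest of this page:
# Generate an ed25519 key pair on the machine running AGILab.
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519
# Install the public key on each worker node.
ssh-copy-id -i ~/.ssh/id_ed25519.pub <worker-user>@<worker-host>
# Confirm non-interactive auth works before deploy/run.
ssh -o BatchMode=yes <worker-user>@<worker-host> hostname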