Execution Playground

The built-in execution playground is the quickest way to show what AGILAB adds on top of a plain dataframe benchmark.

Instead of only comparing libraries, AGILAB compares execution models on the same workload and keeps the whole orchestration path visible.

What is included

Two built-in projects ship the same synthetic workload:

  • execution_pandas_project

  • execution_polars_project

They both read the same generated CSV dataset under execution_playground/dataset and produce grouped benchmark outputs.

The difference is the worker path:

  • ExecutionPandasWorker extends PandasWorker

  • ExecutionPolarsWorker extends PolarsWorker

That lets AGILAB expose not only timing differences, but also the execution style behind them.
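The subclassing relationship can be sketched in a few lines. The base-class internals are not shown on this page, so the method names below (load_partition, work) are illustrative stand-ins rather than the real PandasWorker API:

```python
import pandas as pd

class PandasWorker:
    """Stand-in for AGILAB's PandasWorker base class; the method names
    and hooks here are assumptions for illustration."""
    def load_partition(self, path):
        return pd.read_csv(path)

class ExecutionPandasWorker(PandasWorker):
    """Benchmark worker: runs the shared synthetic workload through the
    pandas execution path inherited from the base class."""
    def work(self, df):
        # Produce a grouped benchmark output from the shared dataset.
        return df.groupby("group", as_index=False)["score"].mean()
```

The polars app mirrors this shape with ExecutionPolarsWorker over PolarsWorker, so the workload stays identical while the execution path changes.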

Where you see it in the UI

The two apps are run through the normal AGILAB pages. The benchmark value comes from the fact that the same UI flow can drive two different worker families without changing the orchestration path.

ORCHESTRATE page showing deployment toggles and generated execution setup

The benchmark setup uses the normal PROJECT -> ORCHESTRATE flow rather than a separate one-off demo script.

Why this example matters

Many benchmark demos stop at:

  • pandas vs polars

  • local vs distributed

  • Python vs compiled

AGILAB goes one step further:

  • same workload

  • same orchestration flow

  • same benchmark UI

  • different worker/runtime path

This makes it easier to answer the practical question:

Did performance improve because of the library, or because of the execution model?

What the benchmark shows

For this example, the public message is intentionally simple:

  • PandasWorker highlights a process-oriented worker path

  • PolarsWorker highlights an in-process threaded worker path

The pool flag is AGILAB’s external local fan-out. It does not mean the same runtime shape for every dataframe library:

App-by-app reading of the pool flag:

  • execution_pandas_project — process-backed worker pool: Pandas can benefit when independent file partitions are spread across worker processes, especially for Python-bound or mixed Python/native sections.

  • execution_polars_project — thread pool around Polars calls: Polars already manages native parallelism inside the library. Adding an external AGILAB pool can compete with that internal pool and may add scheduling or memory-bandwidth overhead instead of improving throughput.
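The contrast between the two pool backends can be sketched with the standard library. This is a minimal illustration of the trade-off, not AGILAB's actual pool code:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def run_partitions(fn, partitions, use_processes, max_workers=4):
    """Fan a per-partition function out over a local pool.

    A process pool sidesteps the GIL for Python-bound sections (the
    pandas reading above); a thread pool avoids pickling and process
    startup, which suits libraries like Polars that already run native
    parallel work internally.
    """
    pool_cls = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    with pool_cls(max_workers=max_workers) as pool:
        return list(pool.map(fn, partitions))
```

The key point mirrored in the benchmark: the pool flag chooses an external fan-out, while the dataframe library may already own the inner parallelism.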

The benchmark results in ORCHESTRATE then let you compare timings while the rest of AGILAB still shows:

  • install state

  • distribution plan

  • generated snippets

  • exported outputs

Measured local benchmark

The repository ships a reproducible benchmark helper:

uv --preview-features extra-build-dependencies run python tools/benchmark_execution_playground.py --repeats 3 --warmups 1 --worker-counts 1,2,4,8 --rows-per-file 100000 --compute-passes 32 --n-partitions 16

The helper now resolves its built-in app paths from the script location, so it can be launched from any working directory inside or outside the repo root.
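Resolving paths from the script location rather than the working directory is the standard pathlib pattern; the directory layout assumed below is illustrative:

```python
from pathlib import Path

def builtin_app_path(script_path, app_name):
    """Resolve a built-in app directory relative to the helper script,
    assuming the layout <repo>/tools/<helper>.py, so the helper can run
    from any working directory."""
    repo_root = Path(script_path).resolve().parents[1]
    return repo_root / "src" / "agilab" / "apps" / "builtin" / app_name
```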

Median results from a local run on macOS / Python 3.13.9 with 16 partitions, 100000 rows per file, and 32 compute passes are summarized below. These numbers are instructive because the heavier mixed workload separates “more workers” from “better fit”:

  • the pandas process-oriented path is only slightly ahead in local parallel mode at 1 worker (1.772s), then gets worse as worker count rises (2.157s at 8 workers)

  • the polars threaded path improves at 1-2 workers (1.520s, 1.436s) and then converges back toward its steady state (1.564s at 8 workers)

  • AGILAB therefore shows both execution model and worker-count scaling on the same reproducible workload, including the point where an external pool no longer helps because the library already owns the inner parallelism

Raw benchmark artifacts are versioned under:

  • docs/source/data/execution_playground_benchmark.json

Typed Cython kernel proof

execution_pandas_project now includes a second workload shape for the hot numeric section: kernel_mode = "typed_numeric". This is the default for the app. It converts the scoring columns to contiguous float64 arrays and runs the repeated score/checksum loop through a Cython-compatible typed function.

That distinction matters: wrapping Pandas calls in Cython is not a useful proof of Cython acceleration, because Pandas is already executing compiled kernels. The typed kernel keeps the surrounding app realistic while giving Cython a numeric loop where fixed dtypes can remove Python object dispatch.
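The shape of such a kernel can be shown in a few lines. This is not the actual _typed_numeric_score_kernel source, only the pattern: flat fixed-dtype inputs and a tight scalar loop that Cython can compile with C doubles instead of Python objects:

```python
def typed_numeric_score_kernel(scores, weights, passes):
    """Repeated score/checksum loop over fixed-dtype inputs.

    In a Cython build, scores/weights become typed memoryviews
    (double[:]) and checksum a C double, removing per-element Python
    object dispatch; the pure-Python fallback runs the same logic.
    """
    n = len(scores)
    checksum = 0.0
    for _ in range(passes):
        for i in range(n):
            checksum += scores[i] * weights[i]
    return checksum
```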

The kernel-only evidence helper compiles the actual worker source into a temporary Cython extension, then compares the same _typed_numeric_score_kernel through the Python and Cython runtimes:

uv --preview-features extra-build-dependencies run python tools/benchmark_execution_pandas_cython_kernel.py --rows 100000 --compute-passes 32 --repeats 3 --warmups 1

Latest local evidence on macOS / Python 3.13.13:

Typed numeric kernel speedup for execution_pandas_project

runtime | median_seconds | min_seconds | max_seconds | rows_per_second | speedup_vs_python | checksum
python  | 0.620132       | 0.603714    | 0.620513    | 161256          | 1.00              | 157347.169221
cython  | 0.002023       | 0.002015    | 0.002073    | 49431537        | 306.54            | 157347.169221

Read this as a kernel proof, not an end-to-end runtime promise. Full AGILAB runs still include CSV reads, dataframe grouping, result writes, worker startup, and optional Dask/process orchestration. The reducer records kernel_mode, kernel_runtime, and dtype_contract so the output artifact makes that distinction explicit.

2-node 16-mode matrix

The repository also ships a second helper that benchmarks the full 16-mode matrix on 2 Macs over SSH:

uv --preview-features extra-build-dependencies run python tools/benchmark_execution_mode_matrix.py --remote-host <remote-macos-ip> --scheduler-host <local-macos-ip> --rows-per-file 100000 --compute-passes 32 --n-partitions 16 --repeats 2

--remote-host accepts either host or user@host. For portable use, prefer user@host with the real login user of the remote worker. If you pass only a host or IP, the helper currently defaults to agi@<host> for both the SSH probe/setup steps and the dataset rsync step, so only rely on the bare host form when the remote account is actually named agi.
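A minimal sketch of that parsing rule (the default account name reflects the helper's current behavior as described above):

```python
def split_remote_host(value, default_user="agi"):
    """Split a --remote-host value into (user, host); a bare host falls
    back to the helper's default remote account."""
    user, sep, host = value.partition("@")
    return (user, host) if sep else (default_user, value)
```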

This run uses:

  • 1 local macOS ARM scheduler/worker

  • 1 remote macOS ARM worker over SSH

  • the same 16 partitions, 100000 rows per file, and 32 compute passes

Mode families

The 16 modes split into 4 families:

  • 0-3: local CPU modes

  • 4-7: 2-node Dask modes

  • 8-11: local modes with the RAPIDS bit requested

  • 12-15: 2-node Dask modes with the RAPIDS bit requested

The compact code column uses the order r d c p:

  • r = RAPIDS requested

  • d = Dask / cluster topology

  • c = Cython requested

  • p = worker pool / local fan-out requested; the backend may be process- or thread-based depending on the worker implementation
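The mapping from mode number to compact code can be reconstructed from the families above; the bit weights (pool=1, cython=2, dask=4, rapids=8) are inferred from the mode tables on this page, not taken from AGILAB source:

```python
def mode_code(mode):
    """Render one of the 16 execution modes as the compact r/d/c/p code,
    e.g. 0 -> '____', 5 -> '_d_p', 15 -> 'rdcp'.

    Bit weights inferred from the families: pool=1, cython=2, dask=4,
    rapids=8."""
    bits = [("r", 8), ("d", 4), ("c", 2), ("p", 1)]
    return "".join(ch if mode & w else "_" for ch, w in bits)
```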

In the versioned benchmark artifacts shipped with the repository, the r... and rd... modes are still CPU-only because neither node exposed NVIDIA tooling on that capture. The helper still reports RAPIDS requests explicitly, and on other hardware it can mark local-only RAPIDS rows as GPU-accelerated even if the remote node stays CPU-only.

How to read the matrix quickly

  1. Ignore rows 8-15 for performance interpretation in the versioned capture below: they keep the RAPIDS bit visible, but they are still CPU-only there.

  2. Read the matrix by families, not by isolated rows:

    • local Python/Cython baseline: modes 0 and 2

    • local pool/fan-out family: modes 1 and 3

    • 2-node Dask family: 4-7

  3. Compare each family back to mode 0 (____) to see whether the execution model is buying you anything.

Visual summary of execution mode families for execution_pandas_project and execution_polars_project

Compact map of the 16 execution modes grouped by topology and runtime family.

ORCHESTRATE table snapshot

The tables below mirror the ORCHESTRATE > Benchmark results expander rather than a separate screenshot. The UI reads benchmark.json, uses the run-mode keys as the row index, and displays the columns mode, timing, nodes, seconds, order, delta, and delta (%). The nodes column shows how many machines were used by the run; the extra topology column below keeps the docs readable outside the app.
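The derived columns can be reproduced from the raw per-mode timings. The derivation below is inferred from the snapshot values rather than taken from the UI source:

```python
def benchmark_table_rows(seconds_by_mode):
    """Derive the ORCHESTRATE columns from raw timings: order ranks the
    modes fastest-first, delta and delta (%) compare each mode to the
    best one (so the winner reads 0.000 / 0.00)."""
    best = min(seconds_by_mode.values())
    ranking = sorted(seconds_by_mode, key=seconds_by_mode.get)
    return {
        mode: {
            "timing": f"{secs:.2f} seconds",
            "seconds": round(secs, 3),
            "order": ranking.index(mode) + 1,
            "delta": round(secs - best, 3),
            "delta (%)": round((secs - best) / best * 100, 2),
        }
        for mode, secs in seconds_by_mode.items()
    }
```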

Benchmark results snapshot for execution_pandas_project (“2-node cluster” = 1 local + 1 remote macOS worker)

run_mode | nodes | mode                                | timing       | seconds | order | delta | delta (%) | topology
0        | 1     | python                              | 0.89 seconds | 0.885   | 14    | 0.345 | 63.89     | local only
1        | 1     | pool of process                     | 0.58 seconds | 0.585   | 6     | 0.045 | 8.33      | local only
2        | 1     | cython                              | 0.91 seconds | 0.910   | 16    | 0.370 | 68.52     | local only
3        | 1     | pool and cython                     | 0.57 seconds | 0.575   | 3     | 0.035 | 6.48      | local only
4        | 2     | dask                                | 0.54 seconds | 0.540   | 1     | 0.000 | 0.00      | 2-node cluster
5        | 2     | dask and pool                       | 0.61 seconds | 0.613   | 12    | 0.073 | 13.52     | 2-node cluster
6        | 2     | dask and cython                     | 0.56 seconds | 0.561   | 2     | 0.021 | 3.89      | 2-node cluster
7        | 2     | dask and pool and cython            | 0.58 seconds | 0.583   | 5     | 0.043 | 7.96      | 2-node cluster
8        | 1     | rapids                              | 0.86 seconds | 0.860   | 13    | 0.320 | 59.26     | local only
9        | 1     | rapids and pool                     | 0.58 seconds | 0.585   | 7     | 0.045 | 8.33      | local only
10       | 1     | rapids and cython                   | 0.89 seconds | 0.885   | 15    | 0.345 | 63.89     | local only
11       | 1     | rapids and pool and cython          | 0.57 seconds | 0.575   | 4     | 0.035 | 6.48      | local only
12       | 2     | rapids and dask                     | 0.59 seconds | 0.586   | 8     | 0.046 | 8.52      | 2-node cluster
13       | 2     | rapids and dask and pool            | 0.60 seconds | 0.596   | 11    | 0.056 | 10.37     | 2-node cluster
14       | 2     | rapids and dask and cython          | 0.59 seconds | 0.589   | 10    | 0.049 | 9.07      | 2-node cluster
15       | 2     | rapids and dask and pool and cython | 0.59 seconds | 0.588   | 9     | 0.048 | 8.89      | 2-node cluster

Benchmark results snapshot for execution_polars_project (“2-node cluster” = 1 local + 1 remote macOS worker)

run_mode | nodes | mode                                | timing       | seconds | order | delta | delta (%) | topology
0        | 1     | python                              | 0.89 seconds | 0.885   | 14    | 0.623 | 237.79    | local only
1        | 1     | pool of process                     | 0.43 seconds | 0.430   | 9     | 0.168 | 64.12     | local only
2        | 1     | cython                              | 0.90 seconds | 0.900   | 16    | 0.638 | 243.51    | local only
3        | 1     | pool and cython                     | 0.45 seconds | 0.445   | 12    | 0.183 | 69.85     | local only
4        | 2     | dask                                | 0.31 seconds | 0.307   | 5     | 0.045 | 17.18     | 2-node cluster
5        | 2     | dask and pool                       | 0.26 seconds | 0.262   | 1     | 0.000 | 0.00      | 2-node cluster
6        | 2     | dask and cython                     | 0.31 seconds | 0.307   | 6     | 0.045 | 17.18     | 2-node cluster
7        | 2     | dask and pool and cython            | 0.30 seconds | 0.304   | 2     | 0.042 | 16.03     | 2-node cluster
8        | 1     | rapids                              | 0.88 seconds | 0.875   | 13    | 0.613 | 233.97    | local only
9        | 1     | rapids and pool                     | 0.43 seconds | 0.430   | 10    | 0.168 | 64.12     | local only
10       | 1     | rapids and cython                   | 0.90 seconds | 0.895   | 15    | 0.633 | 241.60    | local only
11       | 1     | rapids and pool and cython          | 0.44 seconds | 0.440   | 11    | 0.178 | 67.94     | local only
12       | 2     | rapids and dask                     | 0.31 seconds | 0.310   | 7     | 0.048 | 18.32     | 2-node cluster
13       | 2     | rapids and dask and pool            | 0.30 seconds | 0.305   | 3     | 0.043 | 16.41     | 2-node cluster
14       | 2     | rapids and dask and cython          | 0.31 seconds | 0.306   | 4     | 0.044 | 16.79     | 2-node cluster
15       | 2     | rapids and dask and pool and cython | 0.34 seconds | 0.336   | 8     | 0.074 | 28.24     | 2-node cluster

execution_pandas_project

Use this app when you want the benchmark to read as a process-oriented baseline.

  • Worker family: ExecutionPandasWorker over PandasWorker

  • Story to tell: how far a process-backed worker pool, Cython typed kernel, and Dask path go on the same workload

  • What to inspect in AGILAB: install/distribution state in ORCHESTRATE, then the benchmark table and exported artifacts for the _d__ family

  • Practical reading: this app is the easiest way to show that “more workers” does not automatically beat the local path unless the execution model fits

16-mode matrix for execution_pandas_project (“2-node cluster” = 1 local + 1 remote macOS worker)

mode | label                               | topology       | median_seconds
0    | python                              | local only     | 0.885
1    | pool of process                     | local only     | 0.585
2    | cython                              | local only     | 0.910
3    | pool and cython                     | local only     | 0.575
4    | dask                                | 2-node cluster | 0.540
5    | dask and pool                       | 2-node cluster | 0.613
6    | dask and cython                     | 2-node cluster | 0.561
7    | dask and pool and cython            | 2-node cluster | 0.583
8    | rapids                              | local only     | 0.860
9    | rapids and pool                     | local only     | 0.585
10   | rapids and cython                   | local only     | 0.885
11   | rapids and pool and cython          | local only     | 0.575
12   | rapids and dask                     | 2-node cluster | 0.586
13   | rapids and dask and pool            | 2-node cluster | 0.596
14   | rapids and dask and cython          | 2-node cluster | 0.589
15   | rapids and dask and pool and cython | 2-node cluster | 0.588

execution_polars_project

Use this app when you want the benchmark to read as an in-process threaded path with a different scaling profile.

  • Worker family: ExecutionPolarsWorker over PolarsWorker

  • Story to tell: the same workload can prefer a lighter in-process path over a heavier process-oriented topology, because Polars already runs native parallel work inside the library

  • What to inspect in AGILAB: the same ORCHESTRATE > Benchmark results table, but with attention on the _d_p family and how it differs from the pandas app

  • Practical reading: this app is the clearest proof that AGILAB is benchmarking execution models, not only dataframe libraries

16-mode matrix for execution_polars_project (“2-node cluster” = 1 local + 1 remote macOS worker)

mode | label                               | topology       | median_seconds
0    | python                              | local only     | 0.885
1    | pool of process                     | local only     | 0.430
2    | cython                              | local only     | 0.900
3    | pool and cython                     | local only     | 0.445
4    | dask                                | 2-node cluster | 0.307
5    | dask and pool                       | 2-node cluster | 0.262
6    | dask and cython                     | 2-node cluster | 0.307
7    | dask and pool and cython            | 2-node cluster | 0.304
8    | rapids                              | local only     | 0.875
9    | rapids and pool                     | local only     | 0.430
10   | rapids and cython                   | local only     | 0.895
11   | rapids and pool and cython          | local only     | 0.440
12   | rapids and dask                     | 2-node cluster | 0.310
13   | rapids and dask and pool            | 2-node cluster | 0.305
14   | rapids and dask and cython          | 2-node cluster | 0.306
15   | rapids and dask and pool and cython | 2-node cluster | 0.336

What the matrix adds

This second benchmark makes several extra points visible:

  • the heavier scalar tail now separates the plain local Python/Cython family, the local pool family, and the 2-node Dask family much more clearly

  • the best mode is not the same for the two worker designs: _d__ for execution_pandas_project and _d_p for execution_polars_project

  • pool is not automatically better: it helps most when AGILAB’s external fan-out adds useful parallelism, and less when the dataframe library already manages its own internal pool

  • a 2-node Dask topology can win for one execution model and not for another

  • requesting RAPIDS on hardware without NVIDIA tooling does not create a fake speedup: AGILAB still reports the run honestly as CPU-only

  • local-only RAPIDS rows and 2-node RAPIDS rows are reported independently, so GPU availability now follows the topology that actually ran

Raw matrix artifacts are versioned under:

  • docs/source/data/execution_mode_matrix_benchmark.json

  • docs/source/data/execution_mode_matrix_benchmark.csv

  • docs/source/data/execution_pandas_benchmark_results_snapshot.csv

  • docs/source/data/execution_polars_benchmark_results_snapshot.csv

  • docs/source/data/execution_pandas_project_mode_matrix.csv

  • docs/source/data/execution_polars_project_mode_matrix.csv

How to run it

  1. Launch AGILAB:

    uv --preview-features extra-build-dependencies run streamlit run src/agilab/About_agilab.py
    
  2. In PROJECT, select src/agilab/apps/builtin/execution_pandas_project.

  3. In ORCHESTRATE, run INSTALL once, then EXECUTE.

  4. Enable Benchmark all modes when you want AGILAB to compare execution paths.

  5. Repeat with src/agilab/apps/builtin/execution_polars_project.

  6. Compare the benchmark table in ORCHESTRATE > Benchmark results and the generated outputs.

What to look for

This example is useful when you want to demonstrate that AGILAB makes three things explicit:

  • the workload

  • the orchestration path

  • the execution model

That is why this example is a better public teaser than a raw benchmark chart: it keeps the result, the runtime path, and the reproducible workflow together.