Data Connectors

AGILAB data connectors are a lightweight contract for external data systems. They let an app or evidence report reference a named data source without hard-coding local paths, credentials, or provider-specific client details in the app code.

The current public contract is intentionally conservative:

  • connector definitions live in plain-text TOML catalogs

  • credentials are referenced through environment variables, never embedded

  • public evidence validates contracts without opening external networks

  • live probes stay operator-triggered and optional

  • legacy raw paths can remain available while apps migrate to connector IDs

This is not a second experiment tracker, model registry, or storage UI. It is the data-access contract around AGILAB workflows.

Connector Maturity Levels

Use these labels consistently when reading connector evidence:

| Level | What AGILAB proves | What remains outside the proof |
| --- | --- | --- |
| Local proof | A deterministic local connector such as SQLite produces query results, artifact hashes, and JSON evidence without network access. | Behavior of the eventual production database, account policy, or network path. |
| Contract proof | TOML connector definitions, app/page references, credential-reference shape, and runtime dependency mapping are valid. | Real endpoint reachability, IAM, firewall rules, quota, latency, or billing. |
| Emulator proof | Account-free local emulators match the expected adapter shape for S3, Azure Blob, GCS, or search endpoints. | Real cloud control-plane behavior and managed-service differences. |
| Operator-triggered live check | An explicit user action probes a real endpoint in a prepared environment. | General certification for every region, tenant, credential, or network policy. |

Catalog Shape

The public sample catalog is:

Each connector is a [[connectors]] TOML entry with a stable id, a kind, a human label, and kind-specific fields.

Supported public kinds are:

| Kind | Typical target | Contract boundary |
| --- | --- | --- |
| sql | read-only warehouse or local SQLite proof | validates URI, driver, and query_mode = "read_only" |
| opensearch | OpenSearch / ELK index | validates URL, index, and credential reference |
| object_storage | artifact prefixes in cloud object storage | validates provider, bucket/container, prefix, and credential reference |

Object Storage Providers

Object-storage connectors currently support these providers:

| Provider | Target URI shape | Runtime dependency | Credential hint |
| --- | --- | --- | --- |
| s3 | s3://bucket/prefix | boto3 | AWS_PROFILE or AWS access-key/session environment |
| azure_blob | azure_blob://account/container/prefix | azure-storage-blob | AZURE_STORAGE_CONNECTION_STRING or Azure identity environment |
| gcs | gs://bucket/prefix | google-cloud-storage | GOOGLE_APPLICATION_CREDENTIALS or application-default credentials |

The s3 provider also accepts the aliases aws_s3, amazon_s3, and s3_compatible. The runtime dependency column describes what an operator environment needs for live probes; those packages are not required for the default public contract-validation evidence.
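
The alias and URI-shape rules above can be pictured with a small Python sketch. The function names canonical_provider and target_uri are hypothetical, as is the choice to lower-case and strip the provider name; only the aliases, URI shapes, and package names come from the table and paragraph above.

```python
# Aliases that the text says resolve to the s3 provider.
S3_ALIASES = {"aws_s3", "amazon_s3", "s3_compatible"}

# Target URI shapes from the provider table.
URI_TEMPLATES = {
    "s3": "s3://{bucket}/{prefix}",
    "azure_blob": "azure_blob://{account}/{container}/{prefix}",
    "gcs": "gs://{bucket}/{prefix}",
}

# Packages an operator environment would need for live probes only.
RUNTIME_DEPENDENCY = {
    "s3": "boto3",
    "azure_blob": "azure-storage-blob",
    "gcs": "google-cloud-storage",
}


def canonical_provider(name: str) -> str:
    """Map a provider name or alias to its canonical key (hypothetical helper)."""
    key = name.strip().lower()
    if key in S3_ALIASES:
        return "s3"
    if key not in URI_TEMPLATES:
        raise ValueError(f"unsupported object-storage provider: {name!r}")
    return key


def target_uri(provider: str, **parts: str) -> str:
    """Build the target URI for a provider from its named parts."""
    return URI_TEMPLATES[canonical_provider(provider)].format(**parts)


print(target_uri("aws_s3", bucket="example-bucket", prefix="runs"))
print(RUNTIME_DEPENDENCY[canonical_provider("gcs")])
```

Nothing here imports boto3, azure-storage-blob, or google-cloud-storage, which mirrors the point that those packages matter only once an operator opts in to live probes.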

SQLite Database Proof

Use the packaged SQLite preview when you need a concrete database demo that works on every local machine without a server, Docker, network access, or secrets:

uv --preview-features extra-build-dependencies run python src/agilab/examples/sqlite_connector_proof/preview_sqlite_connector_proof.py --output-dir /tmp/agilab-sqlite-proof

The preview writes:

/tmp/agilab-sqlite-proof/sqlite_connector_proof.db
/tmp/agilab-sqlite-proof/promotion_candidates.csv
/tmp/agilab-sqlite-proof/database_evidence.json

Read database_evidence.json first. It records a sql connector with the sqlite driver, query_mode = "read_only", a schema hash, a parameterized query hash, row count, result hash, and artifact hashes. This proves the AGILAB database boundary locally, before you replace the local URI with Postgres, a warehouse, or another operator-managed SQL source.
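
To make the idea of read-only, hash-backed evidence concrete, here is a minimal Python sketch using only the standard library. It is not the packaged preview script: the table name, the query, and the evidence keys are assumptions chosen for illustration, and the real database_evidence.json may be structured differently.

```python
import hashlib
import json
import sqlite3
from pathlib import Path


def build_demo_db(db_path: Path) -> None:
    """Create a tiny demo database (hypothetical table and rows)."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE candidates (name TEXT, score REAL)")
        conn.executemany(
            "INSERT INTO candidates VALUES (?, ?)",
            [("a", 0.91), ("b", 0.42), ("c", 0.77)],
        )
        conn.commit()
    finally:
        conn.close()


def read_only_evidence(db_path: Path, min_score: float) -> dict:
    """Run a parameterized query on a read-only connection and hash the result."""
    query = "SELECT name, score FROM candidates WHERE score >= ? ORDER BY name"
    # mode=ro makes SQLite reject any write attempted on this connection.
    conn = sqlite3.connect(db_path.resolve().as_uri() + "?mode=ro", uri=True)
    try:
        rows = conn.execute(query, (min_score,)).fetchall()
    finally:
        conn.close()
    payload = json.dumps(rows).encode("utf-8")
    return {
        "driver": "sqlite",
        "query_mode": "read_only",
        "query_sha256": hashlib.sha256(query.encode("utf-8")).hexdigest(),
        "row_count": len(rows),
        "result_sha256": hashlib.sha256(payload).hexdigest(),
    }
```

Because the query text and the ordered result are both hashed, two runs against the same database yield identical evidence, which is what makes a local proof of this kind reviewable.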

Local Artifact Lane Contract

Use the artifact-lane contract when work enters AGILAB as files rather than a database or cloud connector. It is designed for simple, reviewable handoffs such as a data-analyst bundle with raw files, cleaned tables, aggregates, plots, and a report, or a document-ingestion lane with PDFs, Markdown outputs, and processed originals.

python3 tools/data_artifact_lane_contract.py --profile data-analysis --root <bundle> --check --json

For document ingestion lanes whose folders are not under one root, map the roles explicitly:

python3 tools/data_artifact_lane_contract.py \
  --profile document-ingestion \
  --dir input=/path/to/incoming \
  --dir output=/path/to/markdown \
  --dir done=/path/to/done \
  --check --json

The report uses schema agilab.data_artifact_lane_contract.v1. It records the profile, role directories, required artifact rules, matched artifacts, sizes, SHA-256 hashes, and missing-directory or missing-artifact issues. This proves the local file handoff is present and inspectable. It does not prove data correctness, OCR quality, business interpretation, privacy compliance, or background-service liveness.
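
The kind of inventory such a report describes can be sketched in a few lines of Python. This is an illustration only: lane_report and its output keys are hypothetical and do not reproduce the agilab.data_artifact_lane_contract.v1 schema or the profile-specific artifact rules.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def lane_report(role_dirs: dict[str, Path]) -> dict:
    """Inventory each role directory and record missing directories or artifacts."""
    artifacts, issues = [], []
    for role, directory in sorted(role_dirs.items()):
        directory = Path(directory)
        if not directory.is_dir():
            issues.append({"role": role, "issue": "missing-directory"})
            continue
        files = sorted(p for p in directory.rglob("*") if p.is_file())
        if not files:
            issues.append({"role": role, "issue": "missing-artifact"})
        for path in files:
            artifacts.append(
                {
                    "role": role,
                    "path": str(path),
                    "size": path.stat().st_size,
                    "sha256": sha256_of(path),
                }
            )
    return {"artifacts": artifacts, "issues": issues, "ok": not issues}
```

A report like this shows that files exist and have not changed since they were hashed; it says nothing about whether their contents are correct, which matches the limits stated above.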

Account-Free Cloud Emulator Validation

Use the cloud-emulators profile when you need AWS/Azure/GCP connector confidence without owning cloud accounts:

uv --preview-features extra-build-dependencies run python tools/data_connector_cloud_emulator_report.py --compact
uv --preview-features extra-build-dependencies run python tools/workflow_parity.py --profile cloud-emulators

The profile validates the sample emulator catalog against the same connector facility and runtime-adapter contracts used by real cloud targets. It covers:

| Cloud target | Account-free emulator | Local endpoint | What is proven |
| --- | --- | --- | --- |
| AWS S3 / S3-compatible storage | MinIO | http://127.0.0.1:9000 | provider aliasing, bucket/prefix target shape, boto3 dependency |
| Azure Blob Storage | Azurite | http://127.0.0.1:10000/devstoreaccount1 | account/container target shape, azure-storage-blob dependency |
| Google Cloud Storage | fake-gcs-server | http://127.0.0.1:4443 | gs:// target shape, google-cloud-storage dependency |
| Search-index wiring | local OpenSearch or Elasticsearch | http://127.0.0.1:9200 | URL/index contract and explicit credential boundary |

This gives API-contract and emulator-compatible validation only. It does not prove real IAM, cloud firewall rules, private endpoints, regional behavior, quota, or billing. Those remain opt-in live smoke checks in a real operator environment with real credentials.

Credential Rule

Remote connectors must use auth_ref = "env:NAME". The value points to an environment variable name, not to the credential itself.

Examples:

auth_ref = "env:AWS_PROFILE"
auth_ref = "env:AZURE_STORAGE_CONNECTION_STRING"
auth_ref = "env:GOOGLE_APPLICATION_CREDENTIALS"

The reports deliberately avoid materializing credential values. If a connector contains a raw secret-like value, the facility report marks the catalog invalid.

Evidence Reports

The public checks are contract-first:

uv --preview-features extra-build-dependencies run python tools/data_connector_facility_report.py --compact
uv --preview-features extra-build-dependencies run python tools/data_connector_resolution_report.py --compact
uv --preview-features extra-build-dependencies run python tools/data_connector_health_report.py --compact
uv --preview-features extra-build-dependencies run python tools/data_connector_health_actions_report.py --compact
uv --preview-features extra-build-dependencies run python tools/data_connector_runtime_adapters_report.py --compact

Use the live endpoint smoke check only when you intentionally want to prove the operator-triggered execution path. The default public mode remains network-free.

How To Read The Boundary

  • facility proves the catalog is structurally valid.

  • resolution proves app/page settings can refer to connector IDs while preserving legacy fallback paths.

  • health plans status probes but does not execute them by default.

  • health_actions exposes explicit operator-triggered probe actions.

  • runtime_adapters maps each connector to the dependency and operation a runtime would need when an operator opts in.

This keeps the first adoption path simple: a new user can run AGILAB without cloud credentials, while an operator can still see exactly which connector, dependency, and environment variable will be needed before enabling live access.