Chariot

// How it works

From a CLI to a fleet of agents, in five steps.

Configure the CLI, wire up the agent's data and skills, prove quality with evals, and deploy. From there the Command Center runs the whole fleet: every agent observable, recoverable, and updatable.

01// Configure

Install the CLI.
Point it at your stack.

One CLI is the control surface for everything that follows. Install it, log in, and register the endpoints your agents authenticate and talk through — the same endpoints they'll use in production.

Authenticated in one step — no keys to copy-paste.
Register your auth and delivery endpoints once; every command reuses them.
Works the same on a laptop and in CI.
chariot ~ configure
authenticated
$ brew install chariot
$ chariot login
authenticated as platform@awesomeapp.io
$ chariot endpoints set \
    --auth "https://awesomeapp.io/auth" \
    --deliver "https://awesomeapp.io/chariot"
endpoints registered · ready
02// Setup

Wire up data, tools,
and the agent itself.

An agent is its instructions, its skills, and the data it can reach. Define all three in version-controlled files — global data shared across the fleet, per-user data scoped to each subject, plus the skills and AGENTS.md that shape behavior.

Chariot auth — scoped tokens, per subject, never shared.
An MCP for global data, and a per-user MCP link for private data.
Skills and an AGENTS.md that define how the agent works.
agent ~ setup
4 parts
global MCP · shared
user MCP · per-subject
AGENTS.md    agent instructions
skills/      refund.skill.md · escalate.skill.md
mcp.global   catalog · pricing (shared)
mcp.user     {{subject}} orders · prefs
auth         chariot oauth · scoped tokens
Global data is shared; per-user data never crosses subjects.
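One way to picture that boundary is the minimal Python sketch below. It is an illustration only, not Chariot's API: `ScopedToken`, `read_global`, `read_user`, the second subject, and the sample values are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical in-memory stand-ins for the two kinds of data in the panel above.
# The sample values are made up for illustration.
GLOBAL_DATA = {"catalog": ["widget", "gadget"], "pricing": {"widget": 19.0}}
USER_DATA = {
    "user_8821": {"orders": ["#4471"], "prefs": {"currency": "USD"}},
    "user_1002": {"orders": ["#9001"], "prefs": {"currency": "EUR"}},
}


@dataclass(frozen=True)
class ScopedToken:
    """Hypothetical token bound to exactly one subject."""
    subject: str


def read_global(name: str):
    # Global data is shared: any agent may read it, no token needed here.
    return GLOBAL_DATA[name]


def read_user(token: ScopedToken, subject: str, name: str):
    # Per-user data is readable only when the token's subject matches.
    if token.subject != subject:
        raise PermissionError(f"token for {token.subject} cannot read {subject}")
    return USER_DATA[subject][name]
```

With a token for `user_8821`, `read_user(token, "user_8821", "orders")` succeeds, while the same call for `user_1002` raises `PermissionError`. The point of the sketch is that the check lives in the data layer, so an agent's instructions cannot talk their way across subjects.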
03// Create evals

Prove it before
it ships.

An eval is a prompt plus what you expect back. Check the output objectively with a grep, subjectively with an LLM judge, and assert the exact tool calls the agent should make. Run it against faked MCP fixtures or your real MCP — your choice, per eval.

Expected output, checked by grep (objective) or LLM judge (subjective).
Assert the tool calls the agent must — and must not — make.
Test data from a fixture (faked MCP) or against your real MCP.
evals/refund-request.eval
1 scenario
prompt
"Customer asks for a refund on order #4471"
expect
tool_calls   lookup_order, issue_refund
grep         /refund of \$\d+/ · objective
judge        "polite, confirms amount" · llm
fixtures     mock-mcp · orders.fixture.json
31 / 31 passing · quality 98.7
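The objective half of an eval like the one above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the Chariot eval runner: `EvalCase`, `AgentRun`, and `check` are hypothetical names, and the LLM-judge check is left out because it needs a model call.

```python
import re
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """Hypothetical in-memory form of an eval: a prompt plus expectations."""
    prompt: str
    required_tools: list[str]
    forbidden_tools: list[str] = field(default_factory=list)
    grep: str = ""  # regular expression the output must match


@dataclass
class AgentRun:
    """Hypothetical record of what the agent did for one prompt."""
    output: str
    tool_calls: list[str]


def check(case: EvalCase, run: AgentRun) -> list[str]:
    """Return failure messages; an empty list means every check passed."""
    failures = []
    # Tool calls the agent must make.
    for tool in case.required_tools:
        if tool not in run.tool_calls:
            failures.append(f"missing tool call: {tool}")
    # Tool calls the agent must not make.
    for tool in case.forbidden_tools:
        if tool in run.tool_calls:
            failures.append(f"forbidden tool call: {tool}")
    # Objective output check: a plain regex search over the reply.
    if case.grep and not re.search(case.grep, run.output):
        failures.append(f"output does not match /{case.grep}/")
    return failures
```

A run that calls `lookup_order` and `issue_refund` and replies with "a refund of $42" passes the scenario above. A run that only looks up the order and replies without an amount fails twice: once for the missing tool call and once for the grep.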
04// Deploy

One line.
The whole fleet.

When the evals pass, ship. A single command provisions every agent warm, opens a two-way channel to your backend, and hands you a live dashboard — no queue, no cold start, no per-agent setup.

Thousands of agents live the moment the command returns.
Your backend connected from line one — fully bidirectional.
The Command Center spins up with the same command.
chariot ~ deploy
$ chariot deploy --count 10000 --endpoint "https://awesomeapp.io/chariot" --token-seed "ts_1234567890abcdef"
10,000 agents live · channel open · dashboard ready
05// Command Center

The operator layer for thousands of persistent agents.

Once agents are live, the Command Center is where you run them — the control plane that makes a persistent fleet observable, recoverable, and updatable, without treating each agent like a bespoke snowflake.

// tracked for every agent
identity · status · owner / subject · runtime version · wake / sleep · recent tasks · tool calls · delivery events · latency · failures · cost
// at fleet scale, the control plane answers
Which agents are healthy?
Which agents are sleeping?
Which agents are stuck?
Which agents are expensive?
Which agents are safe to roll out?
Investigate no-reply and failed-tool cases — replay what happened.
Compare versions, stage changes, and run eval-backed rollouts.
Make targeted or fleet-wide updates safely.
command/fleet.live
live
agents      10,000 · 99.1% healthy · 71 sleeping
agent #a4471 · no-reply
subject     user_8821
version     v2.3
last task   refund #4471
latency     1.2s
last tool   issue_refund ✕
cost / mo   $2.84
> rollout v2.4 → fleet --require evals · STAGE
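The fleet-scale questions above reduce to filters over per-agent records. Here is a minimal Python sketch, assuming a hypothetical `AgentRecord` with a handful of the tracked fields. The status values, helper names, and the two extra agents are illustrative, not the Command Center's data model.

```python
from dataclasses import dataclass


@dataclass
class AgentRecord:
    """Hypothetical per-agent record holding a few of the tracked fields."""
    agent_id: str
    status: str            # assumed values: "healthy", "sleeping", "no-reply"
    version: str
    latency_s: float
    cost_per_month: float


def with_status(fleet: list[AgentRecord], status: str) -> list[AgentRecord]:
    # "Which agents are healthy / sleeping / stuck?"
    return [a for a in fleet if a.status == status]


def over_budget(fleet: list[AgentRecord], monthly_limit: float) -> list[AgentRecord]:
    # "Which agents are expensive?"
    return [a for a in fleet if a.cost_per_month > monthly_limit]


def healthy_share(fleet: list[AgentRecord]) -> float:
    # Fraction of the fleet reporting healthy; 0.0 for an empty fleet.
    if not fleet:
        return 0.0
    return len(with_status(fleet, "healthy")) / len(fleet)


fleet = [
    AgentRecord("a4471", "no-reply", "v2.3", 1.2, 2.84),  # values from the panel above
    AgentRecord("a0001", "healthy", "v2.3", 0.4, 1.10),   # made-up example agent
    AgentRecord("a0002", "sleeping", "v2.3", 0.0, 0.05),  # made-up example agent
]
```

Here `with_status(fleet, "no-reply")` surfaces agent `a4471`, and `over_budget(fleet, 2.0)` flags the same agent on cost.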
Observable
Every agent is debuggable without SSH-ing into a pod — identity, state, recent tasks, tool calls, and cost, all on one pane.
Recoverable
Find the no-reply and failed-tool cases, replay what happened, and bring wedged or sleeping agents back without touching the box.
Updatable
Compare versions, stage a change, prove it with an eval-backed rollout, then ship it to one agent or the whole fleet — safely.
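An eval-backed rollout is, at its core, a gate in front of a version change. The Python sketch below shows that idea only. `rollout`, its parameters, and the second agent ID are hypothetical, and a real rollout would also need staging and rollback, which are omitted here.

```python
from typing import Callable, Optional


def rollout(
    versions: dict[str, str],
    new_version: str,
    evals_pass: Callable[[str], bool],
    targets: Optional[list[str]] = None,
) -> list[str]:
    """Update agents to new_version only if its evals pass.

    versions maps agent id -> current version and is updated in place.
    targets=None means the whole fleet; otherwise only the listed agents.
    Returns the ids that were updated (empty if the gate blocked the change).
    """
    # The gate: a failing eval run leaves every agent untouched.
    if not evals_pass(new_version):
        return []
    if targets is None:
        ids = list(versions)
    else:
        ids = [agent_id for agent_id in targets if agent_id in versions]
    for agent_id in ids:
        versions[agent_id] = new_version
    return ids
```

Passing `targets=["a4471"]` updates a single agent, and omitting `targets` updates the whole fleet. In both cases nothing changes unless `evals_pass` returns `True` for the new version.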

Five steps from your stack to a fleet you can run.

Configure, set up, prove with evals, deploy — and operate every agent from one console. See it on your data, with your tools.