1. Claim
Agents can explore many CLIs. Reliable automation needs evidence.
A capable agent can often try --help, run a command, inspect an error, and recover. That does not mean the CLI has a usable agent-facing contract. Exploration consumes tokens, adds latency, and risks side effects. It can also mislead: a failure caused by missing auth, fixtures, project context, or runtime services can look like a command that does not exist.
CLIARE exists to separate "the agent might figure it out" from "the agent has evidence for these navigation capabilities." That difference matters to maintainers who want to improve a CLI and to harness builders who need a shape file before automatic routing.
2. Evidence
Terminal and tool-use research points to interface shape as a real factor.
Terminal-agent benchmarks evaluate agents through execution, not prose comprehension alone. Terminal-Bench 2.0 curates hard command-line tasks with tests. TerminalWorld builds tasks from real terminal recordings. TUA-Bench evaluates routine and scientific terminal workflows. InterCode frames coding work as an action-observation loop over Bash, SQL, and Python environments.
Agent-computer-interface work points in the same direction. SWE-agent shows that a custom interface can materially improve software-agent performance. CodeAct, Toolformer, Gorilla, API-Bank, ToolLLM, AppWorld, and ToolEmu all reinforce the same practical requirement: agents need precise, current, executable tool interfaces with argument constraints, result interpretation, and safety boundaries.
Benchmarks grade outcomes
Runtime behavior is the basis for evaluation, so CLIARE treats runtime behavior as the proof source.
Harnesses change success
A better interface can make the same model more effective, so CLI shape is a product surface.
Tool use has collateral risk
Side effects, secrets, destructive commands, and hidden state must be visible to the harness.
3. Model
CLIARE measures a CLI as a black-box runtime system.
The target CLI is modeled as a process boundary. Given argv, stdin, environment, cwd, and filesystem state, the CLI returns an exit code, stdout, stderr, duration, and possible filesystem deltas. CLIARE records those observations in evidence.jsonl.
From that evidence, CLIARE infers claims about commands, flags, operands, output contracts, preconditions, and gaps. It does not assume the CLI was built with Clap, Cobra, Click, argparse, oclif, or any other framework.
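The process-boundary model can be illustrated with a minimal sketch. It is not CLIARE's implementation: the function names run_probe and append_evidence and every field name in the record are assumptions made for illustration, and filesystem deltas are left out for brevity. The Python interpreter stands in for the target CLI.

```python
import json
import os
import subprocess
import sys
import tempfile
import time


def run_probe(argv, stdin_text="", env=None, cwd=None, timeout=10.0):
    """Run one probe across the process boundary and return an observation.

    The field names below are hypothetical, not CLIARE's evidence schema.
    """
    started = time.monotonic()
    completed = subprocess.run(
        argv,
        input=stdin_text,
        env=env,            # None inherits the caller's environment
        cwd=cwd,            # None inherits the caller's working directory
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    elapsed_ms = (time.monotonic() - started) * 1000
    return {
        "argv": list(argv),
        "cwd": cwd or os.getcwd(),
        "exit_code": completed.returncode,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
        "duration_ms": round(elapsed_ms, 3),
    }


def append_evidence(path, observation):
    """Append one observation to a JSON Lines file, one record per line."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(observation, sort_keys=True) + "\n")


# Demo: probe the interpreter's version flag and record the observation.
with tempfile.TemporaryDirectory() as workdir:
    log_path = os.path.join(workdir, "evidence.jsonl")
    append_evidence(log_path, run_probe([sys.executable, "--version"], cwd=workdir))
    with open(log_path, encoding="utf-8") as handle:
        print(handle.read().strip())
```

Appending one self-contained JSON object per line keeps the log easy to replay: later inference steps can read the file in order without re-running any probe.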
4. Algorithm
The current algorithm is deterministic and evidence-backed.
Bootstrap probes
Seed safe probes such as root help, version, invalid command, and invalid flag.
Layout extraction
Parse semi-structured help-like text for usage lines, aligned command rows, option rows, examples, and output-mode hints.
Claim updates
Apply additive log-odds evidence weights for command, flag, and output-contract claims.
Planner frontier
Schedule bounded follow-up probes for discovered command paths, diagnostics, and output modes.
Artifact projection
Render shape, command index, scorecard, issue ledger, summary, and persona packets from the same evidence set.
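The claim-update step can be sketched as follows. This is a generic illustration of additive log-odds updating, not the shipped model: the signal names, the weights, and the 0.5 prior are all hypothetical values chosen for the example.

```python
import math

# Hypothetical evidence weights in log-odds units. Positive values support
# the claim "this command exists"; negative values count against it.
EVIDENCE_WEIGHTS = {
    "listed_in_root_help": 1.5,
    "own_help_exits_zero": 2.0,
    "rejected_as_unknown_command": -3.0,
}


def logit(probability):
    """Convert a probability in (0, 1) to log-odds."""
    return math.log(probability / (1.0 - probability))


def sigmoid(log_odds):
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def update_claim(prior_probability, observed_signals):
    """Add the weight of each observed signal to the prior log-odds."""
    log_odds = logit(prior_probability)
    for signal in observed_signals:
        log_odds += EVIDENCE_WEIGHTS[signal]
    return sigmoid(log_odds)


# Two supporting observations raise confidence; a contradicting one lowers it.
supported = update_claim(0.5, ["listed_in_root_help", "own_help_exits_zero"])
contested = update_claim(
    0.5, ["listed_in_root_help", "rejected_as_unknown_command"]
)
print(f"supported={supported:.3f} contested={contested:.3f}")
```

Because the weights are summed, the result does not depend on the order in which evidence arrives, which is one way an update rule of this kind can stay deterministic.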
The shipped v0 model lives in crates/cliare-inference/score-models/cliare-score-v0.json. Scorecards embed the model id and hash so a score can be tied to both evidence and model version.
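One way to tie a score to a model version is to fingerprint the model file and store the digest next to the score. The sketch below assumes SHA-256 and invents the scorecard field names, the model id string, and the sample bytes; the source only states that scorecards embed a model id and hash.

```python
import hashlib
import json


def fingerprint(data):
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def build_scorecard(score, model_id, model_bytes, evidence_bytes):
    """Assemble a scorecard that names the model and evidence it came from.

    All field names here are hypothetical, chosen for illustration.
    """
    return {
        "score": score,
        "model": {"id": model_id, "sha256": fingerprint(model_bytes)},
        "evidence_sha256": fingerprint(evidence_bytes),
    }


# Placeholder bytes stand in for the model file and the evidence log.
model_bytes = b'{"weights": {"listed_in_root_help": 1.5}}'
evidence_bytes = b'{"argv": ["tool", "--help"], "exit_code": 0}\n'
card = build_scorecard(72.0, "example-model-v0", model_bytes, evidence_bytes)
print(json.dumps(card, indent=2))
```

If either the model file or the evidence log changes by a single byte, the corresponding digest changes, so two scorecards are comparable only when both digests match.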
5. Score views
CLIARE separates maintainer readiness from harness shape confidence.
The headline score is useful for CI, but the two most important views answer different questions.
How agent-effective is this CLI?
This view summarizes whether the CLI helps or hinders agent use across discovery, grammar, outputs, recovery, preconditions, safety, and execution behavior.
Can a harness rely on the emitted shape?
This view estimates how much automatic routing can depend on the emitted command index before additional verification, fixtures, or human approval.
The current status is experimental_partial. The long-term target is a calibrated expected-utility score.
6. Design signals
Agent-ready CLIs expose evidence for navigation capabilities.
Agents cannot reliably navigate an arbitrary CLI. They can try, but reliability falls when a CLI lacks stable discovery, explicit operands, parseable outputs, actionable errors, and safe diagnostic behavior.
Canonical help everywhere
Every real command should support direct --help without side effects.
Stable usage syntax
Required and optional operands should appear in usage lines and examples.
Parseable result modes
Use --json or --format json for command results agents need to consume.
Actionable diagnostics
Errors for invalid commands, invalid flags, missing operands, missing auth, and missing project context should say what is wrong or missing.
Hidden state is explicit
Auth, project directories, fixtures, network, and daemon requirements should be visible before execution.
Discovery is read-only
Help, version, and diagnostic probes should not create caches, configs, credentials, or other persistent files.
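The read-only discovery signal can be checked from outside the CLI. The sketch below is one possible approach under stated assumptions, not CLIARE's method: it runs a probe with HOME and the working directory pointed at an empty temporary directory and reports any path that appears or changes. The helper names snapshot and changed_paths are hypothetical, and writes outside the sandbox are not detected.

```python
import os
import subprocess
import sys
import tempfile


def snapshot(root):
    """Map every path under root to a marker that changes when it changes."""
    entries = {}
    for directory, subdirs, filenames in os.walk(root):
        for name in subdirs:
            path = os.path.join(directory, name)
            entries[os.path.relpath(path, root)] = "dir"
        for name in filenames:
            path = os.path.join(directory, name)
            info = os.lstat(path)
            entries[os.path.relpath(path, root)] = (info.st_size, info.st_mtime_ns)
    return entries


def changed_paths(argv, timeout=10.0):
    """Run argv in an empty sandbox and list paths it created or modified."""
    with tempfile.TemporaryDirectory() as sandbox:
        # Redirect HOME so config and cache writes land inside the sandbox.
        env = dict(os.environ, HOME=sandbox)
        before = snapshot(sandbox)
        subprocess.run(
            argv, cwd=sandbox, env=env, capture_output=True, timeout=timeout
        )
        after = snapshot(sandbox)
    return sorted(
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )


# The interpreter stands in for a target CLI: one quiet probe, one that writes.
quiet_probe = [sys.executable, "--version"]
noisy_probe = [sys.executable, "-c", "open('cache.txt', 'w').close()"]
print("quiet probe changed:", changed_paths(quiet_probe))
print("noisy probe changed:", changed_paths(noisy_probe))
```

A probe that returns an empty list left no trace in the sandbox; any non-empty result names the files a maintainer would need to investigate.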
Sources
Research used by the CLIARE model and evaluation plan.
Terminal and CLI agents
Terminal-Bench 2.0, TerminalWorld, TUA-Bench, and InterCode.
Software agent interfaces
Tool and API use
ReAct, Toolformer, Gorilla, API-Bank, and ToolLLM.
Safety and dynamic environments
Computer-use benchmarks
AgentBench, WebArena, OSWorld, and OSWorld 2.0.
CLIARE docs
See the repository docs for the computational scoring model, generic inference processor, and citation inventory.