1. Claim

Agents can explore many CLIs. Reliable automation needs evidence.

A capable agent can often try --help, run a command, inspect an error, and recover. That does not mean the CLI has a usable agent-facing contract. Exploration consumes tokens, adds latency, risks side effects, and can conflate whether a command exists with whether auth, fixtures, project context, or runtime services are missing.

CLIARE exists to separate "the agent might figure it out" from "the agent has evidence for these navigation capabilities." That difference matters to maintainers who want to improve a CLI and to harness builders who need a shape file before automatic routing.

Human docs: Useful context, but not proof of runtime behavior.
Help text: Evidence for layout, grammar, and discovery, but still not truth.
Runtime evidence: The strongest source for command shape, diagnostics, and side effects.

2. Evidence

Terminal and tool-use research points to interface shape as a real factor.

Terminal-agent benchmarks evaluate agents through execution, not prose comprehension alone. Terminal-Bench 2.0 curates hard command-line tasks with tests. TerminalWorld builds tasks from real terminal recordings. TUA-Bench evaluates routine and scientific terminal workflows. InterCode frames coding work as an action-observation loop over Bash, SQL, and Python environments.

Agent-computer-interface work points in the same direction. SWE-agent shows that a custom interface can materially improve software-agent performance. CodeAct, Toolformer, Gorilla, API-Bank, ToolLLM, AppWorld, and ToolEmu all reinforce the same practical requirement: agents need precise, current, executable tool interfaces with argument constraints, result interpretation, and safety boundaries.

Execution truth

Benchmarks grade outcomes

Runtime behavior is the basis for evaluation, so CLIARE treats runtime behavior as the proof source.

Interface design

Harnesses change success

A better interface can make the same model more effective, so CLI shape is a product surface.

Safety

Tool use has collateral risk

Side effects, secrets, destructive commands, and hidden state must be visible to the harness.

3. Model

CLIARE measures a CLI as a black-box runtime system.

The target CLI is modeled as a process boundary. Given argv, stdin, environment, cwd, and filesystem state, the CLI returns an exit code, stdout, stderr, duration, and possible filesystem deltas. CLIARE records those observations in evidence.jsonl.

\[ M(\mathrm{argv}, \mathrm{stdin}, \mathrm{env}, \mathrm{cwd}, \mathrm{fs\_state}) \rightarrow (\mathrm{exit}, \mathrm{stdout}, \mathrm{stderr}, \Delta_{\mathrm{fs}}, \mathrm{duration}) \]

From that evidence, CLIARE infers claims about commands, flags, operands, output contracts, preconditions, and gaps. It does not assume the CLI was built with Clap, Cobra, Click, argparse, oclif, or any other framework.
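The process-boundary model can be sketched in a few lines of Python. This is an illustration under stated assumptions, not CLIARE's implementation: the function names, the observation field names, and the choice to diff only the working directory are all hypothetical, and the source does not specify the actual evidence.jsonl schema.

```python
import json
import subprocess
import sys
import tempfile
import time
from pathlib import Path
from typing import Optional


def snapshot(root: Path) -> dict:
    """Map each file under root to (size, mtime_ns) so two snapshots can be diffed."""
    return {
        str(p.relative_to(root)): (p.stat().st_size, p.stat().st_mtime_ns)
        for p in root.rglob("*")
        if p.is_file()
    }


def probe(argv: list, cwd: Path, stdin: str = "", env: Optional[dict] = None) -> dict:
    """Run one probe across the process boundary and return an observation."""
    before = snapshot(cwd)
    start = time.monotonic()
    proc = subprocess.run(
        argv, input=stdin, env=env, cwd=str(cwd),
        capture_output=True, text=True, timeout=10,
    )
    duration = time.monotonic() - start
    after = snapshot(cwd)
    return {
        # Field names are assumptions, not the real evidence schema.
        "argv": argv,
        "cwd": str(cwd),
        "exit": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "duration_s": round(duration, 4),
        "fs_delta": {
            "created": sorted(set(after) - set(before)),
            "removed": sorted(set(before) - set(after)),
            "modified": sorted(
                k for k in before.keys() & after.keys() if before[k] != after[k]
            ),
        },
    }


def append_evidence(path: Path, observation: dict) -> None:
    """Append one observation as a single JSON line."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(observation) + "\n")


if __name__ == "__main__":
    # The Python interpreter stands in for the target CLI so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as work, tempfile.TemporaryDirectory() as out:
        obs = probe([sys.executable, "--version"], cwd=Path(work))
        append_evidence(Path(out) / "evidence.jsonl", obs)
        print(obs["exit"], obs["fs_delta"])
```

Nothing in the sketch inspects how the target parses its arguments, which is the point of the black-box framing: the same probe works whatever framework the CLI was built with.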

4. Algorithm

The current algorithm is deterministic and evidence-backed.

01

Bootstrap probes

Seed safe probes such as root help, version, invalid command, and invalid flag.

02

Layout extraction

Parse semi-structured help-like text for usage lines, aligned command rows, option rows, examples, and output-mode hints.

03

Claim updates

Apply additive log-odds evidence weights for command, flag, and output-contract claims.

04

Planner frontier

Schedule bounded follow-up probes for discovered command paths, diagnostics, and output modes.

05

Artifact projection

Render shape, command index, scorecard, issue ledger, summary, and persona packets from the same evidence set.
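Steps 02 and 04 above can be sketched together. This is a minimal illustration, assuming a conventional help layout with "Commands:" and "Options:" headings and two-space column gaps; the sample help text, the `demo` CLI name, the regular expression, and the function names are hypothetical and do not describe CLIARE's actual parser or planner.

```python
import re

# Hypothetical help output for an imaginary CLI named "demo".
HELP_TEXT = """\
Usage: demo [OPTIONS] <COMMAND>

Commands:
  init     Create a new project
  deploy   Deploy the current project

Options:
  -h, --help       Print help
      --json       Emit machine-readable output
"""

# An aligned row: indentation, a name column, a gap of 2+ spaces, a description.
ROW = re.compile(r"^\s{2,}(\S.*?)\s{2,}(\S.*)$")


def extract_layout(text: str) -> dict:
    """Collect usage lines plus aligned command and option rows from help-like text."""
    layout = {"usage": [], "commands": {}, "options": {}}
    section = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("usage:"):
            layout["usage"].append(stripped[len("usage:"):].strip())
        elif stripped.lower() in ("commands:", "options:"):
            section = stripped[:-1].lower()
        elif section and (m := ROW.match(line)):
            layout[section][m.group(1)] = m.group(2)
    return layout


def plan_followups(cli: str, layout: dict, budget: int = 8) -> list:
    """Schedule at most `budget` read-only help probes for discovered commands."""
    frontier = [[cli, name, "--help"] for name in layout["commands"]]
    return frontier[:budget]


if __name__ == "__main__":
    layout = extract_layout(HELP_TEXT)
    print(layout["commands"])
    print(plan_followups("demo", layout))
```

For the sample text, the extracted commands are `init` and `deploy`, and the planned follow-ups are `demo init --help` and `demo deploy --help`. The budget keeps the frontier bounded even when a help screen lists many commands.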

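\[ \ell = \operatorname{logit}(P(C \mid E)) = \operatorname{logit}(p_0) + \sum_i w_i \]
\[ P(C \mid E) = \frac{1}{1 + e^{-\ell}} \]

Here \(p_0\) is the prior probability of claim \(C\), \(w_i\) is the log-odds weight of the \(i\)-th piece of evidence in \(E\), and \(\ell\) is the resulting posterior log-odds.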
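The additive log-odds update above is small enough to write out directly. The sketch below is illustrative only: the prior and the weights are made-up numbers, not values from the shipped score model.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Inverse of logit: map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def posterior(prior: float, weights: list) -> float:
    """Add each evidence weight to the prior log-odds, then convert back."""
    return sigmoid(logit(prior) + sum(weights))


if __name__ == "__main__":
    # Hypothetical claim: "deploy is a real command", starting from an even prior.
    # A weight of log(3) triples the odds, so 1:1 becomes 3:1.
    print(posterior(0.5, [math.log(3)]))
    # Supporting and contradicting evidence combine by simple addition.
    print(posterior(0.2, [1.5, -0.5]))
```

Because the weights are summed, the result does not depend on the order in which evidence arrives, which fits the claim that the algorithm is deterministic.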

The shipped v0 model lives in crates/cliare-inference/score-models/cliare-score-v0.json. Scorecards embed the model id and hash so a score can be tied to both evidence and model version.
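One way to tie a scorecard to a model version is to hash the model file's bytes and store the digest next to the model id. The sketch below assumes SHA-256 and a `model` field with `id` and `hash` keys; the source does not state which hash function or scorecard layout CLIARE uses, so both are assumptions, as is the placeholder model id.

```python
import hashlib


def model_fingerprint(model_bytes: bytes) -> str:
    """Digest of the exact model file contents (SHA-256 is an assumption)."""
    return "sha256:" + hashlib.sha256(model_bytes).hexdigest()


def stamp_scorecard(scorecard: dict, model_id: str, model_bytes: bytes) -> dict:
    """Return a copy of the scorecard that records which model produced it."""
    return {
        **scorecard,
        # Hypothetical field names, not the real scorecard schema.
        "model": {"id": model_id, "hash": model_fingerprint(model_bytes)},
    }


if __name__ == "__main__":
    model_bytes = b'{"weights": {"help_row": 1.2}}'  # stand-in for the model file
    card = stamp_scorecard({"score": 71}, "example-model-id", model_bytes)
    print(card["model"])
```

Any edit to the model file changes the digest, so two scorecards with the same model id but different hashes were not produced by the same weights.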

5. Score views

CLIARE separates maintainer readiness from harness shape confidence.

The headline score is useful for CI, but the two most important views answer different questions.

Maintainer readiness

How agent-effective is this CLI?

This view summarizes whether the CLI helps or hinders agent use across discovery, grammar, outputs, recovery, preconditions, safety, and execution behavior.

Harness shape confidence

Can a harness rely on the emitted shape?

This view estimates how much automatic routing can depend on the emitted command index before additional verification, fixtures, or human approval.

The current status is experimental_partial. The long-term target is a calibrated expected-utility score:

\[ \mathrm{CLIARE\ Score} = 100 \cdot \mathbb{E}[U(\mathrm{agent}, \mathrm{task}, \mathrm{cli}) \mid E] \]
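As a rough illustration of the target formula, the expectation can be estimated by averaging per-task utilities. The sketch assumes each utility lies in [0, 1], so the score lands in [0, 100], and it uses a plain sample mean; neither the utility scale nor the estimator is specified in the source, and the numbers are invented.

```python
from statistics import mean


def cliare_score(utilities: list) -> float:
    """Estimate 100 * E[U] by the sample mean of observed utilities."""
    if not utilities:
        raise ValueError("need at least one utility sample")
    if any(not 0.0 <= u <= 1.0 for u in utilities):
        raise ValueError("utilities are assumed to lie in [0, 1]")
    return 100.0 * mean(utilities)


if __name__ == "__main__":
    # Hypothetical utilities for three agent tasks against one CLI.
    print(cliare_score([1.0, 0.5, 0.0]))  # prints 50.0
```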

6. Design signals

Agent-ready CLIs expose evidence for navigation capabilities.

Agents cannot reliably navigate every CLI. They can try, but reliability falls when a CLI lacks stable discovery, explicit operands, parseable outputs, actionable errors, and safe diagnostic behavior.

Discovery

Canonical help everywhere

Every real command should support direct --help without side effects.

Grammar

Stable usage syntax

Required and optional operands should appear in usage lines and examples.

Output

Parseable result modes

Use --json or --format json for command results agents need to consume.

Recovery

Actionable diagnostics

Invalid commands, flags, missing operands, auth, and context errors should say what is missing.

Preconditions

Hidden state is explicit

Auth, project directories, fixtures, network, and daemon requirements should be visible before execution.

Safety

Discovery is read-only

Help, version, and diagnostic probes should not create caches, configs, credentials, or other persistent files.
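The read-only expectation for discovery can be checked mechanically. The sketch below runs a command in an empty sandbox directory, points `HOME` and the XDG config and cache variables at that sandbox, and reports any files left behind. It is an assumption-laden illustration: the function name is hypothetical, the environment overrides only cover CLIs that honor those variables, and the Python interpreter stands in for a real target CLI.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def discovery_side_effects(argv: list) -> set:
    """Run argv inside an empty sandbox and return the files it created there."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        env = {
            **os.environ,
            # Redirect common per-user write locations into the sandbox.
            "HOME": tmp,
            "XDG_CONFIG_HOME": str(root / ".config"),
            "XDG_CACHE_HOME": str(root / ".cache"),
        }
        subprocess.run(
            argv, cwd=tmp, env=env,
            capture_output=True, text=True, timeout=10,
        )
        return {str(p.relative_to(root)) for p in root.rglob("*") if p.is_file()}


if __name__ == "__main__":
    # A well-behaved "help" probe: prints and exits without writing anything.
    quiet = [sys.executable, "-c", "print('usage: demo')"]
    # A badly behaved probe: writes a cache file as a side effect.
    noisy = [sys.executable, "-c", "open('cache.db', 'w').write('x')"]
    print(discovery_side_effects(quiet))
    print(discovery_side_effects(noisy))
```

An empty result is consistent with read-only discovery but does not prove it, since a CLI can still write outside the sandboxed locations.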

Sources

Research used by the CLIARE model and evaluation plan.