Skip to main content

Inferoa: Inference-native Tokenmaxxing Agent Harness for Loop Engineering

· 8 min read

Inferoa: Inference-native Tokenmaxxing Agent Harness for Loop Engineering

Most agents call models as if inference were a black box.

The agent loop lives in one place, routing policy in another, serving behavior somewhere else, and context management becomes a last-minute fight with the window. That split is tolerable for one-turn chat. It breaks down when agents run for hours, recover from failures, compress context, warm prefix cache, route between model paths, and still need to prove the work at the end.

Prefix cache stability is ignored. Routing is bolted on later. Context is pasted until it fits. Users pay for that gap.

Inferoa = Infer(Inference-native)o(Tokenmaxxing Loop Engineering)a(Agent Harness).

Inferoa is an Inference-native Tokenmaxxing Agent Harness for Loop Engineering. It is built for recursive long-horizon goals: define the outcome once, then the agent loop keeps inspecting, changing, testing, reflecting, and continuing until the work is proven.

That is what inference-native means here: Inferoa starts from the inference stack and co-designs loop engineering around tokenmaxxing: prefix-cache discipline, context optimization with RTK and CodeGraph, intelligent routing through vLLM Semantic Router, high-throughput vLLM serving with vLLM Engine, vLLM Omni multimodal capability, and native goal, plan, and autoresearch loops with tokenmaxxing observability.

Inferoa welcome session

What Breaks

Long-horizon agents are not one prompt. They are many turns of planning, repo inspection, shell commands, edits, retries, compaction, cache warmup, route selection, and verification. If the harness treats every turn as generic chat traffic, it throws away the optimization surface underneath it.

What breaks when long-horizon agents treat inference as a black box

The failure modes are familiar:

  • prompt shape drifts, so prefix cache cannot be reused reliably;
  • context selection becomes "paste more" instead of "select better";
  • cheap, private, or mechanical turns still take expensive model paths;
  • compression preserves a summary but loses continuity;
  • multimodal work becomes a disconnected side call;
  • serving and cache signals arrive too late to shape the next action.

Inferoa treats those as harness design problems, not analytics problems.

What Changes

Inferoa makes inference behavior visible to the agent loop. The point is not to add another dashboard. The point is to let the runtime choose better prompts, better context, better routes, and better recovery behavior while the task is still running.

What changes when inference signals become native to the agent loop

SurfaceSubstrateWhat Inferoa Makes NativeWhy It Matters
Loop EngineeringInferoa Goal ModeRecursive long-horizon goals, horizons, candidate work, reflection, and completion evidenceThe engineering loop keeps running until the work is proven
Agent HarnessInferoaSessions, tools, plans, autoresearch, resources, recovery, and prefix-cache disciplineLong work gets a durable runtime while preserving reusable prompt prefixes
Context OptimizationCodeGraph, RTKCompression, graph-shaped repo context, bounded tool output, and evidence selectionThe model sees evidence, not raw sprawl
Intelligent RoutingvLLM Semantic RouterModel paths respond to cost, safety, privacy, capability, and session pressureTurns can route between self-hosted vLLM models and external frontier models
Model ServingvLLM Engine, vLLM OmniHigh-throughput, memory-efficient serving and multimodal endpoints stay visible to the harnessSelf-hosted paths make cost, safety, privacy, and data sovereignty controllable when an external frontier model is unnecessary

This is the core design: the agent is not merely calling an inference system. It is shaped by it.

Goal Mode: Loop Engineering For Long-Horizon Work

Prompt engineering improves the next answer. Loop engineering designs the system that keeps deciding what to do after that answer. In Inferoa, /goal is the entry point: it starts a recursive long-horizon goal, expands work through horizons, preserves evidence, and requires reflection before completion.

Inferoa goal mode

Goal Mode is deliberately not just a persistent note in the prompt. It gives the harness a durable outcome, a visible Horizon 0 orientation, a strategy, candidate work, step status, verification evidence, and a completion report. That is the difference between asking an agent for the next step and engineering the loop that keeps taking the next step.

Inferoa At A Glance

Inferoa is a terminal-first harness, but the product surface is not just a shell. It makes long-horizon state visible while the agent works.

Run /goal to start a long-horizon recursive goal. The agent can decompose work, update steps, attach evidence, reflect between horizons, and avoid mistaking an empty checklist for a finished goal.

Plan mode turns ambiguous scope into an inspectable decision. A plan can stay in drafting, move to approval, or become executable context without becoming a hard runtime failure.

Inferoa plan mode

Autoresearch mode makes the evaluation loop native: define the experiment, run the harness, record failures, patch the implementation, and keep the metric trail inside the same session.

Inferoa autoresearch iteration

Tokenmaxxing is the savings ledger for prefix-cache reuse, context optimization, RTK tool-output savings, recent turn usage, and model-selection pressure. This is the place to see whether the harness is actually tokenmaxxing the session, not just reporting token usage after the fact.

Inferoa tokenmaxxing report

The core command surface stays small: /goal for durable objectives, /plan for inspectable scope, /autoresearch for metric-driven iteration, and /tokenmaxxing for the savings ledger across prefix cache, CodeGraph and RTK context savings, recent turn usage, and model-selection cost pressure.

Proof Of Value

The value story is not one benchmark score. It is whether the tokenmaxxing path stays stable, measurable, and cheaper as the horizon grows. The public eval is deliberately split into measured stress runs and calibrated projections: measured runs check runtime invariants and continuity; projections ask what happens if the measured shape is carried to 1k-10k loops.

Key results:

  • Prefix cache and continuity: measured profiles kept one prompt epoch, one tool schema hash, and one cache salt while cache reuse improved after warmup. A 256-turn compression regression preserved continuity markers and archive pointers, and 1k-10k projections were calibrated from measured tail slope instead of claimed as live 10k-request runs.
  • CodeGraph context reduction: CodeGraph-style symbol/range selection saved 80.8% of inspected context.
  • RTK tool-output reduction: RTK command records saved 61.4% of command-token footprint.

Inferoa tokenmaxxing surfaces

  • Routing economics: the Routeworks leaderboard makes the inference-cost tradeoff visible on a log scale. At similar accuracy, routed paths can sit at 1/10 or even 1/100 of a frontier-heavy route's cost.

Routeworks routing leaderboard

The exact numbers will move with workload, model pricing, and local RTK command corpus. The direction is the important part: long-horizon agents need a harness that protects stability, preserves continuity through compression, and uses every inference surface available.

Built With The Inference Stack

Inferoa built with the inference stack

vLLM Ecosystem

Inferoa starts with the vLLM ecosystem because vLLM exposes the right surfaces: serving behavior, routing, multimodal paths, endpoint signals, and prefix-cache economics.

  • vLLM Engine provides high-performance OpenAI-compatible inference and the prefix-cache behavior Inferoa protects across long sessions.
  • vLLM Semantic Router brings model routing into the agent loop so routes can respond to cost, safety, privacy, capability, and session pressure.
  • vLLM Omni brings image, video, and audio understanding or generation into the same durable agent contract.

Context Optimization

Inferoa also uses the context optimization projects that make long-horizon agent loops practical:

  • CodeGraph turns repository context into graph-shaped symbol and range evidence.
  • RTK rewrites command-heavy tool output into compact records that preserve evidence while reducing token pressure.

Inferoa is the harness layer above that stack: the place where long-horizon agent behavior and inference behavior meet.

Try It

npm install -g inferoa@dev
inferoa setup
inferoa

The larger goal is simple: agents should not waste the inference stack they are already paying for. Inferoa makes those signals native to the loop.