Inferoa: Inference-native Tokenmaxxing Agent Harness for Loop Engineering

2026-06-08

The most interesting agent work is moving from better prompts to better loops.

Loop Engineering means giving the model a goal, feedback, verification, memory, and tools, then letting it self-correct until the work is proven. Primitives like /goal, rubric-driven outcomes, verifier sub-agents, and memory-backed sessions matter because they move the work from "prompt the next answer" to "design the system that keeps improving."

That loop is also an inference workload. As turns accumulate, prompt prefixes drift, cache reuse collapses, stale evidence fills context, model routing gets harder, and serving choices start to matter.

That is where Loop Engineering has to become inference-native. A long-horizon loop needs to see the substrate it is consuming: tokens, cache, context, routes, endpoints, and model capacity. Tokenmaxxing is the discipline of keeping those surfaces explicit so every horizon can reuse, compress, route, and recover instead of sending another blind chat turn.

That is the gap Inferoa is built around. The name is deliberately literal:

Inferoa = Infer (Inference-native) + o (Tokenmaxxing Loop Engineering) + a (Agent Harness).

Inferoa is an Inference-native Tokenmaxxing Agent Harness for Loop Engineering. It brings the pieces a serious loop needs into one runtime: goal/rubric feedback, verification evidence, memory and context control, prefix-cache discipline, intelligent routing through vLLM Semantic Router, high-throughput serving with vLLM Engine, vLLM Omni multimodal capability, and tokenmaxxing observability across every turn.

What Breaks

Long-horizon agents are not one prompt. They are loops: plan, act, observe, verify, remember, and decide whether to continue. If the runtime treats every turn as generic chat traffic, it loses both sides of the optimization surface: the feedback that drives self-correction and the inference signals that keep the workload efficient.
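As a rough illustration of that loop shape, here is a minimal Python sketch. It is not Inferoa's implementation: `LoopState`, `run_loop`, and the toy `act` and `verify` callables are hypothetical names chosen for this example, and planning, acting, and observing are collapsed into a single call for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopState:
    goal: str
    memory: List[str] = field(default_factory=list)  # notes carried across turns
    turns: int = 0
    proven: bool = False


def run_loop(
    state: LoopState,
    act: Callable[[LoopState], str],
    verify: Callable[[str], bool],
    max_turns: int = 10,
) -> LoopState:
    """Act, verify, remember, then decide whether to continue."""
    while state.turns < max_turns and not state.proven:
        state.turns += 1
        observation = act(state)      # plan + act + observe, collapsed here
        passed = verify(observation)  # independent check, not self-critique
        verdict = "pass" if passed else "fail"
        state.memory.append(f"turn {state.turns}: {observation} -> {verdict}")
        state.proven = passed         # stop only when the verifier accepts
    return state


# Toy stand-ins: the "agent" reports its turn number, the verifier wants >= 3.
def toy_act(state: LoopState) -> str:
    return str(state.turns)


def toy_verify(observation: str) -> bool:
    return int(observation) >= 3


final = run_loop(LoopState(goal="reach 3"), toy_act, toy_verify)
print(final.turns, final.proven)  # prints "3 True"
```

The design point is that the stop condition belongs to the verifier and the turn budget, not to the model's own opinion of its answer.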

The failure modes are familiar:

the goal is present, but the feedback loop is too weak to drive correction;
grading is collapsed into self-critique instead of independent evidence;
memory becomes a folder of notes rather than a reusable outer loop;
prompt shape drifts, so prefix cache cannot be reused reliably;
context selection becomes "paste more" instead of "select better";
cheap, private, or mechanical turns still take expensive model paths;
serving and cache signals arrive too late to shape the next action.

These are runtime design problems, not analytics problems.
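Prompt-shape drift is the easiest of these to make concrete. The sketch below assumes that a stable prefix means a byte-stable system prompt, tool schema, and cache salt, plus append-only message history. The function names are hypothetical and are not part of Inferoa or vLLM.

```python
import hashlib
import json
from typing import Any, Dict, List


def prefix_fingerprint(
    system_prompt: str, tool_schemas: List[Dict[str, Any]], cache_salt: str
) -> str:
    """Hash everything that must stay stable for the prefix to be reusable."""
    canonical = json.dumps(
        {"system": system_prompt, "tools": tool_schemas, "salt": cache_salt},
        sort_keys=True,  # dict key order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def common_prefix_len(previous: List[str], current: List[str]) -> int:
    """Number of leading messages shared by two consecutive prompts."""
    n = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        n += 1
    return n


tools_a = [{"name": "read_file", "parameters": {"path": "string"}}]
tools_b = [{"parameters": {"path": "string"}, "name": "read_file"}]  # same schema, keys reordered

fp1 = prefix_fingerprint("You are a coding agent.", tools_a, "session-1")
fp2 = prefix_fingerprint("You are a coding agent.", tools_b, "session-1")
fp3 = prefix_fingerprint("You are a coding agent. Time: 12:01", tools_a, "session-1")

prev = ["system", "user: fix the bug", "tool: ls output"]
appended = prev + ["assistant: reading main.py"]
rewritten = ["system (edited)"] + prev[1:] + ["assistant: reading main.py"]

print(fp1 == fp2, fp1 == fp3)              # prints "True False"
print(common_prefix_len(prev, appended))   # prints "3"
print(common_prefix_len(prev, rewritten))  # prints "0"
```

A timestamp in the system prompt or an edit to an early message is enough to change the fingerprint or drop the shared prefix to zero, which is the kind of drift a harness would want to detect before the next turn is sent.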

What Changes

Inferoa makes inference behavior visible while the loop is still running. The point is not to add another dashboard. The point is to let the runtime choose better prompts, better context, better routes, and better recovery behavior before the next turn is sent.

| Surface | Substrate | What Inferoa Makes Native | Why It Matters |
| --- | --- | --- | --- |
| Loop Engineering | Inferoa Goal Mode | Recursive long-horizon goals, horizons, candidate work, reflection, and completion evidence | The engineering loop keeps running until the work is proven |
| Agent Harness | Inferoa | Sessions, tools, plans, autoresearch, resources, recovery, and prefix-cache discipline | Long work gets a durable runtime while preserving reusable prompt prefixes |
| Context Optimization | CodeGraph, RTK | Compression, graph-shaped repo context, bounded tool output, and evidence selection | The model sees evidence, not raw sprawl |
| Intelligent Routing | vLLM Semantic Router | Model paths respond to cost, safety, privacy, capability, and session pressure | Turns can route between self-hosted vLLM models and external frontier models |
| Model Serving | vLLM Engine, vLLM Omni | High-throughput, memory-efficient serving and multimodal endpoints stay visible to the harness | Self-hosted paths make cost, safety, privacy, and data sovereignty controllable when an external frontier model is unnecessary |

This is the core design: the agent is not merely calling an inference system; the loop is shaped by it.

Goal Mode: Loop Engineering For Long-Horizon Work

Prompt engineering improves the next answer. Loop engineering designs the system that decides what to do after that answer. In Inferoa, /goal is the entry point: it starts a recursive long-horizon loop, expands work through horizons, preserves evidence, uses reflection as a checkpoint, and requires proof before completion.

Goal Mode is deliberately not just a persistent note in the prompt. It gives the runtime a durable outcome, a visible Horizon 0 orientation, a strategy, candidate work, step status, verifier-ready evidence, reflection decisions, and a completion report. That is the difference between asking for the next step and engineering the loop that can keep going.
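One way to picture "proof before completion" is a goal record whose completion check looks at evidence, not just at checkboxes. The sketch below is an assumption-laden illustration: `Goal`, `Step`, and `can_complete` are hypothetical names, not Inferoa's data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    title: str
    status: str = "pending"  # "pending" or "done"
    evidence: List[str] = field(default_factory=list)  # e.g. test logs, diffs


@dataclass
class Goal:
    outcome: str
    steps: List[Step] = field(default_factory=list)

    def can_complete(self) -> bool:
        """Require at least one step, all done, each with evidence attached."""
        if not self.steps:
            return False  # an empty checklist is not a finished outcome
        return all(s.status == "done" and bool(s.evidence) for s in self.steps)


goal = Goal(outcome="test suite passes on main")
print(goal.can_complete())  # prints "False": nothing has been planned yet

step = Step(title="fix flaky retry test")
goal.steps.append(step)
step.status = "done"
print(goal.can_complete())  # prints "False": done, but no evidence

step.evidence.append("pytest run: all tests passed")
print(goal.can_complete())  # prints "True"
```

The completion gate is deliberately stricter than the step status: a step marked done without evidence still blocks the goal.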

Inferoa At A Glance

The product surface is terminal-first, but it is not just a shell. Each mode exposes a different part of the loop while the agent works.

Run /goal to start a long-horizon recursive goal. The agent can decompose work, update steps, attach evidence, reflect between horizons, and avoid mistaking an empty checklist for a finished outcome.

Plan mode turns ambiguous scope into an inspectable decision. A plan can stay in drafting, move to approval, or become executable context without blocking the runtime on process overhead.

Autoresearch mode makes the evaluation loop native: define the experiment, run the harness, record failures, patch the implementation, and keep the metric trail in the same session.

Tokenmaxxing mode is the savings ledger for prefix-cache reuse, context optimization, RTK tool-output savings, recent turn usage, and model-selection pressure. It shows whether the loop is actually becoming more efficient, not just how many tokens were spent.

The command surface stays small: /goal for durable objectives, /plan for inspectable scope, /autoresearch for metric-driven iteration, and /tokenmaxxing for the savings ledger across prefix cache, CodeGraph and RTK context savings, recent turn usage, and model-selection cost pressure.
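A savings ledger of this kind can be sketched as a per-surface tally of baseline tokens against tokens actually sent. The class and surface names below are hypothetical, and the token counts are illustrative only; nothing here reflects how /tokenmaxxing is implemented.

```python
from collections import defaultdict
from typing import Dict


class SavingsLedger:
    """Accumulates baseline vs. actual token counts per surface."""

    def __init__(self) -> None:
        self.baseline: Dict[str, int] = defaultdict(int)
        self.actual: Dict[str, int] = defaultdict(int)

    def record(self, surface: str, baseline_tokens: int, actual_tokens: int) -> None:
        self.baseline[surface] += baseline_tokens
        self.actual[surface] += actual_tokens

    def saved_fraction(self, surface: str) -> float:
        """Fraction of baseline tokens avoided; 0.0 if nothing was recorded."""
        base = self.baseline[surface]
        if base == 0:
            return 0.0
        return 1.0 - self.actual[surface] / base


ledger = SavingsLedger()
# Illustrative numbers only, not measurements.
ledger.record("tool_output", baseline_tokens=1000, actual_tokens=400)
ledger.record("tool_output", baseline_tokens=500, actual_tokens=200)
ledger.record("repo_context", baseline_tokens=2000, actual_tokens=500)

for surface in ("tool_output", "repo_context"):
    print(surface, f"{ledger.saved_fraction(surface):.1%}")
```

Tracking the baseline alongside the actual count is what separates a savings ledger from a usage meter: it answers "how much did we avoid sending" instead of only "how much did we send".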

Proof Of Value

The value story is not one benchmark score. It is whether the tokenmaxxing path stays stable, measurable, and cheaper as the horizon grows. The public eval is split into measured stress runs and calibrated projections: measured runs check runtime invariants and continuity; projections ask what happens if the measured shape is carried to 1k-10k loops.

Key results:

Prefix cache and continuity: measured profiles kept one prompt epoch, one tool schema hash, and one cache salt while cache reuse improved after warmup. A 256-turn compression regression preserved continuity markers and archive pointers, and 1k-10k projections were calibrated from measured tail slope instead of claimed as live 10k-request runs.
CodeGraph context reduction: CodeGraph-style symbol/range selection saved 80.8% of inspected context.
RTK tool-output reduction: RTK command records saved 61.4% of command-token footprint.

Routing economics: the Routeworks leaderboard makes the inference-cost tradeoff visible on a log scale. At similar accuracy, routed paths can sit at 1/10 or even 1/100 of a frontier-heavy route's cost.

The exact numbers will move with workload, model pricing, and local RTK command corpus. The direction is the important part: long-horizon loops need a runtime that protects stability, preserves continuity through compression, and uses every inference surface available.
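The post does not spell out how the projections were calibrated, so the following is only one plausible reading of "tail slope": fit a straight line to the last few measured points and extend it. The function names and the token series are made up for illustration and are not taken from the eval.

```python
from typing import Sequence


def tail_slope(values: Sequence[float], window: int) -> float:
    """Least-squares slope of the last `window` points against their index."""
    tail = list(values[-window:])
    n = len(tail)
    if n < 2:
        raise ValueError("need at least two points to estimate a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(tail) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(tail))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def project(values: Sequence[float], window: int, target_len: int) -> float:
    """Extend the series linearly from its last point out to `target_len` points."""
    extra = target_len - len(values)
    if extra < 0:
        raise ValueError("target_len is shorter than the measured series")
    return values[-1] + tail_slope(values, window) * extra


# Made-up cumulative token counts: a warmup, then a steady 50 tokens per turn.
cumulative = [100, 180, 250, 310, 360, 410, 460, 510]

print(tail_slope(cumulative, window=4))                # prints "50.0"
print(project(cumulative, window=4, target_len=1000))  # prints "50110.0"
```

Using only the tail keeps the warmup phase out of the estimate, which matters when cache reuse improves after the first few turns. A linear extension is still an assumption about the future shape, which is why such numbers are projections and not measurements.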

Built With The Inference Stack

vLLM Ecosystem

Inferoa starts with the vLLM ecosystem because vLLM exposes the right surfaces: serving behavior, routing, multimodal paths, endpoint signals, and prefix-cache economics.

vLLM Engine provides high-performance OpenAI-compatible inference and the prefix-cache behavior Inferoa protects across long sessions.
vLLM Semantic Router brings model routing into the agent loop so routes can respond to cost, safety, privacy, capability, and session pressure.
vLLM Omni brings image, video, and audio understanding or generation into the same durable agent contract.
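To make the routing idea concrete, here is a toy rule-based policy over the signals named above. It is not the vLLM Semantic Router API or its algorithm: the `Turn` fields, route labels, and the 32,000-token threshold are all assumptions invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    contains_private_data: bool
    needs_frontier_capability: bool
    is_mechanical: bool  # e.g. renaming, formatting, log summarization
    context_tokens: int


def choose_route(turn: Turn, local_context_limit: int = 32_000) -> str:
    """Return a hypothetical route label; rules are checked in priority order."""
    if turn.contains_private_data:
        return "self-hosted"  # privacy outranks every other signal
    if turn.is_mechanical and turn.context_tokens <= local_context_limit:
        return "self-hosted-small"  # cheap path for mechanical work
    if turn.needs_frontier_capability or turn.context_tokens > local_context_limit:
        return "external-frontier"
    return "self-hosted"


private = Turn(True, True, False, 50_000)
rename = Turn(False, False, True, 2_000)
hard = Turn(False, True, False, 8_000)
plain = Turn(False, False, False, 8_000)

for turn in (private, rename, hard, plain):
    print(choose_route(turn))
```

Ordering the rules by priority means a private turn stays on the self-hosted path even when it would otherwise qualify for a frontier model. A real router would weigh these signals with far more nuance than a fixed rule chain.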

Context Optimization

Inferoa also uses the context optimization projects that make long-horizon loops practical:

CodeGraph turns repository context into graph-shaped symbol and range evidence.
RTK rewrites command-heavy tool output into compact records that preserve evidence while reducing token pressure.
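The simplest form of bounded tool output is to keep the head and tail of a command's output and record how much was dropped. The sketch below shows only that idea; it is not how RTK rewrites output, and `bound_output` is a hypothetical helper.

```python
from typing import List


def bound_output(text: str, head: int = 5, tail: int = 5) -> str:
    """Keep the first and last lines of command output and count the rest."""
    lines: List[str] = text.splitlines()
    if len(lines) <= head + tail:
        return text  # already small enough, pass through unchanged
    omitted = len(lines) - head - tail
    marker = f"[... {omitted} lines omitted ...]"
    # Slice from an explicit index so tail=0 keeps no trailing lines.
    kept = lines[:head] + [marker] + lines[len(lines) - tail:]
    return "\n".join(kept)


raw = "\n".join(f"line {i}" for i in range(100))
compact = bound_output(raw, head=2, tail=2)
print(compact)
```

The omission marker matters as much as the truncation: it tells the model that evidence was dropped and how much, so it can ask for the full output when the kept lines are not enough.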

Inferoa is the harness layer above that stack: the place where long-horizon agent behavior and inference behavior meet.

Try It

npm install -g inferoa@dev
inferoa setup
inferoa

The larger goal is simple: agents should not waste the inference stack they are already paying for. Inferoa makes those signals native to the loop.
