Inferoa

Inferoa is an Inference-native Tokenmaxxing Agent Harness for Loop Engineering. It is built for recursive long-horizon loops in coding and research work across the vLLM ecosystem.

Inference-native here means that Inferoa starts from the inference stack and co-designs loop engineering around tokenmaxxing, which covers prefix-cache discipline, context optimization, intelligent routing through vLLM Semantic Router, high-throughput vLLM serving, vLLM Omni multimodal capability, and RTK/CodeGraph-backed context selection.

Most agents treat inference as a black-box chat API. Inferoa starts from the opposite direction: the agent loop is designed around the optimization surfaces that modern inference systems expose. Long sessions, prefix-cache discipline, context pressure, model routing, self-hosted serving signals, multimodal artifacts, and verification belong to one durable harness.

What Inferoa Coordinates

Loop-driven engineering keeps recursive long-horizon loops, loop tasks, completion decisions, and evidence attached to a durable session.
Plan mode and research loops support approved scope and repeated measurement inside the same long-running loop.
Prefix-cache discipline keeps the stable parts of the prompt stable, so reusable prefixes are not invalidated by avoidable churn.
Context optimization selects the evidence needed for the next turn using summaries, code intelligence, bounded tool output, and RTK.
Intelligent routing can choose model paths by cost, safety, privacy, capability, and current session pressure.
Self-hosted serving uses vLLM Engine-compatible endpoints as first-class inference surfaces instead of opaque chat backends.
Multimodal execution routes image, video, audio, and speech work through vLLM-Omni-compatible endpoints and stores produced media as managed artifacts.
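
As a rough illustration of the prefix-cache discipline described above, the sketch below assembles a prompt so that stable material comes first and volatile material comes last. Every name in it (Message, PromptParts, buildPrompt, sharedPrefixLength) is hypothetical and is not taken from Inferoa's API. Inference engines typically match cached prefixes on tokens, so comparing whole messages here is only a coarse proxy for how much of the prompt stays reusable between turns.

```typescript
// Hypothetical types for illustration only; not Inferoa's actual API.
interface Message {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

interface PromptParts {
  systemPrompt: string;   // stable for the whole session
  toolSchemas: string[];  // stable, but may arrive in any order
  history: Message[];     // append-only conversation history
  volatileNotes: string;  // per-turn status text that changes every turn
}

// Build the prompt with stable content first and volatile content last,
// so that a change in the volatile part cannot invalidate the prefix.
function buildPrompt(parts: PromptParts): Message[] {
  // Sort a copy of the schemas so ordering churn does not alter the prefix.
  const tools = [...parts.toolSchemas].sort().join("\n");
  return [
    { role: "system", content: `${parts.systemPrompt}\n\n${tools}` },
    ...parts.history,
    { role: "user", content: parts.volatileNotes },
  ];
}

// Count how many leading messages two prompts have in common.
function sharedPrefixLength(a: Message[], b: Message[]): number {
  let i = 0;
  while (
    i < a.length &&
    i < b.length &&
    a[i]?.role === b[i]?.role &&
    a[i]?.content === b[i]?.content
  ) {
    i++;
  }
  return i;
}

const firstTurnHistory: Message[] = [
  { role: "user", content: "Fix the failing test." },
];

const turn1 = buildPrompt({
  systemPrompt: "You are a coding agent.",
  toolSchemas: ["read_file", "run_tests"],
  history: firstTurnHistory,
  volatileNotes: "turn 1 status",
});

const turn2 = buildPrompt({
  systemPrompt: "You are a coding agent.",
  toolSchemas: ["run_tests", "read_file"], // different order, same set
  history: [
    ...firstTurnHistory,
    { role: "assistant", content: "Running the test suite." },
  ],
  volatileNotes: "turn 2 status",
});

// The system message and the first history entry are identical in both turns.
console.log(sharedPrefixLength(turn1, turn2));
```

In this sketch the tool list is sorted and the per-turn status text sits at the end, so the two turns still share their leading messages even though the tool order and the status text changed. Had the status text been placed inside the system message, the shared prefix would have dropped to zero.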

Why Coding First

Coding is a high-pressure long-horizon task: large repositories, tool failures, context limits, repeated model calls, and proof through tests all appear in the same workflow. That makes it a strong first domain for co-designing agent behavior with inference behavior.

Documentation Map

Quickstart when you want to run Inferoa.
Architecture for the system model.
Tokenmaxxing, Context optimization, and Prefix cache for the core disciplines.
Model endpoints, vLLM Omni, and Context and RTK for configuration.
Loop mode, Plan mode, Coding workflow, and Daemon runs for long-horizon workflows.
Acceptance, and Evidence and sessions, for release validation.
CLI reference, Slash commands, and Configuration reference when you need exact command or key names.

Current Implementation

Inferoa is a TypeScript and Node.js terminal application. It stores local state under ~/.inferoa/ by default and keeps raw endpoint secrets in the local vault instead of plain configuration files. Node.js 24 or newer is required; the npm package is published as inferoa from the agentic-in organization.
