Inferoa
Inferoa is an Inference-native Tokenmaxxing Agent Harness for Loop Engineering. It is built for recursive long-horizon goals in coding and research work across the vLLM ecosystem.
Inference-native means that Inferoa starts from the inference stack and co-designs loop engineering around tokenmaxxing: prefix-cache discipline, context optimization, intelligent routing through vLLM Semantic Router, high-throughput vLLM serving, vLLM-Omni multimodal capability, and RTK/CodeGraph-backed context selection.
Most agents treat inference as a black-box chat API. Inferoa takes the opposite approach: the agent loop is designed around the optimization surfaces that modern inference systems expose. Long sessions, prefix-cache discipline, context pressure, model routing, self-hosted serving signals, multimodal artifacts, and verification all belong to one durable harness.
What Inferoa Coordinates
- Goal-driven loop engineering keeps recursive long-horizon goals, horizons, completion reflection, and evidence attached to a durable session.
- Plan and autoresearch modes support approved scope and repeated measurement inside the same long-running loop.
- Prefix-cache discipline keeps the stable parts of the prompt stable, so reusable prefixes are not invalidated by avoidable churn.
- Context optimization selects the evidence needed for the next turn using summaries, code intelligence, bounded tool output, and RTK.
- Intelligent routing can choose model paths by cost, safety, privacy, capability, and current session pressure.
- Self-hosted serving uses vLLM Engine-compatible endpoints as first-class inference surfaces instead of opaque chat backends.
- Multimodal execution routes image, video, audio, and speech work through vLLM-Omni-compatible endpoints and stores produced media as managed artifacts.
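The prefix-cache discipline described in the list above can be illustrated with a minimal sketch. It assumes a prompt made of a stable section (system prompt and tool list), an append-only history, and a volatile final turn. The types and function names (ToolSpec, PromptParts, buildStablePrefix, buildPrompt, sharedPrefixLength) are hypothetical and are not Inferoa's actual API. Real prefix caches match on tokens rather than characters, so the character count here is only a rough proxy.

```typescript
// Hypothetical types for illustration only; not Inferoa's actual API.
interface ToolSpec {
  name: string;
  description: string;
}

interface PromptParts {
  systemPrompt: string;
  tools: ToolSpec[];
  history: string[]; // prior turns, treated as append-only
  nextTurn: string; // volatile content for the current turn
}

// Serialize the stable section deterministically: tools are sorted by name
// and nothing turn-specific (timestamps, counters) is included, so the same
// inputs always produce the same text.
function buildStablePrefix(systemPrompt: string, tools: ToolSpec[]): string {
  const sortedTools = [...tools].sort((a, b) =>
    a.name < b.name ? -1 : a.name > b.name ? 1 : 0,
  );
  const toolLines = sortedTools.map((t) => `- ${t.name}: ${t.description}`);
  return [systemPrompt, "Tools:", ...toolLines].join("\n");
}

// Order matters: stable prefix first, append-only history next,
// volatile content last.
function buildPrompt(parts: PromptParts): string {
  const prefix = buildStablePrefix(parts.systemPrompt, parts.tools);
  return [prefix, ...parts.history, parts.nextTurn].join("\n");
}

// Number of leading characters two prompts share: a rough, character-level
// proxy for how much of a prompt a prefix cache could reuse.
function sharedPrefixLength(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  return i;
}

// Usage: the second turn passes the tools in a different order and extends
// the history, yet the first prompt remains an exact prefix of the second.
const tools: ToolSpec[] = [
  { name: "read_file", description: "Read a file from the workspace" },
  { name: "grep", description: "Search file contents" },
];
const turn1 = buildPrompt({
  systemPrompt: "You are a coding agent.",
  tools,
  history: [],
  nextTurn: "user: list failing tests",
});
const turn2 = buildPrompt({
  systemPrompt: "You are a coding agent.",
  tools: [...tools].reverse(),
  history: ["user: list failing tests", "assistant: 2 tests fail"],
  nextTurn: "user: fix the first one",
});
console.log(turn2.startsWith(turn1)); // true
```

The design choice worth noting is ordering: anything that changes per turn goes last, and anything unordered (such as a tool set) is serialized in a fixed order, so avoidable churn never reaches the reusable prefix.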
Why Coding First
Coding is a high-pressure long-horizon task: large repositories, tool failures, context limits, repeated model calls, and proof through tests all appear in the same workflow. That makes it a strong first domain for co-designing agent behavior with inference behavior.
Documentation Map
- Quickstart when you want to run Inferoa.
- Architecture for the system model.
- Tokenmaxxing, Context optimization, and Prefix cache for the core disciplines.
- Model endpoints, vLLM Omni, and Context and RTK for configuration.
- Goal mode, Plan mode, Autoresearch mode, Coding workflow, and Daemon runs for long-horizon workflows.
- Acceptance, and Evidence and sessions, for release validation.
- CLI reference, Slash commands, and Configuration reference when you need exact command or key names.
Current Implementation
Inferoa is a TypeScript and Node.js terminal application. It stores local state
under ~/.inferoa/ by default and keeps raw endpoint secrets in the local vault
instead of plain configuration files. Node.js 24 or newer is required; the npm
package is published as inferoa from the agentic-in organization.
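
The two facts above (the Node.js 24 minimum and the ~/.inferoa/ default state directory) can be checked from a script with a minimal sketch like the one below. This is not Inferoa's startup code. The names MIN_NODE_MAJOR, meetsNodeRequirement, and defaultStateDir are hypothetical, and the sketch assumes the default location has not been overridden by configuration.

```typescript
import { homedir } from "node:os";
import { join } from "node:path";

// Minimum Node.js major version, taken from the requirement stated above.
const MIN_NODE_MAJOR = 24;

// Returns true when the given version string (for example "24.1.0") has a
// major version of at least MIN_NODE_MAJOR. Defaults to the running Node.js.
function meetsNodeRequirement(version: string = process.versions.node): boolean {
  const [major = ""] = version.split(".");
  // Unparseable input yields NaN, and NaN >= 24 is false.
  return Number.parseInt(major, 10) >= MIN_NODE_MAJOR;
}

// Resolves the default state directory, ~/.inferoa/. The home directory is
// injectable so the function can be tested without touching the real one.
function defaultStateDir(home: string = homedir()): string {
  return join(home, ".inferoa");
}

if (!meetsNodeRequirement()) {
  console.warn(
    `Node.js ${MIN_NODE_MAJOR} or newer is required; found ${process.versions.node}`,
  );
}
console.log(`Default state directory: ${defaultStateDir()}`);
```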