Architecture
Inferoa is organized around loop engineering on the tokenmaxxing path:
Agent Harness
The harness owns durable sessions, recursive goals, plans, research state, tool traces, managed resources, recovery, and verification. These are not just UI features; they define the state that long-horizon inference has to preserve.
Goal mode is the loop-engineering surface: it keeps the outcome, horizons, candidate work, evidence, reflection, and completion report active while the agent continues working.
Prefix Cache Discipline
Prefix cache is a core design target. Inferoa protects it with stable prompt epochs, deterministic tool schema ordering, bounded mutable context, and separate warmup versus steady-state cache reporting. See Prefix cache for the prompt-epoch model.
Context Optimization
Context optimization selects what the next turn actually needs. CodeGraph, RTK, summaries, symbols, resources, and tool outputs are used to reduce token waste while preserving the evidence required for accurate coding work. See Context optimization for the lifecycle and defaults.
Intelligent Routing
Routing is part of agent policy. vLLM Semantic Router can choose model paths by
cost, safety, privacy, capability, and session pressure instead of forcing
every turn through the same model. See
Model endpoints for the configuration
shape and the auto mode that delegates routing to vLLM Semantic Router.
Self-Hosted Model Serving
vLLM Engine provides high-performance OpenAI-compatible serving. Inferoa treats usage, cache behavior, endpoint capability, and request metadata as signals the agent can surface and eventually act on.
Multimodal
vLLM Omni extends the self-hosted serving path with image, video, and audio
understanding or generation. Multimodal outputs are tracked as session
artifacts instead of disconnected side calls. See vLLM Omni
for configured endpoints and /doctor for user-facing endpoint health.