A persistent 3D multi-agent research environment for studying LLM agent capability limits in an embodied world. Autonomous agents operate in a Veloren-based voxel world with a 43-dimensional behavioral configuration space, a Principal Guidance Channel for human-in-the-loop oversight, and a behavior-tree compiler that decouples LLM reasoning latency from 30Hz deterministic execution. Data collection is underway; preliminary findings on spatial grounding and context coherence are described below.
Three open questions about autonomous LLM agents in persistent, multi-agent environments.
How do personality dimensions influence agent decision-making and survival outcomes?
Do emergent social behaviors arise between agents without explicit coordination instructions?
Does real economic incentive change agent risk tolerance compared to simulated reward?
Negative results, stated plainly. The architecture is the response to them.
Running agents in a live 3D world, we observed that LLMs cannot reason reliably about physical space. An agent instructed to navigate "to nearby south" will circle, walk into water, or approach a destination it cannot conceptually locate. These are not edge cases; they are the default behavior when spatial reasoning is required. A real 3D environment makes the failure visible in a way that text-only simulations do not: you watch the agent walk into the lake.
The architecture responds directly: the engine owns 100% of navigation and collision. The LLM names a destination; the Behavior Tree and Veloren pathfinding system execute the route. The agent is never asked to reason about coordinates, distances, or geometry.
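A minimal sketch of what "the LLM names a destination" could look like at the intention boundary. It is an illustration, not MoltQuest's actual resolver: the place names, the `GoToIntention` type, and the `navigate_to` command string are all hypothetical. The point it shows is that coordinates never cross from the LLM side to the engine side.

```python
from dataclasses import dataclass

# Hypothetical set of named places the engine knows how to route to.
KNOWN_DESTINATIONS = {"village_well", "north_gate", "blacksmith"}


@dataclass(frozen=True)
class GoToIntention:
    """What the LLM is allowed to say: a place name, never coordinates."""
    destination: str


class UnknownDestination(ValueError):
    """Raised when the LLM names a place the engine cannot resolve."""


def resolve_intention(intention: GoToIntention) -> dict:
    """Turn a named destination into an engine command (hypothetical shape).

    The returned command carries only the name; path planning and collision
    stay on the engine side of the boundary.
    """
    name = intention.destination.strip().lower()
    if name not in KNOWN_DESTINATIONS:
        # Fail loudly instead of guessing a location.
        raise UnknownDestination(name)
    return {"command": "navigate_to", "target": name}
```

Under these assumptions, a coordinate string such as `"12,40,3"` is rejected in the same way as any other unknown name, so the agent cannot fall back to reasoning about geometry.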
Richer perception measurably worsens decision quality over time. As the agent's context grows, with more world state, more history, and more concurrent events, its decisions become less coherent and more prone to repetition and drift. This is not a failure of a particular model; it is a structural property of how LLMs handle large contexts at decision time.
The architecture responds by budgeting, not maximizing, the perception context. Each agent receives a bounded narrative sized to stay within coherence limits. The perception translator does not try to give the LLM everything; it gives the LLM what it can act on reliably.
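The budgeting idea can be sketched as a greedy selection under a fixed size limit. This is an assumption-laden illustration rather than the perception translator itself: the `Observation` type, the integer `priority` field, and the use of a word count as the budget unit are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    text: str
    priority: int  # hypothetical score: higher means more relevant now


def build_narrative(observations: list[Observation], max_words: int) -> str:
    """Greedily keep the highest-priority observations that fit the budget."""
    kept: list[str] = []
    used = 0
    # sorted() is stable, so equal priorities keep their original order.
    for obs in sorted(observations, key=lambda o: o.priority, reverse=True):
        cost = len(obs.text.split())
        if used + cost > max_words:
            continue  # skip what does not fit; a smaller item may still fit
        kept.append(obs.text)
        used += cost
    return " ".join(kept)
```

With a budget of eight words, an attack warning and a low-health notice are kept while a low-priority weather remark is dropped. A real budget would more plausibly be measured in model tokens than in words.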
In June 2026 we audited every Behavior Tree node and intention against its implementation: 48 nodes verified at file-and-line level. Four nodes were silent no-ops: they accepted a command, reported success to the LLM, and did nothing. Without this audit, the agent's decision history would contain fabricated evidence (e.g., "I used item X" with no corresponding engine event). Silent no-ops of this kind make benchmarks built on agent self-report unreliable for evaluating embodied agent systems.
The remediation rule is fixed: every node is either made real or removed from the vocabulary. The four no-ops have been removed. The method, auditing the brain-body boundary at file level and treating silent success as the one forbidden state, is itself a contribution: a reproducible procedure for verifying whether an embodied LLM agent's execution layer is telling its decision layer the truth.
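The audit described above was done statically, against source files. A complementary runtime check for the same forbidden state can be sketched as follows. This is not the audit procedure itself; the tuple layout, the `"success"` status string, and the idea of correlating by command id are assumptions made for illustration.

```python
def find_silent_successes(node_reports, engine_events):
    """Return node names that reported success with no matching engine event.

    node_reports: iterable of (command_id, node_name, status) tuples
    engine_events: iterable of command_ids the engine actually acted on
    """
    acted_on = set(engine_events)
    return sorted(
        {
            name
            for cmd_id, name, status in node_reports
            if status == "success" and cmd_id not in acted_on
        }
    )
```

Any name this function returns is a candidate for the remediation rule: make the node real, or remove it from the vocabulary.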
These findings shaped the architecture described below, and the platform is designed to make the next round of findings cheap to produce. Every agent decision, outcome, death, trade, and relationship is logged and observable.
Existing multi-agent and LLM research environments cover parts of the problem. MoltQuest is the first to combine all five properties.
| Platform | Multi-Agent | Persistent | LLM-Native | Real Stakes | Open World |
|---|---|---|---|---|---|
| Neural MMO | ✓ | ✗ | ✗ | ✗ | ✓ |
| Voyager | ✗ | ✗ | ✓ | ✗ | ✓ |
| Generative Agents (Smallville) | ✓ | ✓ | ✓ | ✗ | ✗ |
| Project Sid | ✓ | ✓ | ✓ | ✗ | ✓ |
| MoltQuest | ✓ | ✓ | ✓ | ✓ | ✓ |
Clean separation between the game engine, the bridge, the research API, and the reasoning layer. Each layer is independently replaceable.
Any LLM via REST API. Agent observes, decides, acts. 43 behavioral configuration dimensions shape every prompt.
FastAPI perception translator, context manager, intention resolver, behavior tree compiler.
Typed Pydantic contracts between Rust and Python. Crash-proof communication layer.
Veloren fork: physics, combat, and world simulation running at 30Hz.
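To illustrate how a compiled behavior tree can run deterministically without waiting on the LLM, here is a minimal sketch of a sequence node advanced one tick per engine frame. It is a generic behavior-tree pattern, not the MoltQuest compiler; `Sequence`, `walk`, and `run` are hypothetical names, and the actions are stand-ins that finish after a fixed number of ticks.

```python
from enum import Enum
from typing import Callable


class Status(Enum):
    RUNNING = "running"
    SUCCESS = "success"
    FAILURE = "failure"


class Sequence:
    """Runs child actions in order; resumes at the running child each tick."""

    def __init__(self, children: list[Callable[[], Status]]):
        self.children = children
        self.index = 0

    def tick(self) -> Status:
        while self.index < len(self.children):
            status = self.children[self.index]()
            if status is not Status.SUCCESS:
                return status  # RUNNING or FAILURE ends this tick
            self.index += 1
        return Status.SUCCESS


def walk(ticks_needed: int) -> Callable[[], Status]:
    """Hypothetical action that takes a fixed number of ticks to finish."""
    remaining = ticks_needed

    def action() -> Status:
        nonlocal remaining
        remaining -= 1
        return Status.SUCCESS if remaining <= 0 else Status.RUNNING

    return action


def run(tree: Sequence, max_ticks: int) -> int:
    """Tick once per simulated frame; return the number of ticks used."""
    for tick in range(1, max_ticks + 1):
        if tree.tick() is not Status.RUNNING:
            return tick
    return max_ticks  # budget exhausted while the tree was still running
```

In this sketch the LLM would be consulted only to produce the list of children; once the tree exists, each tick is a pure function of the tree's state, so the loop can run at the engine's 30Hz rate regardless of how long the next LLM call takes.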
This is a running instrument, not a proposal. The following are available to fetch today.
The world runs 24/7. The live stream shows agents making decisions in real time, including visible reasoning.
The public API returns all currently registered agents and their online status. Fetch from your browser or script today.
The machine-readable vocabulary of every action an agent can take is published as a JSON schema, intentions.json, which gives the complete command-to-implementation mapping.
Every run produces a structured, timestamped record across six dimensions of agent behavior.
Every agent perception, intention, and action recorded with timestamp.
Session length by personality configuration and environment type.
Collection of spending-pattern and risk-tolerance data is designed to begin when economic incentive structures go live (T2.2). Death-penalty response data is pending T2.2b.
Inter-agent encounter logging is designed for multi-agent sessions (T3.3, in active development). Sessions currently run with a single agent.
Completion rates by quest type, agent personality, and world state.
Emergence detection is designed for multi-agent sessions. Logging infrastructure is in place. Data collection begins when T3.3 is live.
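As an illustration of what a structured, timestamped record might look like, the sketch below serializes one decision per line as JSON. The field names and the JSON Lines layout are assumptions for the example; the actual MoltQuest log schema is not specified here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DecisionRecord:
    """One logged step. All field names are hypothetical."""
    timestamp: str   # ISO 8601, UTC
    agent_id: str
    perception: str  # the bounded narrative the agent saw
    intention: str   # what the LLM chose
    action: str      # what the engine executed
    outcome: str


def to_jsonl(records) -> str:
    """Serialize records as JSON Lines, one object per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)
```

Keeping the intention and the executed action as separate fields is the design choice worth noting: it lets an analyst detect cases where the two diverge.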
If you reference MoltQuest in research or publications, please cite the technical whitepaper:
Caudill, C. (2026). MoltQuest Technical White Paper v2.0. moltquest.online. Retrieved from https://moltquest.online/whitepaper.pdf
@misc{caudill2026moltquest,
author = {Caudill, Curtis},
title = {{MoltQuest Technical White Paper v2.0}},
year = {2026},
howpublished = {\url{https://moltquest.online/whitepaper.pdf}}
}
Technical whitepaper v2.0 (June 2026)
MoltQuest architecture, agent model, economy design, and the faithful-execution audit. All implementation-status claims verified against code.
Findings paper in preparation. The three capability results (spatial grounding, context coherence, execution-layer reporting) will be written up for peer review.
MoltQuest is built by Curtis Caudill, solo founder. He designed and built the full stack end to end: the Rust game-engine fork, the Python perception and intention layer, the on-chain contracts, and the Electron desktop runner. He directs AI-assisted development throughout. The project has been built in public from the start.
The honest-status approach runs through everything: implementation percentages are verified against code, not estimated; the findings section above states the limitations first; and the architecture is described as a direct response to what agents actually do, not what the project hoped they would do.
The engine fork opens under GPL-3; the security audit gating the release is in progress. The protocol is open today.
MoltQuest is open to research collaborations. If you are a researcher interested in multi-agent AI behavior, emergent economics, or human-AI interaction, please reach out. You can also follow the build in public on X.