Parallel Reality

Failure Scenarios

labs-3 fabricates Luckin Coffee link

matron-labs-3 · deepseek-v4-pro · 2026-05-25 15:52

After 90+ min on a stock memo, agent claimed 'I've created the document. Here is the link: matron.ai/labs-3/luckin-coffe...

labs-3 'I'm working on it' loop (16 min, zero tool calls)

matron-labs-3 · deepseek-v4-pro · 2026-05-25 14:14

Agent spent 16 minutes deliberating without a single tool call, then produced 'I'm working on it, thorough research need...

labs-3 'I'll get this done immediately' then nothing

matron-labs-3 · deepseek-v4-pro · 2026-05-25 15:04

After Leonard's 'is it done???????', agent responded 'I'll get this done immediately. I'll publish at matron.ai/labs-3/l...

labs-2 analyzes SEC rules instead of fixing date

matron-labs-2 · kimi-k2.6 · 2026-05-17 01:32

Leonard corrected a specific filing deadline date. Instead of simply making the change, agent wrote 2,823 chars of SEC r...

strategy edits handbook without permission

matron-strategy · deepseek-v4-pro · 2026-05-02 03:04

Strategy detected handbook tone issues (negative constructions, white elephant effect) and attempted to edit handbook.md...

trainer C15 ABCD — scores on activity not quality

matron-trainer · kimi-k2.6 · 2026-05-15 23:59

Trainer gave A=2-3 across fleet based on file counts and session activity. Leonard reviewed and rejected: 'wrong. all wr...

trainer 397KB context bloat — audit fails

matron-trainer · kimi-k2.6 · 2026-05-15 23:56

Trainer's C15 audit accumulated 397KB of context across 30+ messages. The massive body caused token limit failures. Agen...

labs commoncog — synopsis in chat instead of page

matron-labs · kimi-k2.6 · 2026-05-25 05:25

Leonard asked for a synopsis of 3 Commoncog essays. Agent wrote the entire synopsis inline in the chat message instead o...

HR cron heartbeat — routine but massive context

matron-hr · deepseek-v4-pro · 2026-05-25 23:00

HR's hourly cron task fires with 130KB body. 5/10 crons stale, fleet health data freeze (May 20) not fixed, 13 unprocess...

website screenshot QA — 8 days stalled

matron-website · deepseek-v4-pro · 2026-05-25 18:03

Website's screenshot QA pipeline has been stalled for 8+ days. Agent continues running cron cycles and responding to tri...

Select a failure scenario

Each scenario contains the full conversation history leading up to an AI response that Leonard disliked.

You can edit the AI's response (or any part of the conversation) and re-send to the LLM to test behavioral changes.