# ABCD Assessment Rubric v0.3

Evidence-based quality assessment for agent growth. Replaces activity-proxy counting.

**Change from v0.2:** Added A-dimension enforcement gate. The rubric definition was correct, but trainer scoring drifted: agents were scored A=2-3 for trigger-response reliability instead of self-direction. The gate enforces what the definition already says.

**Owners:** strategy (rubric), trainer (application), HR (deployment)
**Cadence:** Per-cycle (trainer), with weekly accumulated view for A dimension

## Core principle
Count **deliverables**, not **activity**.
A deliverable is a persistent artifact that moves a priority forward.
A session, a file creation, a message sent — these are activity. They are inputs, not outputs.

## Assessment protocol
1. **Pick a window.** Last 7 days for A dimension (self-initiated work happens at week-scale, not hour-scale). Per-cycle for B/C/D.
2. **List deliverables.** Every persistent artifact shipped in that window.
3. **Filter for priority relevance.** Cross-reference `priority-ledger.md`.
4. **Quality-check each.** Is it correct? Is it complete? Does it actually work?
5. **Score dimensions.** Based on quality, not count.
6. **Document evidence.** Specific deliverables, specific decisions, specific outcomes.
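
Steps 1 to 4 of the protocol can be sketched as a simple filter. This is a minimal illustration, not a prescribed implementation: the `Deliverable` fields, the `select_for_scoring` helper, and the idea that ledger entries carry IDs are all assumptions introduced here, since the rubric does not define a data schema or the format of `priority-ledger.md`.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Deliverable:
    # Hypothetical record; every field name here is an assumption.
    name: str
    shipped_on: date
    priority_id: Optional[str]  # assumed ID of the matching ledger entry, if any
    verified: bool              # outcome of the step 4 quality check


def select_for_scoring(deliverables, window_end, window_days, ledger_ids):
    """Keep deliverables that are in the window, priority-relevant, and verified."""
    window_start = window_end - timedelta(days=window_days)
    selected = []
    for d in deliverables:
        in_window = window_start < d.shipped_on <= window_end  # step 1
        relevant = d.priority_id in ledger_ids                 # step 3
        if in_window and relevant and d.verified:              # step 4
            selected.append(d)
    return selected
```

Scoring (step 5) then works from the returned list, so raw counts of sessions or files never enter the picture.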

---

## A-dimension enforcement gate (MANDATORY — apply first)

Before scoring A as 2 or 3, you MUST answer:

1. **Name the self-initiated deliverable.** What was shipped without any external trigger?
2. **Prove no trigger existed.** Confirm that none of the following prompted this work: a Leonard message, an inbox message from another agent, a cron job, a heartbeat task, or a TODO carry-over.
3. **Show the gap detection.** What did the agent notice that no one else flagged?

**If you cannot answer all three with specific evidence, score A=1.** No exceptions.

Responding reliably to triggers is reliability, not autonomy. Score that under D (Deliverables), not A.
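
The gate can be expressed as a hard cap on the A score. The sketch below is one possible encoding under stated assumptions: `GateEvidence`, `passes_gate`, `cap_a_score`, and the trigger labels are hypothetical names, and each trigger source must be explicitly recorded as ruled out (`False`) for the proof to count.

```python
from dataclasses import dataclass, field

# Trigger sources named in question 2; the string labels are illustrative.
EXTERNAL_TRIGGERS = ("leonard_message", "agent_inbox", "cron", "heartbeat", "todo_carryover")


@dataclass
class GateEvidence:
    deliverable: str = ""                               # question 1
    trigger_checks: dict = field(default_factory=dict)  # question 2: label -> trigger found?
    gap_noticed: str = ""                               # question 3


def passes_gate(e):
    """True only when all three questions are answered with evidence."""
    named = bool(e.deliverable.strip())
    # A source that was never checked (missing key) does not count as ruled out.
    all_ruled_out = all(e.trigger_checks.get(t) is False for t in EXTERNAL_TRIGGERS)
    gap = bool(e.gap_noticed.strip())
    return named and all_ruled_out and gap


def cap_a_score(proposed, evidence):
    """Return the proposed A score, forced to 1 when the gate is not passed."""
    if proposed >= 2 and not passes_gate(evidence):
        return 1
    return proposed
```

Treating an unchecked trigger source as a failure mirrors the "prove no trigger existed" wording: absence of a check is not proof of absence.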

### Evidence examples

**VALID (A≥2):**
- "Agent scanned arXiv unprompted, found a paper on tool-use agents, published a research page. No Leonard message, no cron, no inbox triggered this."
- "Agent noticed 5 agents had broken HEARTBEAT configs, self-assigned fleet cleanup, deployed fix. No external prompt."

**INVALID (score A=1):**
- "Agent responded to all inbox messages within the cycle." → That's D, not A.
- "Agent ran the scheduled cron task correctly." → Cron is the prompt.
- "Agent completed the TODO item from last cycle." → TODO is the prompt.
- "Agent processed the heartbeat trigger without errors." → Heartbeat is the prompt.
- "Agent's last self-initiated work was 3 cycles ago." → Only work shipped inside the current 7-day window counts.

---

## A — Autonomy

*Definition:* Self-direction without prompts. Proactively finds work when idle. Does not wait to be told.

### Level 1 — Needs work
- Agent works only when an inbox message, TODO item, heartbeat, or cron job triggers it.
- No self-initiated tasks in the 7-day window.
- Idle time between assigned tasks produces nothing.
- **All deliverables trace to external prompts.**
- **This is the default. Most agents are here.** It is not a failure — it reflects the trigger-driven architecture. Growth means producing at least one self-initiated deliverable per week.

### Level 2 — Solid
- Agent generates at least one self-initiated task per week.
- Self-initiated work is relevant to domain or company priorities.
- Does not need prompting for routine domain maintenance.
- Evidence: deliverable exists with no external trigger (proven via enforcement gate).

### Level 3 — Exceptional
- Agent consistently self-sources high-value work (multiple self-initiated deliverables per week).
- Proactively identifies and closes gaps before they become problems.
- Self-initiated work has outsized impact (cross-agent, structural, priority-moving).
- Evidence: multiple self-initiated deliverables; some adopted by other agents.

**Anti-patterns (do NOT count as autonomy):**
- Cron-triggered tasks (the cron is the prompt)
- Heartbeat-resumed TODO items (the TODO is the prompt)
- Reactive inbox processing (the inbox message is the prompt)
- Self-assigned busywork (cleaning files, reformatting logs) without priority relevance
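
As a rough illustration, the three A levels can be mapped from the deliverables that have already passed the enforcement gate for the 7-day window. The function name and dictionary keys are hypothetical. Reading "multiple" as two or more and using adoption by other agents as the proxy for outsized impact are assumptions; a trainer may weigh impact differently.

```python
def a_level(self_initiated):
    """Map gate-passing deliverables from the 7-day window to an A level.

    Each item is a dict with the assumed keys "priority_relevant" and
    "adopted_by_others" (both booleans).
    """
    # Self-assigned busywork without priority relevance does not count.
    relevant = [d for d in self_initiated if d["priority_relevant"]]
    if not relevant:
        return 1
    adopted = any(d["adopted_by_others"] for d in relevant)
    # Assumption: "multiple" means at least two; adoption stands in for impact.
    if len(relevant) >= 2 and adopted:
        return 3
    return 2
```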

---

## B — Breadth

*Definition:* Diverse tasks within domain. Not the same narrow task repeated.

### Level 1 — Needs work
- All deliverables are the same type.
- No expansion into adjacent areas.

### Level 2 — Solid
- Agent handles 2-3 distinct task types within its domain.
- Occasionally steps into adjacent areas when relevant.

### Level 3 — Exceptional
- Agent covers full declared domain and stretches into new areas.
- Successfully handles tasks outside original scope.

**Anti-patterns:** Same task repeated 10 times ≠ breadth. Minor variations on same workflow ≠ breadth.
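
One way to keep repetition from inflating B is to score on distinct task types rather than on deliverable count. The sketch below assumes the trainer has already labeled each deliverable with a task type and collapsed minor workflow variations into a single label; `b_level`, the labels, and the exact thresholds are illustrative, not part of the rubric.

```python
def b_level(task_types, declared_domain):
    """Map task-type labels (one per deliverable) to a B level.

    declared_domain is the set of task-type labels in the agent's declared domain.
    """
    distinct = set(task_types)  # ten repeats of one task collapse to one type
    covers_domain = bool(declared_domain) and declared_domain <= distinct
    outside_scope = distinct - declared_domain
    if covers_domain and outside_scope:
        return 3
    if len(distinct) >= 2:
        return 2
    return 1
```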

---

## C — Capability

*Definition:* Complex multi-step work end-to-end. Deliverables actually work.

### Level 1 — Needs work
- Deliverables are simple, single-step, or incomplete.
- Agent starts complex work but does not finish.
- Output has errors or gaps, or does not function as claimed.

### Level 2 — Solid
- Agent completes multi-step work end-to-end.
- Deliverables function as intended.
- Occasional errors, but caught and fixed within the window.

### Level 3 — Exceptional
- Agent handles complex, ambiguous problems with many unknowns.
- Deliverables are robust and well designed, and they handle edge cases.
- Work requires synthesis across multiple tools, agents, or knowledge domains.

**Anti-patterns:** "Attempted" but abandoned complex work ≠ capability. Volume of output ≠ capability.

---

## D — Deliverables

*Definition:* High-quality, correct, complete. Not volume.

### Level 1 — Needs work
- Deliverables have errors, are incomplete, or require rework.
- Quality is inconsistent. Leonard flags issues.

### Level 2 — Solid
- Deliverables are correct and complete on first delivery.
- Minor polish issues, no structural problems.

### Level 3 — Exceptional
- Deliverables are crisp, well-structured, immediately usable.
- Anticipate next questions or needs.
- Adopted as reference, used by other agents, or explicitly praised.

**Anti-patterns:** Number of files created ≠ quality. Speed of delivery ≠ quality.

---

## Quality lens per deliverable
1. **Does this move a priority forward?** Cross-reference `priority-ledger.md`.
2. **Is the output correct and complete?** Does it work? Is it true?
3. **Is this high-leverage or busywork?** Would the company be worse off if this hadn't been done?

A deliverable that fails any of these three should be flagged, not counted.
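
The flag-versus-count rule can be sketched as a partition over the three lens questions. The key names in `LENS` and the `apply_quality_lens` helper are hypothetical. The answers themselves still come from the trainer's judgment; the code only records which question a deliverable failed.

```python
# Assumed keys for the three lens questions, in order.
LENS = ("moves_priority", "correct_and_complete", "high_leverage")


def apply_quality_lens(deliverables):
    """Split deliverables into counted names and (name, failed questions) flags.

    Each deliverable is a dict with "name" and "checks" (question key -> bool).
    An unanswered question is treated as a failure.
    """
    counted, flagged = [], []
    for d in deliverables:
        failures = [q for q in LENS if not d["checks"].get(q, False)]
        if failures:
            flagged.append((d["name"], failures))
        else:
            counted.append(d["name"])
    return counted, flagged
```

Keeping the list of failed questions alongside each flagged item gives the trainer the specific evidence needed when documenting why a deliverable was excluded.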

---

## Validation checklist
Before submitting ratings:
- [ ] A-dimension enforcement gate was applied. If no self-initiated deliverable is proven, A=1.
- [ ] No score is based on session count, file count, or output volume.
- [ ] Every score is backed by a specific deliverable with a specific outcome.
- [ ] At least one deliverable was checked for correctness.
- [ ] Busywork has been filtered out before scoring.
- [ ] Ratings would be defensible if Leonard challenged them.
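
The mechanical parts of this checklist could be checked before submission, as in the sketch below. The submission layout (`score`, `evidence`, `gate_passed`) and the `validate_submission` helper are assumptions made for illustration. Items that need judgment, such as whether busywork was filtered or whether a rating is defensible, are deliberately left to the trainer.

```python
def validate_submission(ratings):
    """Return a list of problems; an empty list means the mechanical checks pass.

    ratings maps each dimension ("A".."D") to a dict with an integer "score",
    an "evidence" list of {"deliverable": ..., "outcome": ...} dicts, and,
    for A, an optional "gate_passed" boolean.
    """
    problems = []
    for dim in ("A", "B", "C", "D"):
        r = ratings.get(dim)
        if r is None:
            problems.append(f"{dim}: missing rating")
            continue
        if r["score"] not in (1, 2, 3):
            problems.append(f"{dim}: score must be 1, 2, or 3")
        evidence = r.get("evidence", [])
        # Literal reading of the checklist: every score needs a deliverable and an outcome.
        if not any(e.get("deliverable") and e.get("outcome") for e in evidence):
            problems.append(f"{dim}: no deliverable with a specific outcome")
    a = ratings.get("A")
    if a and a["score"] > 1 and not a.get("gate_passed", False):
        problems.append("A: scored above 1 without passing the enforcement gate")
    return problems
```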

---

*Built: 2026-05-26. Change from v0.2: A-dimension enforcement gate added after fleet-wide scoring inflation detected (C26 audit). Trainer was scoring A=2-3 for trigger-response reliability instead of self-direction. The gate forces evidence of no external trigger before scoring A>1. B/C/D dimensions unchanged.*
