diff --git a/agents/agent-evaluator.md b/agents/agent-evaluator.md index 04317118..c44242ba 100644 --- a/agents/agent-evaluator.md +++ b/agents/agent-evaluator.md @@ -203,4 +203,4 @@ TOP IMPROVEMENTS: 3. [Completeness] Handle 429, connection errors, and timeout VERDICT: Redo with specific fixes. Weakest axis: Accuracy (2/5). -``` \ No newline at end of file +``` diff --git a/skills/agent-self-evaluation/SKILL.md b/skills/agent-self-evaluation/SKILL.md index 0e1a2fd6..4e241380 100644 --- a/skills/agent-self-evaluation/SKILL.md +++ b/skills/agent-self-evaluation/SKILL.md @@ -85,7 +85,7 @@ If any axis scored 3 or below: 1. State what you would do differently 2. If the gap is fixable in < 30 seconds (missing link, unclear phrasing), fix it now -3. If the gap requires rework, flag it explicitly: "This axis scored __ because __. Re-running with __ would likely raise it to __." +3. If the gap requires rework, flag it explicitly: "This axis scored [reason] because [evidence]. Re-running with [specific fix] would likely raise it to [score]." ## Code Examples @@ -134,7 +134,7 @@ Overall: 2.8 — Wrong library used. Needs httpx rewrite. ### "Everything is a 5" ``` -❌ Accuracy: 5 — All good. +FAIL: Accuracy: 5 — All good. Completeness: 5 — Everything covered. Clarity: 5 — Clear. ``` @@ -144,7 +144,7 @@ No evidence cited. This is self-congratulation, not evaluation. A real 5 require ### Over-penalizing for scope creep ``` -❌ Completeness: 2 — Didn't handle WebSocket connections or +FAIL: Completeness: 2 — Didn't handle WebSocket connections or gRPC streaming (user didn't ask for these) ``` @@ -153,7 +153,7 @@ Only evaluate against what the user actually requested, not what you could have ### Using the evaluation to re-litigate ``` -❌ "As I said earlier, this approach is wrong. Score: 1" +FAIL: "As I said earlier, this approach is wrong. Score: 1" ``` The evaluation is about the delivered output, not about re-arguing design decisions that were already made. If the approach was wrong, that should have been caught before delivery. @@ -161,7 +161,7 @@ The evaluation is about the delivered output, not about re-arguing design decisi ### Mixing personal preference with objective gaps ``` -❌ "Score: 3. I don't like Python decorators." +FAIL: "Score: 3. I don't like Python decorators." ``` "Don't like" is not evidence. Cite a concrete readability, testability, or correctness concern, or leave the score at 4+. diff --git a/skills/agent-self-evaluation/references/evaluation-criteria.md b/skills/agent-self-evaluation/references/evaluation-criteria.md index 9a352bf1..fbb3cf90 100644 --- a/skills/agent-self-evaluation/references/evaluation-criteria.md +++ b/skills/agent-self-evaluation/references/evaluation-criteria.md @@ -56,7 +56,7 @@ This reference provides concrete scoring anchors for each axis. Use it when you' ### When the user gave unclear instructions -If the user's request was ambiguous, do NOT penalize completeness for not reading minds. Instead, note in the evaluation: "User's request was ambiguous about __. I chose interpretation __. If they meant __, this score would drop to __." +If the user's request was ambiguous, do NOT penalize completeness for not reading minds. Instead, note in the evaluation: "User's request was ambiguous about [scope]. I chose interpretation [chosen interpretation]. If they meant [alternative interpretation], this score would drop to [score]." ### When the task is inherently simple diff --git a/skills/agent-self-evaluation/templates/evaluation-report.md b/skills/agent-self-evaluation/templates/evaluation-report.md index 46737092..bbc06d4b 100644 --- a/skills/agent-self-evaluation/templates/evaluation-report.md +++ b/skills/agent-self-evaluation/templates/evaluation-report.md @@ -83,4 +83,4 @@ Skip the evaluation if: | ≥4.5 | Deliver as-is. No changes needed. | | 3.5–4.4 | Flag top improvement but deliver. Fix if <30 seconds. | | 2.5–3.4 | State what you'd change. Ask user: "Should I redo [axis] or deliver as-is?" | -| <2.5 | Don't deliver. Say: "This scored __ because __. Let me redo this with [specific fix]." Then redo. | +| <2.5 | Don't deliver. Say: "This scored [score] because [evidence]. Let me redo this with [specific fix]." Then redo. |