mirror of
https://github.com/affaan-m/everything-claude-code.git
synced 2026-06-16 16:36:53 +08:00
fix: address final lint blockers for agent self-evaluation
- Replace U+274C cross-mark examples with ASCII FAIL: prefixes - Ensure agent-evaluator markdown ends with trailing newline - Replace markdown placeholder underscores with bracketed placeholders to satisfy markdownlint MD037
This commit is contained in:
parent
1e679bcb47
commit
149be89d39
@ -203,4 +203,4 @@ TOP IMPROVEMENTS:
|
|||||||
3. [Completeness] Handle 429, connection errors, and timeout
|
3. [Completeness] Handle 429, connection errors, and timeout
|
||||||
|
|
||||||
VERDICT: Redo with specific fixes. Weakest axis: Accuracy (2/5).
|
VERDICT: Redo with specific fixes. Weakest axis: Accuracy (2/5).
|
||||||
```
|
```
|
||||||
|
|||||||
@ -85,7 +85,7 @@ If any axis scored 3 or below:
|
|||||||
|
|
||||||
1. State what you would do differently
|
1. State what you would do differently
|
||||||
2. If the gap is fixable in < 30 seconds (missing link, unclear phrasing), fix it now
|
2. If the gap is fixable in < 30 seconds (missing link, unclear phrasing), fix it now
|
||||||
3. If the gap requires rework, flag it explicitly: "This axis scored __ because __. Re-running with __ would likely raise it to __."
|
3. If the gap requires rework, flag it explicitly: "This axis scored [reason] because [evidence]. Re-running with [specific fix] would likely raise it to [score]."
|
||||||
|
|
||||||
## Code Examples
|
## Code Examples
|
||||||
|
|
||||||
@ -134,7 +134,7 @@ Overall: 2.8 — Wrong library used. Needs httpx rewrite.
|
|||||||
### "Everything is a 5"
|
### "Everything is a 5"
|
||||||
|
|
||||||
```
|
```
|
||||||
❌ Accuracy: 5 — All good.
|
FAIL: Accuracy: 5 — All good.
|
||||||
Completeness: 5 — Everything covered.
|
Completeness: 5 — Everything covered.
|
||||||
Clarity: 5 — Clear.
|
Clarity: 5 — Clear.
|
||||||
```
|
```
|
||||||
@ -144,7 +144,7 @@ No evidence cited. This is self-congratulation, not evaluation. A real 5 require
|
|||||||
### Over-penalizing for scope creep
|
### Over-penalizing for scope creep
|
||||||
|
|
||||||
```
|
```
|
||||||
❌ Completeness: 2 — Didn't handle WebSocket connections or
|
FAIL: Completeness: 2 — Didn't handle WebSocket connections or
|
||||||
gRPC streaming (user didn't ask for these)
|
gRPC streaming (user didn't ask for these)
|
||||||
```
|
```
|
||||||
|
|
||||||
@ -153,7 +153,7 @@ Only evaluate against what the user actually requested, not what you could have
|
|||||||
### Using the evaluation to re-litigate
|
### Using the evaluation to re-litigate
|
||||||
|
|
||||||
```
|
```
|
||||||
❌ "As I said earlier, this approach is wrong. Score: 1"
|
FAIL: "As I said earlier, this approach is wrong. Score: 1"
|
||||||
```
|
```
|
||||||
|
|
||||||
The evaluation is about the delivered output, not about re-arguing design decisions that were already made. If the approach was wrong, that should have been caught before delivery.
|
The evaluation is about the delivered output, not about re-arguing design decisions that were already made. If the approach was wrong, that should have been caught before delivery.
|
||||||
@ -161,7 +161,7 @@ The evaluation is about the delivered output, not about re-arguing design decisi
|
|||||||
### Mixing personal preference with objective gaps
|
### Mixing personal preference with objective gaps
|
||||||
|
|
||||||
```
|
```
|
||||||
❌ "Score: 3. I don't like Python decorators."
|
FAIL: "Score: 3. I don't like Python decorators."
|
||||||
```
|
```
|
||||||
|
|
||||||
"Don't like" is not evidence. Cite a concrete readability, testability, or correctness concern, or leave the score at 4+.
|
"Don't like" is not evidence. Cite a concrete readability, testability, or correctness concern, or leave the score at 4+.
|
||||||
|
|||||||
@ -56,7 +56,7 @@ This reference provides concrete scoring anchors for each axis. Use it when you'
|
|||||||
|
|
||||||
### When the user gave unclear instructions
|
### When the user gave unclear instructions
|
||||||
|
|
||||||
If the user's request was ambiguous, do NOT penalize completeness for not reading minds. Instead, note in the evaluation: "User's request was ambiguous about __. I chose interpretation __. If they meant __, this score would drop to __."
|
If the user's request was ambiguous, do NOT penalize completeness for not reading minds. Instead, note in the evaluation: "User's request was ambiguous about [scope]. I chose interpretation [chosen interpretation]. If they meant [alternative interpretation], this score would drop to [score]."
|
||||||
|
|
||||||
### When the task is inherently simple
|
### When the task is inherently simple
|
||||||
|
|
||||||
|
|||||||
@ -83,4 +83,4 @@ Skip the evaluation if:
|
|||||||
| ≥4.5 | Deliver as-is. No changes needed. |
|
| ≥4.5 | Deliver as-is. No changes needed. |
|
||||||
| 3.5–4.4 | Flag top improvement but deliver. Fix if <30 seconds. |
|
| 3.5–4.4 | Flag top improvement but deliver. Fix if <30 seconds. |
|
||||||
| 2.5–3.4 | State what you'd change. Ask user: "Should I redo [axis] or deliver as-is?" |
|
| 2.5–3.4 | State what you'd change. Ask user: "Should I redo [axis] or deliver as-is?" |
|
||||||
| <2.5 | Don't deliver. Say: "This scored __ because __. Let me redo this with [specific fix]." Then redo. |
|
| <2.5 | Don't deliver. Say: "This scored [score] because [evidence]. Let me redo this with [specific fix]." Then redo. |
|
||||||
|
|||||||
Loading…
x
Reference in New Issue
Block a user