mirror of https://github.com/affaan-m/everything-claude-code.git synced 2026-06-16 16:36:53 +08:00

Hawthorn bd45947941 feat(skills,agents): add agent-self-evaluation skill and agent-evaluator persona

Add structured 5-axis self-evaluation framework for agent output quality:
- Accuracy, Completeness, Clarity, Actionability, Conciseness
- Evidence-based scoring with concrete improvement suggestions
- Standalone Python evaluator script with keyword heuristics
- Detailed scoring anchors reference guide
- High-score and low-score annotated examples
- Reusable evaluation report template
- Optional hook integration for session-stop evaluation

Agent persona (agent-evaluator) provides a dedicated subagent
for applying the rubric to agent output with tool-backed verification.

All files tested: Python script runs, examples score correctly
(high 4.2, low 3.4), frontmatter parses clean, 183 lines (under 500).

2026-06-10 16:56:18 +05:30

3.6 KiB

Raw Blame History

Low-Score Example: Adding Retry Logic

Task

Add retry logic with exponential backoff to all HTTP requests in the API client. Max 3 retries, initial delay 1s, max delay 30s.

Agent Output

Here's a retry implementation. I think this should work for most cases.

I used the urllib3 library since it has built-in retry support. The Retry class handles backoff automatically so we just need to configure it.

from urllib3.util import Retry

# Add this somewhere in your HTTP client
retry = Retry(
    total=3,
    backoff_factor=1.0,
    status_forcelist=[500, 502, 503, 504]
)

You'll want to test this before deploying. There might be edge cases with POST requests but those are probably fine. I haven't tested the timeout behavior but it should work.

Let me know if you need anything else!

Evaluation Report

============================================================
AGENT SELF-EVALUATION REPORT
============================================================

  Accuracy         ██░░░ 2/5
    - Hedged claim without verification ("I think this should work")
    - Untested ("I haven't tested the timeout behavior")
    - Speculation without evidence ("those are probably fine")
    → Wrong library used. Project uses httpx, not urllib3.
      urllib3.util.Retry is incompatible with httpx transport.

  Completeness      ███░░ 3/5
    - Explicit gap acknowledged ("might be edge cases with POST")
    - Limited scope noted (only mentioned 5xx, not 429 or connection errors)
    → User asked for "all HTTP requests." Only partial coverage:
      missing 429 handling, connection errors, timeout handling.

  Clarity           ████░ 4/5
    + Uses code blocks
    → Code is readable but no explanation of where to add it
      ("somewhere in your HTTP client" is vague).

  Actionability     ██░░░ 2/5
    - Defers work to user ("you'll want to test this")
    - Vague suggestion without specifics
    → No PR, no file created, no test written. User has to:
      1. Figure out where to add the code
      2. Fix the library mismatch (httpx not urllib3)
      3. Write tests
      4. Handle POST idempotency
      5. Test timeout behavior

  Conciseness       ███░░ 3/5
    - Meta-commentary adds words without information
      ("Let me know if you need anything else!")
    → 120 words. Low word count but low information density.
      Half the text is hedging and disclaimers, not substance.

  OVERALL           2.8/5

TOP IMPROVEMENTS (axes scoring < 4):
  [Accuracy] Switch to httpx.Retry — grep the codebase to confirm the HTTP
    library before writing code.
  [Actionability] Create a PR with the changed file + test file. Run the
    tests. End with "PR #N ready to merge."
  [Completeness] List what's covered AND what's not. If POST retry is
    unsafe, say so explicitly with reasoning.

Why This Scores Poorly

Accuracy fails at the most basic level — wrong library. One grep httpx src/ would have caught this. The hedging language ("I think", "probably", "should work") signals the agent knows it's guessing.
Not actionable. The user received a code snippet and a list of things they need to do. The agent did the easy part (suggesting a library) and deferred the hard parts (testing, integration, edge cases) to the user.
Completeness gaps are acknowledged but not fixed. "Might be edge cases" is worse than not mentioning them — it shows awareness of the gap and a choice not to address it.
Information density is low. 120 words, of which ~60 are hedging/disclaimers/politeness. The actual substance (3 lines of code) could have been delivered in 40 words with verification.

3.6 KiB Raw Blame History