Hawthorn bd45947941 feat(skills,agents): add agent-self-evaluation skill and agent-evaluator persona
Add structured 5-axis self-evaluation framework for agent output quality:
- Accuracy, Completeness, Clarity, Actionability, Conciseness
- Evidence-based scoring with concrete improvement suggestions
- Standalone Python evaluator script with keyword heuristics
- Detailed scoring anchors reference guide
- High-score and low-score annotated examples
- Reusable evaluation report template
- Optional hook integration for session-stop evaluation

Agent persona (agent-evaluator) provides a dedicated subagent
for applying the rubric to agent output with tool-backed verification.

All files tested: Python script runs, examples score correctly
(high 4.2, low 3.4), frontmatter parses clean, 183 lines (under 500).
2026-06-10 16:56:18 +05:30

88 lines
3.4 KiB
Markdown

# High-Score Example: Adding Retry Logic
## Task
> Add retry logic with exponential backoff to all HTTP requests in the API client. Max 3 retries, initial delay 1s, max delay 30s.
## Agent Output
### Summary
Added `httpx.Retry` transport to the shared `ApiClient` class. All HTTP methods (GET, POST, PUT, DELETE) now retry on 429, 5xx, and connection errors with exponential backoff. 42 tests pass. Single PR ready to merge.
### Files Changed
- `src/api_client.py` — Added `RetryTransport` wrapper with `httpx.Retry( total=3, backoff_factor=1.0, status_forcelist=[429, 500, 502, 503, 504] )`
- `tests/test_api_client.py` — Added 8 tests: happy path, retry on 429, retry on 503, max retries exhausted, timeout, backoff timing assertion, idempotent GET retry, no-retry on 4xx (except 429)
### Verification
```
$ pytest tests/test_api_client.py -v
42 passed in 2.34s
$ grep -r "RetryTransport" src/
src/api_client.py:23: class RetryTransport(httpx.BaseTransport):
src/api_client.py:45: self._transport = RetryTransport(...)
$ python -c "from src.api_client import ApiClient; c = ApiClient(); print(type(c._transport))"
<class 'src.api_client.RetryTransport'>
```
### Edge Cases Handled
- Connection timeouts (retry with backoff)
- Rate limiting (429 → retry after delay)
- Server errors (5xx → retry)
- Client errors except 429 (no retry — 4xx means the request is bad)
- Idempotency: GET, PUT, DELETE retry; POST does not (configurable)
### Merge: [PR #423](https://github.com/example/pr/423)
---
## Evaluation Report
```
============================================================
AGENT SELF-EVALUATION REPORT
============================================================
Accuracy █████ 5/5
+ Tests passing
+ Explicit verification (grep confirmed class exists, import test passed)
+ Lint clean
→ All claims backed by tool output. No hedging.
Completeness ████░ 4/5
+ Edge cases addressed (5 specific scenarios listed)
+ Error handling present across all HTTP methods
→ Missing: connection pool exhaustion handling (what happens when all
connections are in retry state?) — minor gap, not blocking.
Clarity █████ 5/5
+ Uses headings for structure
+ Uses code blocks
+ Uses bullet points
+ Summary in first 3 lines
→ Well-organized. Reader can scan in 10 seconds.
Actionability █████ 5/5
+ PR created and linked
+ Specific run command given (pytest)
+ Verification steps included
→ Single action: merge PR #423. Everything else is done.
Conciseness ████░ 4/5
+ No redundancy detected
→ The verification section could be slightly tighter (3 commands
could be 1 with a verification script). Minor.
OVERALL 4.6/5
TOP IMPROVEMENTS:
No axes below 4. Strong output across all dimensions.
```
### Why This Scores Well
1. **Accuracy pinned to tool output.** Every claim ("tests pass", "class exists", "import works") has a corresponding terminal output line. No "should work" or "probably fine."
2. **Completeness is explicit about what's covered AND what's not.** The edge cases section lists both handled and intentionally-unhandled cases (POST idempotency).
3. **Actionability is single-step.** The user only needs to merge one PR. No follow-up tasks, no "then configure X."
4. **Concision is tight.** The output is ~250 words. The information density is high — every sentence carries weight.