mirror of
https://github.com/affaan-m/everything-claude-code.git
synced 2026-06-20 03:40:29 +08:00
fix: address second-round review comments
- Replace httpx.Retry references with correct httpx API usage across all files (httpx has no built-in Retry class; use HTTPTransport/Limits instead) - Fix _check_summary to check first 100 words (not 100 characters) - Fix template to only show → improvement arrow for non-5 scores - Clarify hook documentation: hook echoes reminder, does not run evaluator - Add return type annotation to main() - Make required parameter keyword-only in _read_file_or_text
This commit is contained in:
parent
2ea4d779a3
commit
7c0a0049a8
@ -63,24 +63,20 @@ AGENT SELF-EVALUATION REPORT
|
|||||||
Summary: Overall score X.X/5 across 5 quality axes.
|
Summary: Overall score X.X/5 across 5 quality axes.
|
||||||
|
|
||||||
Accuracy █████ 5/5
|
Accuracy █████ 5/5
|
||||||
+ [Evidence: passing tests, verified claims]
|
+ [Evidence: passing tests, verified claims] (no → when score = 5)
|
||||||
→ [Improvement if score < 5]
|
|
||||||
|
|
||||||
Completeness █████ 5/5
|
Completeness ████░ 4/5
|
||||||
+ [What's covered]
|
+ [What's covered]
|
||||||
→ [Improvement if score < 5]
|
→ [Improvement: only shown when score < 5]
|
||||||
|
|
||||||
Clarity █████ 5/5
|
Clarity █████ 5/5
|
||||||
+ [Structure signals]
|
+ [Structure signals] (no → when score = 5)
|
||||||
→ [Improvement if score < 5]
|
|
||||||
|
|
||||||
Actionability █████ 5/5
|
Actionability █████ 5/5
|
||||||
+ [User can act immediately]
|
+ [User can act immediately] (no → when score = 5)
|
||||||
→ [Improvement if score < 5]
|
|
||||||
|
|
||||||
Conciseness █████ 5/5
|
Conciseness █████ 5/5
|
||||||
+ [Information density]
|
+ [Information density] (no → when score = 5)
|
||||||
→ [Improvement if score < 5]
|
|
||||||
|
|
||||||
OVERALL X.X/5
|
OVERALL X.X/5
|
||||||
|
|
||||||
@ -115,7 +111,7 @@ Summary: Overall score X.X/5 across 5 quality axes.
|
|||||||
|
|
||||||
Accuracy █████ 5/5
|
Accuracy █████ 5/5
|
||||||
+ Tests passing
|
+ Tests passing
|
||||||
+ grep confirms httpx.Retry used correctly
|
+ grep confirms httpx transport configured correctly
|
||||||
+ Import verified
|
+ Import verified
|
||||||
|
|
||||||
Completeness ████░ 4/5
|
Completeness ████░ 4/5
|
||||||
@ -192,13 +188,13 @@ Summary: Overall score X.X/5 across 5 quality axes.
|
|||||||
OVERALL 2.8/5
|
OVERALL 2.8/5
|
||||||
|
|
||||||
CRITICAL ISSUES (axes ≤ 2):
|
CRITICAL ISSUES (axes ≤ 2):
|
||||||
[Accuracy] Score 2/5 — Wrong library. Use httpx.Retry, not urllib3.Retry.
|
[Accuracy] Score 2/5 — Wrong library. Use httpx, not urllib3.
|
||||||
[Actionability] Score 2/5 — No deliverable. Create a PR with test file.
|
[Actionability] Score 2/5 — No deliverable. Create a PR with test file.
|
||||||
|
|
||||||
Self-check: Would the user agree with this assessment? Yes — the report cites the wrong library, lack of tests, and missing deliverable.
|
Self-check: Would the user agree with this assessment? Yes — the report cites the wrong library, lack of tests, and missing deliverable.
|
||||||
|
|
||||||
TOP IMPROVEMENTS:
|
TOP IMPROVEMENTS:
|
||||||
1. [Accuracy] Switch to httpx.Retry — grep the codebase first
|
1. [Accuracy] Switch to httpx — grep the codebase first
|
||||||
2. [Actionability] Create a PR with src/api_client.py + tests
|
2. [Actionability] Create a PR with src/api_client.py + tests
|
||||||
3. [Completeness] Handle 429, connection errors, and timeout
|
3. [Completeness] Handle 429, connection errors, and timeout
|
||||||
|
|
||||||
|
|||||||
@ -114,7 +114,7 @@ Overall: 4.6 — One gap (timeout handling). Fix before merging.
|
|||||||
Task: Add retry logic to HTTP client
|
Task: Add retry logic to HTTP client
|
||||||
|
|
||||||
Scorecard:
|
Scorecard:
|
||||||
Accuracy: 2 — Used urllib3.Retry which doesn't exist in our
|
Accuracy: 2 — Used urllib3 which doesn't match our
|
||||||
httpx-based codebase. Wrong library.
|
httpx-based codebase. Wrong library.
|
||||||
Completeness: 3 — Works for GET. POST/PUT not handled (user
|
Completeness: 3 — Works for GET. POST/PUT not handled (user
|
||||||
said "all HTTP requests").
|
said "all HTTP requests").
|
||||||
@ -125,7 +125,7 @@ Scorecard:
|
|||||||
3 places instead of one shared RetryConfig object.
|
3 places instead of one shared RetryConfig object.
|
||||||
|
|
||||||
Overall: 2.8 — Wrong library used. Needs httpx rewrite.
|
Overall: 2.8 — Wrong library used. Needs httpx rewrite.
|
||||||
Fix accuracy first (switch to httpx.Retry), then extend to all
|
Fix accuracy first (switch to httpx), then extend to all
|
||||||
HTTP methods, then consolidate config.
|
HTTP methods, then consolidate config.
|
||||||
```
|
```
|
||||||
|
|
||||||
@ -171,7 +171,7 @@ The evaluation is about the delivered output, not about re-arguing design decisi
|
|||||||
- **Evaluate the output, not the process.** The user cares about what you delivered, not how many iterations you took.
|
- **Evaluate the output, not the process.** The user cares about what you delivered, not how many iterations you took.
|
||||||
- **One improvement per weak axis.** Don't list 5 things for one axis — pick the highest-impact gap.
|
- **One improvement per weak axis.** Don't list 5 things for one axis — pick the highest-impact gap.
|
||||||
- **Tie improvements to user impact.** "Missing error handling means the user's API call will crash silently" beats "add error handling."
|
- **Tie improvements to user impact.** "Missing error handling means the user's API call will crash silently" beats "add error handling."
|
||||||
- **Be specific about what 'fixed' looks like.** "Re-run with httpx.Retry( total=3, backoff_factor=1.0 )" beats "fix the library issue."
|
- **Be specific about what 'fixed' looks like.** "Re-run with httpx transport configured for retries" beats "fix the library issue."
|
||||||
- **Use tool outputs as evidence.** If tests passed, cite them. If lint is clean, cite it. Don't guess — grep for the proof.
|
- **Use tool outputs as evidence.** If tests passed, cite them. If lint is clean, cite it. Don't guess — grep for the proof.
|
||||||
- **If you can't find any gaps, try harder.** A perfect score across all 5 axes is rare. Ask: "If I were the user, what would annoy me about this output?"
|
- **If you can't find any gaps, try harder.** A perfect score across all 5 axes is rare. Ask: "If I were the user, what would annoy me about this output?"
|
||||||
|
|
||||||
|
|||||||
@ -6,10 +6,10 @@
|
|||||||
## Agent Output
|
## Agent Output
|
||||||
|
|
||||||
### Summary
|
### Summary
|
||||||
Added `httpx.Retry` transport to the shared `ApiClient` class. All HTTP methods (GET, POST, PUT, DELETE) now retry on 429, 5xx, and connection errors with exponential backoff. 42 tests pass. Single PR ready to merge.
|
Added `httpx.HTTPTransport` with retry configuration to the shared `ApiClient` class. All HTTP methods (GET, POST, PUT, DELETE) now retry on 429, 5xx, and connection errors with exponential backoff. 42 tests pass. Single PR ready to merge.
|
||||||
|
|
||||||
### Files Changed
|
### Files Changed
|
||||||
- `src/api_client.py` — Added `RetryTransport` wrapper with `httpx.Retry( total=3, backoff_factor=1.0, status_forcelist=[429, 500, 502, 503, 504] )`
|
- `src/api_client.py` — Added `RetryTransport` wrapper with `httpx.HTTPTransport(retries=3)` and exponential backoff configured via `httpx.Limits`
|
||||||
- `tests/test_api_client.py` — Added 8 tests: happy path, retry on 429, retry on 503, max retries exhausted, timeout, backoff timing assertion, idempotent GET retry, no-retry on 4xx (except 429)
|
- `tests/test_api_client.py` — Added 8 tests: happy path, retry on 429, retry on 503, max retries exhausted, timeout, backoff timing assertion, idempotent GET retry, no-retry on 4xx (except 429)
|
||||||
|
|
||||||
### Verification
|
### Verification
|
||||||
|
|||||||
@ -7,7 +7,7 @@
|
|||||||
|
|
||||||
Here's a retry implementation. I think this should work for most cases.
|
Here's a retry implementation. I think this should work for most cases.
|
||||||
|
|
||||||
I used the urllib3 library since it has built-in retry support. The Retry class handles backoff automatically so we just need to configure it.
|
I used the urllib3 library since it has built-in retry support. The Retry class handles backoff automatically.
|
||||||
|
|
||||||
```python
|
```python
|
||||||
from urllib3.util import Retry
|
from urllib3.util import Retry
|
||||||
@ -38,7 +38,7 @@ AGENT SELF-EVALUATION REPORT
|
|||||||
- Untested ("I haven't tested the timeout behavior")
|
- Untested ("I haven't tested the timeout behavior")
|
||||||
- Speculation without evidence ("those are probably fine")
|
- Speculation without evidence ("those are probably fine")
|
||||||
→ Wrong library used. Project uses httpx, not urllib3.
|
→ Wrong library used. Project uses httpx, not urllib3.
|
||||||
urllib3.util.Retry is incompatible with httpx transport.
|
urllib3.util.Retry is incompatible with httpx.
|
||||||
|
|
||||||
Completeness ███░░ 3/5
|
Completeness ███░░ 3/5
|
||||||
- Explicit gap acknowledged ("might be edge cases with POST")
|
- Explicit gap acknowledged ("might be edge cases with POST")
|
||||||
@ -70,7 +70,7 @@ AGENT SELF-EVALUATION REPORT
|
|||||||
OVERALL 2.8/5
|
OVERALL 2.8/5
|
||||||
|
|
||||||
TOP IMPROVEMENTS (axes scoring < 4):
|
TOP IMPROVEMENTS (axes scoring < 4):
|
||||||
[Accuracy] Switch to httpx.Retry — grep the codebase to confirm the HTTP
|
[Accuracy] Switch to httpx — grep the codebase to confirm the HTTP
|
||||||
library before writing code.
|
library before writing code.
|
||||||
[Actionability] Create a PR with the changed file + test file. Run the
|
[Actionability] Create a PR with the changed file + test file. Run the
|
||||||
tests. End with "PR #N ready to merge."
|
tests. End with "PR #N ready to merge."
|
||||||
|
|||||||
@ -6,8 +6,8 @@ This reference provides concrete scoring anchors for each axis. Use it when you'
|
|||||||
|
|
||||||
| Score | Anchor | Example |
|
| Score | Anchor | Example |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
| 5 | All facts verified against tool output, docs, or authoritative sources. No errors. | Used `httpx.Retry` — confirmed in httpx docs. All method names verified with grep against codebase. |
|
| 5 | All facts verified against tool output, docs, or authoritative sources. No errors. | Configured retry via httpx transport — confirmed in httpx docs. All method names verified with grep against codebase. |
|
||||||
| 4 | One minor inaccuracy that doesn't affect correctness. | Correct library, wrong default value for one parameter (httpx defaults to 1.0s, claimed 0.5s). |
|
| 4 | One minor inaccuracy that doesn't affect correctness. | Correct library, wrong default value for one parameter (claimed 0.5s, docs say 1.0s). |
|
||||||
| 3 | One significant factual error, or 3+ minor inaccuracies. | Used `urllib3.Retry` in an httpx codebase. Works in this one case but wrong library. |
|
| 3 | One significant factual error, or 3+ minor inaccuracies. | Used `urllib3.Retry` in an httpx codebase. Works in this one case but wrong library. |
|
||||||
| 2 | Multiple significant errors. Output would fail if followed. | Claimed "add this to package.json" but project uses pyproject.toml. Two other config claims also wrong. |
|
| 2 | Multiple significant errors. Output would fail if followed. | Claimed "add this to package.json" but project uses pyproject.toml. Two other config claims also wrong. |
|
||||||
| 1 | Fundamentally incorrect. Output contradicts itself or known facts. | Code has syntax errors. API endpoint doesn't exist. Claims a function signature that grep disproves. |
|
| 1 | Fundamentally incorrect. Output contradicts itself or known facts. | Code has syntax errors. API endpoint doesn't exist. Claims a function signature that grep disproves. |
|
||||||
|
|||||||
@ -1,6 +1,6 @@
|
|||||||
# Hook Integration for Session-Stop Self-Evaluation
|
# Hook Integration for Session-Stop Self-Evaluation
|
||||||
|
|
||||||
Add this hook to `hooks/hooks.json` to remind the agent to self-evaluate at the end of every session:
|
Add this hook to `hooks/hooks.json` to remind the agent to self-evaluate at the end of every session (the hook echoes a reminder; it does not run the evaluator automatically):
|
||||||
|
|
||||||
```json
|
```json
|
||||||
{
|
{
|
||||||
|
|||||||
@ -144,7 +144,7 @@ def _check_jargon(text: str) -> tuple[int, list[str]]:
|
|||||||
def _check_summary(text: str) -> tuple[int, list[str]]:
|
def _check_summary(text: str) -> tuple[int, list[str]]:
|
||||||
"""Return clarity deduction when long output lacks an early summary."""
|
"""Return clarity deduction when long output lacks an early summary."""
|
||||||
summary_terms = ["summary", "tldr", "overview", "in short"]
|
summary_terms = ["summary", "tldr", "overview", "in short"]
|
||||||
has_early_summary = any(term in text[:100].lower() for term in summary_terms)
|
has_early_summary = any(term in ' '.join(text.split()[:100]).lower() for term in summary_terms)
|
||||||
if not has_early_summary and count_words(text) > 300:
|
if not has_early_summary and count_words(text) > 300:
|
||||||
return 1, ["- No summary/TLDR in first 100 words (text is 300+ words)"]
|
return 1, ["- No summary/TLDR in first 100 words (text is 300+ words)"]
|
||||||
return 0, []
|
return 0, []
|
||||||
@ -354,7 +354,7 @@ def format_report(scores: list[AxisScore]) -> str:
|
|||||||
return "\n".join(lines)
|
return "\n".join(lines)
|
||||||
|
|
||||||
|
|
||||||
def _read_file_or_text(path: Optional[str], required: bool = False) -> Optional[str]:
|
def _read_file_or_text(path: Optional[str], *, required: bool = False) -> Optional[str]:
|
||||||
"""Read a file path or return inline text when allowed."""
|
"""Read a file path or return inline text when allowed."""
|
||||||
if path is None:
|
if path is None:
|
||||||
return None
|
return None
|
||||||
@ -379,7 +379,7 @@ def _read_input(args: argparse.Namespace) -> tuple[Optional[str], str]:
|
|||||||
return _read_file_or_text(args.task), sys.stdin.read()
|
return _read_file_or_text(args.task), sys.stdin.read()
|
||||||
|
|
||||||
|
|
||||||
def main():
|
def main() -> None:
|
||||||
parser = argparse.ArgumentParser(
|
parser = argparse.ArgumentParser(
|
||||||
description="Evaluate agent output against the 5-axis rubric"
|
description="Evaluate agent output against the 5-axis rubric"
|
||||||
)
|
)
|
||||||
|
|||||||
Loading…
x
Reference in New Issue
Block a user