Mirror of https://github.com/affaan-m/everything-claude-code.git (synced 2026-06-16 16:36:53 +08:00)
fix: address self-evaluation review comments

- Clarify that agent-evaluator reads skills/agent-self-evaluation/SKILL.md directly
- Standardize on Conciseness terminology, including helper names
- Remove invalid Stop hook matcher and avoid unsupported command-expression matcher examples
- Add explicit hook-integration reference path in SKILL.md
- Add summary and self-check fields to evaluate.py output, template, and agent spec
- Refactor evaluate.py clarity and input parsing helpers
- Remove unused task parameter from check_completeness

Validation:
- python3 -m py_compile skills/agent-self-evaluation/scripts/evaluate.py
- evaluate.py high/low example smoke tests
- node scripts/ci/validate-agents.js
- node scripts/ci/validate-skills.js
- node scripts/ci/validate-hooks.js
- node scripts/ci/validate-no-personal-paths.js
Parent: c0f651cf85
Commit: 2ea4d779a3
@@ -13,7 +13,7 @@ You are a quality evaluator for AI agent output. Your job is to assess agent res
 - Every score below 5 MUST cite specific evidence from the output
 - Provide concrete, actionable improvement suggestions
 - Maintain objectivity — evaluate the output, not the agent's effort or intent
-- Load the `agent-self-evaluation` skill for the detailed scoring rubric
+- Read `skills/agent-self-evaluation/SKILL.md` for the detailed scoring rubric. Example input is a standard ECC `SKILL.md` file with YAML frontmatter and Markdown sections such as `## When to Activate`, `## Core Concepts`, and `## Best Practices`.
 
 - DO NOT re-perform the original task
 - DO NOT suggest alternative approaches unless the current approach is factually wrong
@@ -60,6 +60,7 @@ Use this exact format (matches `scripts/evaluate.py` output):
 ============================================================
 AGENT SELF-EVALUATION REPORT
 ============================================================
+Summary: Overall score X.X/5 across 5 quality axes.
 
 Accuracy █████ 5/5
 + [Evidence: passing tests, verified claims]
@@ -87,6 +88,8 @@ CRITICAL ISSUES (axes ≤ 2):
 [Axis] Score N/5 — specific fix needed
 (or "None" if no axis ≤ 2)
 
+Self-check: Would the user agree with this assessment? [Yes/No + brief justification]
+
 TOP IMPROVEMENTS:
 1. [Highest impact fix]
 2. [Second highest]
@@ -96,7 +99,7 @@ VERDICT: [Deliver as-is / Fix N issues then deliver / Redo from scratch]
 
 ## Output Format
 
-Always include the structured report above, matching the `scripts/evaluate.py` output format exactly. The report title is "AGENT SELF-EVALUATION REPORT" (not "AGENT EVALUATION REPORT").
+Always include the structured report above, matching the `scripts/evaluate.py` output format exactly. The report title is "AGENT SELF-EVALUATION REPORT".
 
 ## Examples
 
@@ -108,6 +111,7 @@ Task: Add retry logic to HTTP client. 3 retries, exponential backoff.
 ============================================================
 AGENT SELF-EVALUATION REPORT
 ============================================================
+Summary: Overall score X.X/5 across 5 quality axes.
 
 Accuracy █████ 5/5
 + Tests passing
@@ -138,6 +142,8 @@ AGENT SELF-EVALUATION REPORT
 CRITICAL ISSUES (axes ≤ 2):
 None
 
+Self-check: Would the user agree with this assessment? Yes — the scores cite passing tests, grep verification, and the remaining gaps are minor.
+
 TOP IMPROVEMENTS:
 1. [Completeness] Add connection pool exhaustion to edge cases doc
 2. [Conciseness] Consolidate verification commands into a single script
@@ -153,6 +159,7 @@ Task: Same as above.
 ============================================================
 AGENT SELF-EVALUATION REPORT
 ============================================================
+Summary: Overall score X.X/5 across 5 quality axes.
 
 Accuracy ██░░░ 2/5
 + Code block present
@@ -188,6 +195,8 @@ CRITICAL ISSUES (axes ≤ 2):
 [Accuracy] Score 2/5 — Wrong library. Use httpx.Retry, not urllib3.Retry.
 [Actionability] Score 2/5 — No deliverable. Create a PR with test file.
 
+Self-check: Would the user agree with this assessment? Yes — the report cites the wrong library, lack of tests, and missing deliverable.
+
 TOP IMPROVEMENTS:
 1. [Accuracy] Switch to httpx.Retry — grep the codebase first
 2. [Actionability] Create a PR with src/api_client.py + tests
@@ -15,7 +15,7 @@ After completing a complex task, the agent pauses to rate its own output against
 - After a debugging session that involved 3+ attempts
 - After producing a design document, architecture decision, or written analysis
 - When the user asks "how good was that?" or "rate yourself"
-- At the end of any session Stop hook (if configured — see References)
+- At the end of any session Stop hook (if configured — see `references/hook-integration.md`)
 
 ## Core Concepts
 
@@ -1,13 +1,12 @@
 # Hook Integration for Session-Stop Self-Evaluation
 
-Add this hook to `hooks/hooks.json` to automatically trigger self-evaluation at the end of every session:
+Add this hook to `hooks/hooks.json` to remind the agent to self-evaluate at the end of every session:
 
 ```json
 {
   "hooks": {
     "Stop": [
       {
-        "matcher": "true",
         "hooks": [
           {
             "type": "command",
@@ -21,6 +20,8 @@ Add this hook to `hooks/hooks.json` to automatically trigger self-evaluation at
 }
 ```
 
+`Stop` events do not use a `matcher` field. Keep the hook object limited to `hooks` and metadata such as `description`.
+
 ## Integration with the Python Evaluator
 
 The `scripts/evaluate.py` script can be used as a standalone tool:
@@ -33,25 +34,27 @@ echo "Your agent response here" | python3 skills/agent-self-evaluation/scripts/e
 python3 skills/agent-self-evaluation/scripts/evaluate.py --task task.txt --output response.txt
 ```
 
-To integrate it into hooks, capture the last agent output to a file first, then run the evaluator:
+To integrate it into hooks, capture the last agent output to a file first, then run the evaluator. For lightweight reminders after shell-based verification, use a simple supported matcher string:
 
 ```json
 {
   "PostToolUse": [
     {
-      "matcher": "tool == \"Bash\" && tool_input.command matches \"(test|pytest|npm test|go test)\"",
+      "matcher": "Bash",
       "hooks": [
         {
           "type": "command",
-          "command": "echo '[Self-Eval] Tests completed. Consider running agent-self-evaluation.'"
+          "command": "echo '[Self-Eval] If this command completed verification for a non-trivial task, consider running agent-self-evaluation.'"
         }
       ],
-      "description": "Remind agent to self-evaluate after test runs"
+      "description": "Remind agent to self-evaluate after shell verification"
     }
   ]
 }
 ```
 
+This avoids documenting unsupported command-expression matcher syntax. If your harness supports command-level matcher expressions, prefer a word-boundary regex such as `\b(pytest|npm test|go test)\b` rather than a broad `test` substring.
+
 These hooks are opt-in. Add them to your local `hooks/hooks.json` if you want automated evaluation prompts.
 
 ## Manual Usage (Recommended)
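The hook-integration note above recommends a word-boundary regex over a broad `test` substring. The sketch below shows the difference in plain Python. The `is_test_command` helper and the sample commands are hypothetical and only illustrate the matching behavior; they do not show which matcher syntax any particular harness accepts.

```python
import re

# Broad substring match: fires on any command that contains "test".
BROAD = re.compile(r"test")
# Word-boundary pattern quoted in the hook-integration note.
BOUNDED = re.compile(r"\b(pytest|npm test|go test)\b")


def is_test_command(command: str, pattern: re.Pattern = BOUNDED) -> bool:
    """Return True when the pattern matches the command (hypothetical helper)."""
    return pattern.search(command) is not None


if __name__ == "__main__":
    for cmd in ["pytest -q", "npm test", "go test ./...", "cat latest.log"]:
        broad = is_test_command(cmd, BROAD)
        bounded = is_test_command(cmd)
        print(f"{cmd!r}: broad={broad} bounded={bounded}")
```

The broad pattern also matches `cat latest.log` because `latest` contains `test`, whereas the bounded pattern only matches the three listed runners.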
@@ -83,7 +83,7 @@ def check_accuracy(text: str) -> AxisScore:
     return result
 
 
-def check_completeness(text: str, task: Optional[str] = None) -> AxisScore:
+def check_completeness(text: str) -> AxisScore:
     """Check for requirement coverage, edge cases, error handling."""
     evidence = []
     score = 5
@@ -125,13 +125,36 @@ def check_completeness(text: str, task: Optional[str] = None) -> AxisScore:
     return result
 
 
+def _check_jargon(text: str) -> tuple[int, list[str]]:
+    """Return clarity deductions for unexplained domain jargon."""
+    jargon = [
+        (r"\b(idempotent|race condition|deadlock|thundering herd)\b", "concurrency"),
+        (r"\b(exponential backoff|circuit breaker|bulkhead)\b", "resilience"),
+        (r"\b(ACID|CAP|eventual consistency|linearizability)\b", "database theory"),
+    ]
+    explanation_pattern = r"(?i)({domain}|means|refers to|i\.e\.|in other words)"
+    for pattern, domain in jargon:
+        has_term = re.search(pattern, text, re.IGNORECASE)
+        explains_term = re.search(explanation_pattern.format(domain=domain), text)
+        if has_term and not explains_term:
+            return 1, [f"- Domain term used without explanation ({domain})"]
+    return 0, []
+
+
+def _check_summary(text: str) -> tuple[int, list[str]]:
+    """Return clarity deduction when long output lacks an early summary."""
+    summary_terms = ["summary", "tldr", "overview", "in short"]
+    has_early_summary = any(term in text[:100].lower() for term in summary_terms)
+    if not has_early_summary and count_words(text) > 300:
+        return 1, ["- No summary/TLDR in first 100 words (text is 300+ words)"]
+    return 0, []
+
+
 def check_clarity(text: str) -> AxisScore:
     """Check for structure, readability, jargon handling."""
     evidence = []
-    score = 5
     deductions = 0
 
-    # Positive signals
     if re.search(r"^#{1,3}\s+", text, re.MULTILINE):
         evidence.append("+ Uses headings for structure")
     if re.search(r"```", text):
@@ -139,33 +162,16 @@ def check_clarity(text: str) -> AxisScore:
     if re.search(r"^\s*[-*]\s+", text, re.MULTILINE):
         evidence.append("+ Uses bullet points")
 
-    # Negative signals
-    # Wall of text: long paragraph without breaks
-    paragraphs = [p for p in text.split("\n\n") if p.strip()]
-    for p in paragraphs:
-        if count_words(p) > 200:
+    for paragraph in [p for p in text.split("\n\n") if p.strip()]:
+        if count_words(paragraph) > 200:
             deductions += 1
             evidence.append("- Wall-of-text paragraph (>200 words without break)")
             break
 
-    # Jargon without definition
-    jargon = [
-        (r"\b(idempotent|race condition|deadlock|thundering herd)\b", "concurrency"),
-        (r"\b(exponential backoff|circuit breaker|bulkhead)\b", "resilience"),
-        (r"\b(ACID|CAP|eventual consistency|linearizability)\b", "database theory"),
-    ]
-    for pattern, domain in jargon:
-        if re.search(pattern, text, re.IGNORECASE):
-            if not re.search(rf"(?i)({domain}|means|refers to|i\.e\.|in other words)", text):
-                deductions += 1
-                evidence.append(f"- Domain term used without explanation ({domain})")
-                break
-
-    if not any(t in text[:100].lower() for t in ["summary", "tldr", "overview", "in short"]):
-        # No early summary — penalize only if text is long
-        if count_words(text) > 300:
-            deductions += 1
-            evidence.append("- No summary/TLDR in first 100 words (text is 300+ words)")
+    jargon_deductions, jargon_evidence = _check_jargon(text)
+    summary_deductions, summary_evidence = _check_summary(text)
+    deductions += jargon_deductions + summary_deductions
+    evidence.extend(jargon_evidence + summary_evidence)
 
     if deductions >= 3:
         score = 2
@@ -173,6 +179,8 @@ def check_clarity(text: str) -> AxisScore:
         score = 3
     elif deductions == 1:
         score = 4
+    else:
+        score = 5
 
     if not evidence:
         evidence.append("+ Well-structured with no clarity issues detected")
@@ -227,7 +235,7 @@ def check_actionability(text: str) -> AxisScore:
     return result
 
 
-def check_concision(text: str, task: Optional[str] = None) -> AxisScore:
+def check_conciseness(text: str, task: Optional[str] = None) -> AxisScore:
     """Check for redundancy, filler, information density."""
     evidence = []
     score = 5
@@ -278,10 +286,10 @@ def evaluate(task: Optional[str], output: str) -> list[AxisScore]:
     """Run all 5 axis checks and return scored results."""
     return [
         check_accuracy(output),
-        check_completeness(output, task),
+        check_completeness(output),
         check_clarity(output),
         check_actionability(output),
-        check_concision(output, task),
+        check_conciseness(output, task),
     ]
 
 
@@ -292,13 +300,13 @@ def format_report(scores: list[AxisScore]) -> str:
     lines.append("=" * 60)
     lines.append("AGENT SELF-EVALUATION REPORT")
     lines.append("=" * 60)
+    lines.append(f"Summary: Overall score {avg:.1f}/5 across 5 quality axes.")
     lines.append("")
 
     for s in scores:
         bar = "█" * s.score + "░" * (5 - s.score)
         lines.append(f" {s.name:<15} {bar} {s.score}/5")
-        for e in s.evidence:
-            lines.append(f" {e}")
+        lines.extend(f" {e}" for e in s.evidence)
         if s.improvement:
             lines.append(f" → {s.improvement}")
         lines.append("")
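The score bar and the new summary line in `format_report` can be tried in isolation with a small sketch. The helper names `render_bar` and `summary_line` are hypothetical, the clamp in `render_bar` is an addition that the diff does not contain, and computing `avg` as a plain mean is an assumption because its definition lies outside the hunks shown.

```python
def render_bar(score: int, width: int = 5) -> str:
    """Render a filled/empty bar like the one built in format_report."""
    filled = max(0, min(score, width))  # clamp is an addition, not in the diff
    return "█" * filled + "░" * (width - filled)


def summary_line(scores: list[int]) -> str:
    """Build the summary line; avg as a plain mean is an assumption."""
    avg = sum(scores) / len(scores)
    return f"Summary: Overall score {avg:.1f}/5 across 5 quality axes."


if __name__ == "__main__":
    axis_scores = [5, 4, 4, 5, 4]
    print(summary_line(axis_scores))
    for score in axis_scores:
        print(render_bar(score), f"{score}/5")
```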
@@ -316,6 +324,8 @@ def format_report(scores: list[AxisScore]) -> str:
         lines.append(" None")
 
     lines.append("")
+    lines.append("Self-check: Would the user agree with this assessment? [Yes/No + brief justification]")
+    lines.append("")
 
     # Top improvements (axes scoring < 4, ranked by impact)
     improvements = [(s, s.improvement) for s in scores if s.improvement and s.score < 4]
@@ -344,6 +354,31 @@ def format_report(scores: list[AxisScore]) -> str:
     return "\n".join(lines)
 
 
+def _read_file_or_text(path: Optional[str], required: bool = False) -> Optional[str]:
+    """Read a file path or return inline text when allowed."""
+    if path is None:
+        return None
+    try:
+        with open(path) as f:
+            return f.read()
+    except FileNotFoundError:
+        if required:
+            print(f"Error: output file '{path}' not found", file=sys.stderr)
+            sys.exit(1)
+        return path
+
+
+def _read_input(args: argparse.Namespace) -> tuple[Optional[str], str]:
+    """Read task and output for interactive, file, or pipe mode."""
+    if args.interactive:
+        task = input("Task description: ").strip()
+        print("Paste agent output (Ctrl+D to finish):")
+        return task, sys.stdin.read()
+    if args.output:
+        return _read_file_or_text(args.task), _read_file_or_text(args.output, required=True) or ""
+    return _read_file_or_text(args.task), sys.stdin.read()
+
+
 def main():
     parser = argparse.ArgumentParser(
         description="Evaluate agent output against the 5-axis rubric"
@@ -353,38 +388,7 @@ def main():
     parser.add_argument("--interactive", action="store_true", help="Prompt for task and read output from stdin")
     args = parser.parse_args()
 
-    task = None
-    output = None
-
-    if args.interactive:
-        task = input("Task description: ").strip()
-        print("Paste agent output (Ctrl+D to finish):")
-        output = sys.stdin.read()
-    elif args.task and args.output:
-        # Read task
-        try:
-            with open(args.task) as f:
-                task = f.read()
-        except FileNotFoundError:
-            task = args.task  # Treat as inline text
-
-        # Read output
-        try:
-            with open(args.output) as f:
-                output = f.read()
-        except FileNotFoundError:
-            print(f"Error: output file '{args.output}' not found", file=sys.stderr)
-            sys.exit(1)
-    else:
-        # Pipe mode: read output from stdin
-        output = sys.stdin.read()
-        if args.task:
-            try:
-                with open(args.task) as f:
-                    task = f.read()
-            except FileNotFoundError:
-                task = args.task
-
+    task, output = _read_input(args)
     if not output:
         print("Error: no output to evaluate", file=sys.stderr)
         sys.exit(1)
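The `_read_file_or_text` helper introduced above treats its argument as a file path first and falls back to inline text when no such file exists. A minimal sketch of that fallback follows. `read_path_or_text` is a hypothetical variant that raises instead of calling `sys.exit`, so the behavior can be exercised without ending the process.

```python
import os
import tempfile
from typing import Optional


def read_path_or_text(value: Optional[str], required: bool = False) -> Optional[str]:
    """Hypothetical variant of _read_file_or_text that raises instead of exiting."""
    if value is None:
        return None
    try:
        with open(value) as f:
            return f.read()
    except FileNotFoundError:
        if required:
            raise
        return value  # no such file: treat the argument as inline text


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("Add retry logic to HTTP client.")
    print(read_path_or_text(tmp.name))            # contents of the file
    print(read_path_or_text("Add retry logic"))   # inline text, no such file
    os.unlink(tmp.name)
```

One limitation of this pattern: only `FileNotFoundError` is caught, so an inline string that the operating system rejects as a path for another reason (for example, a name that is too long) would surface as a different `OSError` instead of falling back to inline text.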
@@ -6,6 +6,7 @@ Copy this template and fill in after completing a task. The format matches `scri
 ============================================================
 AGENT SELF-EVALUATION REPORT
 ============================================================
+Summary: Overall score X.X/5 across 5 quality axes.
 
 Accuracy █████ 5/5 or ███░░ 3/5
 + [Evidence: passing tests, verified claims]
@@ -38,6 +39,8 @@ CRITICAL ISSUES (axes ≤ 2):
 [Axis] Score N/5 — specific fix needed
 (or "None" if no axis ≤ 2)
 
+Self-check: Would the user agree with this assessment? [Yes/No + brief justification]
+
 TOP IMPROVEMENTS:
 1. [Highest impact fix]
 2. [Second highest]