8 Commits

Author SHA1 Message Date
Hawthorn
149be89d39 fix: address final lint blockers for agent self-evaluation
- Replace U+274C cross-mark examples with ASCII FAIL: prefixes
- Ensure agent-evaluator markdown ends with trailing newline
- Replace markdown placeholder underscores with bracketed placeholders to satisfy markdownlint MD037
2026-06-11 17:58:57 +05:30
Hawthorn
1e679bcb47 fix(agents): harden git commands against pager-based code execution
Git commands (log, diff, show) can execute arbitrary code via:
- core.pager set in repo-local .git/config
- diff.external pointing to an attacker-controlled binary
- filter drivers in .gitattributes

Mitigation: require the --no-pager flag and recommend -c core.pager=cat
to disable pager-driven execution. Git commands moved from the
unqualified allowlist to a hardened allowlist with explicit flags.
2026-06-10 18:30:22 +05:30
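
As an illustration of the hardened allowlist described in 1e679bcb47, here is a minimal sketch of how such an invocation might be assembled. The helper name `hardened_git_argv` and the set `HARDENED_SUBCOMMANDS` are hypothetical and do not come from the repository. The `--no-ext-diff` and `--no-textconv` flags are an added assumption aimed at the `diff.external` vector; the commit itself only names `--no-pager` and `-c core.pager=cat`. Filter drivers in `.gitattributes` are not addressed by this sketch.

```python
# Hypothetical helper: builds an argv list for a read-only git command
# with the pager disabled. Names here are illustrative only.
HARDENED_SUBCOMMANDS = {"log", "diff", "show"}


def hardened_git_argv(subcommand, *args):
    """Return an argv list for a read-only git subcommand with pager disabled."""
    if subcommand not in HARDENED_SUBCOMMANDS:
        raise ValueError(f"git {subcommand} is not on the hardened allowlist")
    return [
        "git",
        "--no-pager",              # named in the commit message
        "-c", "core.pager=cat",    # named in the commit message
        subcommand,
        "--no-ext-diff",           # assumption: skip diff.external drivers
        "--no-textconv",           # assumption: skip textconv programs
        *args,
    ]
```

Passing the result to `subprocess.run` as a list (never through a shell) keeps repository-controlled strings from being interpreted as shell syntax.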
Hawthorn
08f66b4909 fix(agents): add Bash tool guardrails to agent-evaluator
List allowed read-only commands (grep, cat, ls, find, head, tail, wc, stat,
git log/diff/show) and explicitly forbid destructive commands (rm, mv, chmod,
git push, git commit, sudo, pip/npm install, piping curl or wget output to sh). Any
write/delete/remote-push requires explicit user confirmation.
2026-06-10 18:18:58 +05:30
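
The guardrails in 08f66b4909 are written as prose instructions for the agent. The sketch below shows one way the same policy could be expressed as a check. All names (`classify`, `READ_ONLY`, `FORBIDDEN`, and so on) are hypothetical. It matches on command names only and does not inspect flags, so something like `find -delete` would still need separate handling. It also rejects any shell metacharacter outright, which is stricter than the commit's wording.

```python
import shlex

# Hypothetical policy tables mirroring the lists in the commit message.
READ_ONLY = {"grep", "cat", "ls", "find", "head", "tail", "wc", "stat"}
GIT_READ_ONLY = {"log", "diff", "show"}
FORBIDDEN = {"rm", "mv", "chmod", "sudo"}
FORBIDDEN_PAIRS = {("git", "push"), ("git", "commit"),
                   ("pip", "install"), ("npm", "install")}
SHELL_METACHARS = set("|;&<>`$")


def _git_subcommand(tokens):
    """Return the first non-option token after 'git', skipping '-c key=value'."""
    i = 1
    while i < len(tokens):
        if tokens[i] == "-c":
            i += 2
        elif tokens[i].startswith("-"):
            i += 1
        else:
            return tokens[i]
    return None


def classify(command):
    """Return 'allow', 'forbid', or 'confirm' for a single command line."""
    # Conservative: any pipe, redirect, or substitution is refused outright.
    if SHELL_METACHARS & set(command):
        return "forbid"
    tokens = shlex.split(command)
    if not tokens:
        return "forbid"
    name = tokens[0]
    sub = _git_subcommand(tokens) if name == "git" else (
        tokens[1] if len(tokens) > 1 else None)
    if name in FORBIDDEN or (name, sub) in FORBIDDEN_PAIRS:
        return "forbid"
    if name in READ_ONLY or (name == "git" and sub in GIT_READ_ONLY):
        return "allow"
    # Anything else needs explicit user confirmation.
    return "confirm"
```

For example, `classify("git --no-pager log -n 5")` falls on the allow path, while `classify("touch notes.txt")` falls through to confirmation.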
Hawthorn
7c0a0049a8 fix: address second-round review comments
- Replace httpx.Retry references with correct httpx API usage across all files
  (httpx has no built-in Retry class; use HTTPTransport/Limits instead)
- Fix _check_summary to check first 100 words (not 100 characters)
- Fix template to only show → improvement arrow for non-5 scores
- Clarify hook documentation: hook echoes reminder, does not run evaluator
- Add return type annotation to main()
- Make required parameter keyword-only in _read_file_or_text
2026-06-10 17:59:25 +05:30
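
The words-versus-characters fix in 7c0a0049a8 is easy to get wrong, so here is a small sketch of the distinction. The actual body of `_check_summary` is not shown in this log; the functions `summary_window` and `has_summary_marker` and the marker list below are hypothetical stand-ins. The `*` in the signatures makes the following parameters keyword-only, the same mechanism the commit applies to `_read_file_or_text`.

```python
def summary_window(text: str, *, max_words: int = 100) -> str:
    """Return the first max_words whitespace-separated words of text."""
    return " ".join(text.split()[:max_words])


def has_summary_marker(text: str, *, markers=("summary", "tl;dr"),
                       max_words: int = 100) -> bool:
    """True if any marker appears within the first max_words words."""
    window = summary_window(text, max_words=max_words).lower()
    return any(marker in window for marker in markers)


# A marker at word 51 sits well past character 100 but inside the
# first 100 words, so a character slice misses it and a word slice finds it.
sample = "word " * 50 + "Summary: all checks passed"
found_by_chars = "summary" in sample[:100].lower()
found_by_words = has_summary_marker(sample)
```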
Hawthorn
2ea4d779a3 fix: address self-evaluation review comments
- Clarify that agent-evaluator reads skills/agent-self-evaluation/SKILL.md directly
- Standardize on Conciseness terminology, including helper names
- Remove invalid Stop hook matcher and avoid unsupported command-expression matcher examples
- Add explicit hook-integration reference path in SKILL.md
- Add summary and self-check fields to evaluate.py output, template, and agent spec
- Refactor evaluate.py clarity and input parsing helpers
- Remove unused task parameter from check_completeness

Validation:
- python3 -m py_compile skills/agent-self-evaluation/scripts/evaluate.py
- evaluate.py high/low example smoke tests
- node scripts/ci/validate-agents.js
- node scripts/ci/validate-skills.js
- node scripts/ci/validate-hooks.js
- node scripts/ci/validate-no-personal-paths.js
2026-06-10 17:25:24 +05:30
Hawthorn
c0f651cf85 fix: align report format across evaluate.py, agent spec, and template
- evaluate.py: add CRITICAL ISSUES (axes ≤ 2) section, VERDICT line
- agent-evaluator.md: match format_report output exactly (title, evidence markers, bar graphs)
- templates/evaluation-report.md: match evaluate.py output format
- All now produce identical AGENT SELF-EVALUATION REPORT structure

Single authoritative format: evaluate.py's format_report() output.
2026-06-10 17:11:44 +05:30
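
To make the "axes ≤ 2" rule in c0f651cf85 concrete, here is a rough sketch of a report renderer. It is not the repository's `format_report()`. The bar style, column layout, and function names are hypothetical, the 1 to 5 scale is assumed from the threshold and the "non-5 scores" wording elsewhere in this log, and the VERDICT line is omitted because its wording is not given here. Only the title string and the five axis names come from the commit messages.

```python
AXES = ("Accuracy", "Completeness", "Clarity", "Actionability", "Conciseness")


def critical_axes(scores, threshold=2):
    """Return the axes whose score is at or below the threshold."""
    return [axis for axis in AXES if scores[axis] <= threshold]


def render_report(scores):
    """Render a plain-text report; the layout is illustrative only."""
    lines = ["AGENT SELF-EVALUATION REPORT"]
    for axis in AXES:
        score = scores[axis]
        bar = "#" * score + "." * (5 - score)  # hypothetical bar style
        lines.append(f"{axis:<14} [{bar}] {score}/5")
    critical = critical_axes(scores)
    if critical:
        lines.append("CRITICAL ISSUES (axes <= 2): " + ", ".join(critical))
    return "\n".join(lines)


example_scores = {"Accuracy": 4, "Completeness": 5, "Clarity": 2,
                  "Actionability": 3, "Conciseness": 4}
report = render_report(example_scores)
```

Keeping the threshold logic in one function is one way to let the script, the agent spec, and the template agree on which axes count as critical.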
Hawthorn
d0a84db177 Update agents/agent-evaluator.md
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2026-06-10 17:08:31 +05:30
Hawthorn
bd45947941 feat(skills,agents): add agent-self-evaluation skill and agent-evaluator persona
Add structured 5-axis self-evaluation framework for agent output quality:
- Accuracy, Completeness, Clarity, Actionability, Conciseness
- Evidence-based scoring with concrete improvement suggestions
- Standalone Python evaluator script with keyword heuristics
- Detailed scoring anchors reference guide
- High-score and low-score annotated examples
- Reusable evaluation report template
- Optional hook integration for session-stop evaluation

Agent persona (agent-evaluator) provides a dedicated subagent
for applying the rubric to agent output with tool-backed verification.

All files tested: Python script runs, examples score correctly
(high 4.2, low 3.4), frontmatter parses clean, 183 lines (under 500).
2026-06-10 16:56:18 +05:30