mirror of
https://github.com/affaan-m/everything-claude-code.git
synced 2026-06-19 02:50:17 +08:00
The official Agent Skills spec (agentskills.io/specification) whitelists exactly 6 top-level frontmatter keys (name/description/license/compatibility/metadata/ allowed-tools). A top-level `origin` key fails the official validator (anthropics/skills quick_validate.py ALLOWED_PROPERTIES; skills-ref validate). This moves `origin: X` -> `metadata.origin: X` across the canonical skills/ tree, preserving each value verbatim. Frontmatter-only, minimal diff. - 251 SKILL.md updated (242 new metadata block, 9 appended to existing metadata) - origin values preserved verbatim (verified 251/251) - YAML validated on all changed files - scoped to canonical skills/ only (docs/<lang> translations + tool mirrors .cursor/.kiro/.agents left untouched; presumably regenerated from canonical) Addresses #2233
308 lines
12 KiB
Markdown
308 lines
12 KiB
Markdown
---
|
||
name: santa-method
|
||
description: "Multi-agent adversarial verification with convergence loop. Two independent review agents must both pass before output ships."
|
||
metadata:
|
||
origin: "Ronald Skelton - Founder, RapportScore.ai"
|
||
---
|
||
|
||
# Santa Method
|
||
|
||
Multi-agent adversarial verification framework. Make a list, check it twice. If it's naughty, fix it until it's nice.
|
||
|
||
The core insight: a single agent reviewing its own output shares the same biases, knowledge gaps, and systematic errors that produced the output. Two independent reviewers with no shared context break this failure mode.
|
||
|
||
## When to Activate
|
||
|
||
Invoke this skill when:
|
||
- Output will be published, deployed, or consumed by end users
|
||
- Compliance, regulatory, or brand constraints must be enforced
|
||
- Code ships to production without human review
|
||
- Content accuracy matters (technical docs, educational material, customer-facing copy)
|
||
- Batch generation at scale where spot-checking misses systemic patterns
|
||
- Hallucination risk is elevated (claims, statistics, API references, legal language)
|
||
|
||
Do NOT use for internal drafts, exploratory research, or tasks with deterministic verification (use build/test/lint pipelines for those).
|
||
|
||
## Architecture
|
||
|
||
```
|
||
┌─────────────┐
|
||
│ GENERATOR │ Phase 1: Make a List
|
||
│ (Agent A) │ Produce the deliverable
|
||
└──────┬───────┘
|
||
│ output
|
||
▼
|
||
┌──────────────────────────────┐
|
||
│ DUAL INDEPENDENT REVIEW │ Phase 2: Check It Twice
|
||
│ │
|
||
│ ┌───────────┐ ┌───────────┐ │ Two agents, same rubric,
|
||
│ │ Reviewer B │ │ Reviewer C │ │ no shared context
|
||
│ └─────┬─────┘ └─────┬─────┘ │
|
||
│ │ │ │
|
||
└────────┼──────────────┼────────┘
|
||
│ │
|
||
▼ ▼
|
||
┌──────────────────────────────┐
|
||
│ VERDICT GATE │ Phase 3: Naughty or Nice
|
||
│ │
|
||
│ B passes AND C passes → NICE │ Both must pass.
|
||
│ Otherwise → NAUGHTY │ No exceptions.
|
||
└──────┬──────────────┬─────────┘
|
||
│ │
|
||
NICE NAUGHTY
|
||
│ │
|
||
▼ ▼
|
||
[ SHIP ] ┌─────────────┐
|
||
│ FIX CYCLE │ Phase 4: Fix Until Nice
|
||
│ │
|
||
│ iteration++ │ Collect all flags.
|
||
│ if i > MAX: │ Fix all issues.
|
||
│ escalate │ Re-run both reviewers.
|
||
│ else: │ Loop until convergence.
|
||
│ goto Ph.2 │
|
||
└──────────────┘
|
||
```
|
||
|
||
## Phase Details
|
||
|
||
### Phase 1: Make a List (Generate)
|
||
|
||
Execute the primary task. No changes to your normal generation workflow. Santa Method is a post-generation verification layer, not a generation strategy.
|
||
|
||
```python
|
||
# The generator runs as normal
|
||
output = generate(task_spec)
|
||
```
|
||
|
||
### Phase 2: Check It Twice (Independent Dual Review)
|
||
|
||
Spawn two review agents in parallel. Critical invariants:
|
||
|
||
1. **Context isolation** — neither reviewer sees the other's assessment
|
||
2. **Identical rubric** — both receive the same evaluation criteria
|
||
3. **Same inputs** — both receive the original spec AND the generated output
|
||
4. **Structured output** — each returns a typed verdict, not prose
|
||
|
||
```python
|
||
REVIEWER_PROMPT = """
|
||
You are an independent quality reviewer. You have NOT seen any other review of this output.
|
||
|
||
## Task Specification
|
||
{task_spec}
|
||
|
||
## Output Under Review
|
||
{output}
|
||
|
||
## Evaluation Rubric
|
||
{rubric}
|
||
|
||
## Instructions
|
||
Evaluate the output against EACH rubric criterion. For each:
|
||
- PASS: criterion fully met, no issues
|
||
- FAIL: specific issue found (cite the exact problem)
|
||
|
||
Return your assessment as structured JSON:
|
||
{
|
||
"verdict": "PASS" | "FAIL",
|
||
"checks": [
|
||
{"criterion": "...", "result": "PASS|FAIL", "detail": "..."}
|
||
],
|
||
"critical_issues": ["..."], // blockers that must be fixed
|
||
"suggestions": ["..."] // non-blocking improvements
|
||
}
|
||
|
||
Be rigorous. Your job is to find problems, not to approve.
|
||
"""
|
||
```
|
||
|
||
```python
|
||
# Spawn reviewers in parallel (Claude Code subagents)
|
||
review_b = Agent(prompt=REVIEWER_PROMPT.format(...), description="Santa Reviewer B")
|
||
review_c = Agent(prompt=REVIEWER_PROMPT.format(...), description="Santa Reviewer C")
|
||
|
||
# Both run concurrently — neither sees the other
|
||
```
|
||
|
||
### Rubric Design
|
||
|
||
The rubric is the most important input. Vague rubrics produce vague reviews. Every criterion must have an objective pass/fail condition.
|
||
|
||
| Criterion | Pass Condition | Failure Signal |
|
||
|-----------|---------------|----------------|
|
||
| Factual accuracy | All claims verifiable against source material or common knowledge | Invented statistics, wrong version numbers, nonexistent APIs |
|
||
| Hallucination-free | No fabricated entities, quotes, URLs, or references | Links to pages that don't exist, attributed quotes with no source |
|
||
| Completeness | Every requirement in the spec is addressed | Missing sections, skipped edge cases, incomplete coverage |
|
||
| Compliance | Passes all project-specific constraints | Banned terms used, tone violations, regulatory non-compliance |
|
||
| Internal consistency | No contradictions within the output | Section A says X, section B says not-X |
|
||
| Technical correctness | Code compiles/runs, algorithms are sound | Syntax errors, logic bugs, wrong complexity claims |
|
||
|
||
#### Domain-Specific Rubric Extensions
|
||
|
||
**Content/Marketing:**
|
||
- Brand voice adherence
|
||
- SEO requirements met (keyword density, meta tags, structure)
|
||
- No competitor trademark misuse
|
||
- CTA present and correctly linked
|
||
|
||
**Code:**
|
||
- Type safety (no `any` leaks, proper null handling)
|
||
- Error handling coverage
|
||
- Security (no secrets in code, input validation, injection prevention)
|
||
- Test coverage for new paths
|
||
|
||
**Compliance-Sensitive (regulated, legal, financial):**
|
||
- No outcome guarantees or unsubstantiated claims
|
||
- Required disclaimers present
|
||
- Approved terminology only
|
||
- Jurisdiction-appropriate language
|
||
|
||
### Phase 3: Naughty or Nice (Verdict Gate)
|
||
|
||
```python
|
||
def santa_verdict(review_b, review_c):
|
||
"""Both reviewers must pass. No partial credit."""
|
||
if review_b.verdict == "PASS" and review_c.verdict == "PASS":
|
||
return "NICE" # Ship it
|
||
|
||
# Merge flags from both reviewers, deduplicate
|
||
all_issues = dedupe(review_b.critical_issues + review_c.critical_issues)
|
||
all_suggestions = dedupe(review_b.suggestions + review_c.suggestions)
|
||
|
||
return "NAUGHTY", all_issues, all_suggestions
|
||
```
|
||
|
||
Why both must pass: if only one reviewer catches an issue, that issue is real. The other reviewer's blind spot is exactly the failure mode Santa Method exists to eliminate.
|
||
|
||
### Phase 4: Fix Until Nice (Convergence Loop)
|
||
|
||
```python
|
||
MAX_ITERATIONS = 3
|
||
|
||
for iteration in range(MAX_ITERATIONS):
|
||
verdict, issues, suggestions = santa_verdict(review_b, review_c)
|
||
|
||
if verdict == "NICE":
|
||
log_santa_result(output, iteration, "passed")
|
||
return ship(output)
|
||
|
||
# Fix all critical issues (suggestions are optional)
|
||
output = fix_agent.execute(
|
||
output=output,
|
||
issues=issues,
|
||
instruction="Fix ONLY the flagged issues. Do not refactor or add unrequested changes."
|
||
)
|
||
|
||
# Re-run BOTH reviewers on fixed output (fresh agents, no memory of previous round)
|
||
review_b = Agent(prompt=REVIEWER_PROMPT.format(output=output, ...))
|
||
review_c = Agent(prompt=REVIEWER_PROMPT.format(output=output, ...))
|
||
|
||
# Exhausted iterations — escalate
|
||
log_santa_result(output, MAX_ITERATIONS, "escalated")
|
||
escalate_to_human(output, issues)
|
||
```
|
||
|
||
Critical: each review round uses **fresh agents**. Reviewers must not carry memory from previous rounds, as prior context creates anchoring bias.
|
||
|
||
## Implementation Patterns
|
||
|
||
### Pattern A: Claude Code Subagents (Recommended)
|
||
|
||
Subagents provide true context isolation. Each reviewer is a separate process with no shared state.
|
||
|
||
```bash
|
||
# In a Claude Code session, use the Agent tool to spawn reviewers
|
||
# Both agents run in parallel for speed
|
||
```
|
||
|
||
```python
|
||
# Pseudocode for Agent tool invocation
|
||
reviewer_b = Agent(
|
||
description="Santa Review B",
|
||
prompt=f"Review this output for quality...\n\nRUBRIC:\n{rubric}\n\nOUTPUT:\n{output}"
|
||
)
|
||
reviewer_c = Agent(
|
||
description="Santa Review C",
|
||
prompt=f"Review this output for quality...\n\nRUBRIC:\n{rubric}\n\nOUTPUT:\n{output}"
|
||
)
|
||
```
|
||
|
||
### Pattern B: Sequential Inline (Fallback)
|
||
|
||
When subagents aren't available, simulate isolation with explicit context resets:
|
||
|
||
1. Generate output
|
||
2. New context: "You are Reviewer 1. Evaluate ONLY against this rubric. Find problems."
|
||
3. Record findings verbatim
|
||
4. Clear context completely
|
||
5. New context: "You are Reviewer 2. Evaluate ONLY against this rubric. Find problems."
|
||
6. Compare both reviews, fix, repeat
|
||
|
||
The subagent pattern is strictly superior — inline simulation risks context bleed between reviewers.
|
||
|
||
### Pattern C: Batch Sampling
|
||
|
||
For large batches (100+ items), full Santa on every item is cost-prohibitive. Use stratified sampling:
|
||
|
||
1. Run Santa on a random sample (10-15% of batch, minimum 5 items)
|
||
2. Categorize failures by type (hallucination, compliance, completeness, etc.)
|
||
3. If systematic patterns emerge, apply targeted fixes to the entire batch
|
||
4. Re-sample and re-verify the fixed batch
|
||
5. Continue until a clean sample passes
|
||
|
||
```python
|
||
import random
|
||
|
||
def santa_batch(items, rubric, sample_rate=0.15):
|
||
sample = random.sample(items, max(5, int(len(items) * sample_rate)))
|
||
|
||
for item in sample:
|
||
result = santa_full(item, rubric)
|
||
if result.verdict == "NAUGHTY":
|
||
pattern = classify_failure(result.issues)
|
||
items = batch_fix(items, pattern) # Fix all items matching pattern
|
||
return santa_batch(items, rubric) # Re-sample
|
||
|
||
return items # Clean sample → ship batch
|
||
```
|
||
|
||
## Failure Modes and Mitigations
|
||
|
||
| Failure Mode | Symptom | Mitigation |
|
||
|-------------|---------|------------|
|
||
| Infinite loop | Reviewers keep finding new issues after fixes | Max iteration cap (3). Escalate. |
|
||
| Rubber stamping | Both reviewers pass everything | Adversarial prompt: "Your job is to find problems, not approve." |
|
||
| Subjective drift | Reviewers flag style preferences, not errors | Tight rubric with objective pass/fail criteria only |
|
||
| Fix regression | Fixing issue A introduces issue B | Fresh reviewers each round catch regressions |
|
||
| Reviewer agreement bias | Both reviewers miss the same thing | Mitigated by independence, not eliminated. For critical output, add a third reviewer or human spot-check. |
|
||
| Cost explosion | Too many iterations on large outputs | Batch sampling pattern. Budget caps per verification cycle. |
|
||
|
||
## Integration with Other Skills
|
||
|
||
| Skill | Relationship |
|
||
|-------|-------------|
|
||
| Verification Loop | Use for deterministic checks (build, lint, test). Santa for semantic checks (accuracy, hallucinations). Run verification-loop first, Santa second. |
|
||
| Eval Harness | Santa Method results feed eval metrics. Track pass@k across Santa runs to measure generator quality over time. |
|
||
| Continuous Learning v2 | Santa findings become instincts. Repeated failures on the same criterion → learned behavior to avoid the pattern. |
|
||
| Strategic Compact | Run Santa BEFORE compacting. Don't lose review context mid-verification. |
|
||
|
||
## Metrics
|
||
|
||
Track these to measure Santa Method effectiveness:
|
||
|
||
- **First-pass rate**: % of outputs that pass Santa on round 1 (target: >70%)
|
||
- **Mean iterations to convergence**: average rounds to NICE (target: <1.5)
|
||
- **Issue taxonomy**: distribution of failure types (hallucination vs. completeness vs. compliance)
|
||
- **Reviewer agreement**: % of issues flagged by both reviewers vs. only one (low agreement = rubric needs tightening)
|
||
- **Escape rate**: issues found post-ship that Santa should have caught (target: 0)
|
||
|
||
## Cost Analysis
|
||
|
||
Santa Method costs approximately 2-3x the token cost of generation alone per verification cycle. For most high-stakes output, this is a bargain:
|
||
|
||
```
|
||
Cost of Santa = (generation tokens) + 2×(review tokens per round) × (avg rounds)
|
||
Cost of NOT Santa = (reputation damage) + (correction effort) + (trust erosion)
|
||
```
|
||
|
||
For batch operations, the sampling pattern reduces cost to ~15-20% of full verification while catching >90% of systematic issues.
|