mirror of
https://github.com/affaan-m/everything-claude-code.git
synced 2026-05-13 18:00:35 +08:00
feat: add machine learning engineering workflow
(cherry picked from commit 4b0eeacd66b2f65b7b11d7f2c8bef056c50b08e4)
parent 54efa1a150
commit 240d52d27f

.agents/skills/mle-workflow/SKILL.md (new file, 344 lines)
@@ -0,0 +1,344 @@
---
name: mle-workflow
description: Production machine-learning engineering workflow for data contracts, reproducible training, model evaluation, deployment, monitoring, and rollback. Use when building, reviewing, or hardening ML systems beyond one-off notebooks.
allowed-tools: Read, Write, Edit, Bash, Grep, Glob
---

# Machine Learning Engineering Workflow

Use this skill to turn model work into a production ML system with clear data contracts, repeatable training, measurable quality gates, deployable artifacts, and operational monitoring.

## When to Activate

- Planning or reviewing a production ML feature, model refresh, ranking system, recommender, classifier, embedding workflow, or forecasting pipeline
- Converting notebook code into a reusable training, evaluation, batch inference, or online inference pipeline
- Designing model promotion criteria, offline/online evals, experiment tracking, or rollback paths
- Debugging failures caused by data drift, label leakage, stale features, artifact mismatch, or inconsistent training and serving logic
- Adding model monitoring, canary rollout, shadow traffic, or post-deploy quality checks

## Scope Calibration

Use only the lanes that fit the system in front of you. This skill is useful for ranking, search, recommendations, classifiers, forecasting, embeddings, LLM workflows, anomaly detection, and batch analytics, but it should not force one architecture onto all of them.

- Do not assume every model has supervised labels, online serving, a feature store, PyTorch, GPUs, human review, A/B tests, or real-time feedback.
- Do not add heavyweight MLOps machinery when a data contract, baseline, eval script, and rollback note would make the change reviewable.
- Do make assumptions explicit when the project lacks labels, delayed outcomes, slice definitions, production traffic, or monitoring ownership.
- Treat examples as interchangeable scaffolds. Replace metrics, serving mode, data stores, and rollout mechanics with the project-native equivalents.

## Related Skills

- `python-patterns` and `python-testing` for Python implementation and pytest coverage
- `pytorch-patterns` for deep learning models, data loaders, device handling, and training loops
- `eval-harness` and `ai-regression-testing` for promotion gates and agent-assisted regression checks
- `database-migrations`, `postgres-patterns`, and `clickhouse-io` for data storage and analytics surfaces
- `deployment-patterns`, `docker-patterns`, and `security-review` for serving, secrets, containers, and production hardening

## Reuse the SWE Surface

Do not treat MLE as separate from software engineering. Most ECC SWE workflows apply directly to ML systems, often with stricter failure modes:

| SWE surface | MLE use |
|-------------|---------|
| `product-capability` / `architecture-decision-records` | Turn model work into explicit product contracts and record irreversible data, model, and rollout choices |
| `repo-scan` / `codebase-onboarding` / `code-tour` | Find existing training, feature, serving, eval, and monitoring paths before introducing a parallel ML stack |
| `plan` / `feature-dev` | Scope model changes as product capabilities with data, eval, serving, and rollback phases |
| `tdd-workflow` / `python-testing` | Test feature transforms, split logic, metric calculations, artifact loading, and inference schemas before implementation |
| `code-reviewer` / `mle-reviewer` | Review code quality plus ML-specific leakage, reproducibility, promotion, and monitoring risks |
| `build-fix` / `pr-test-analyzer` | Diagnose broken CI, flaky evals, missing fixtures, and environment-specific model or dependency failures |
| `quality-gate` / `test-coverage` | Require automated evidence for transforms, metrics, inference contracts, promotion gates, and rollback behavior |
| `eval-harness` / `verification-loop` | Turn offline metrics, slice checks, latency budgets, and rollback drills into repeatable gates |
| `ai-regression-testing` | Preserve every production bug as a regression: missing feature, stale label, bad artifact, schema drift, or serving mismatch |
| `api-design` / `backend-patterns` | Design prediction APIs, batch jobs, idempotent retraining endpoints, and response envelopes |
| `database-migrations` / `postgres-patterns` / `clickhouse-io` | Version labels, feature snapshots, prediction logs, experiment metrics, and drift analytics |
| `deployment-patterns` / `docker-patterns` | Package reproducible training and serving images with health checks, resource limits, and rollback |
| `canary-watch` / `dashboard-builder` | Make rollout health visible with model-version, slice, drift, latency, cost, and delayed-label dashboards |
| `security-review` / `security-scan` | Check model artifacts, notebooks, prompts, datasets, and logs for secrets, PII, unsafe deserialization, and supply-chain risk |
| `e2e-testing` / `browser-qa` / `accessibility` | Test critical product flows that consume predictions, including explainability and fallback UI states |
| `benchmark` / `performance-optimizer` | Measure throughput, p95 latency, memory, GPU utilization, and cost per prediction or retrain |
| `cost-aware-llm-pipeline` / `token-budget-advisor` | Route LLM/embedding workloads by quality, latency, and budget instead of defaulting to the largest model |
| `documentation-lookup` / `search-first` | Verify current library behavior for model serving, feature stores, vector DBs, and eval tooling before coding |
| `git-workflow` / `github-ops` / `opensource-pipeline` | Package MLE changes for review with crisp scope, generated artifacts excluded, and reproducible test evidence |
| `strategic-compact` / `dmux-workflows` | Split long ML work into parallel tracks: data contract, eval harness, serving path, monitoring, and docs |

## Ten MLE Task Simulations

Use these simulations as coverage checks when planning or reviewing MLE work. A strong MLE workflow should reduce each task to explicit contracts, reusable SWE surfaces, automated evidence, and a reviewable artifact.

| ID | Common MLE task | Streamlined ECC path | Required output | Pipeline lanes covered |
|----|-----------------|----------------------|-----------------|------------------------|
| MLE-01 | Frame an ambiguous prediction, ranking, recommender, classifier, embedding, or forecast capability | `product-capability`, `plan`, `architecture-decision-records`, `mle-workflow` | Iteration Compact naming who cares, decision owner, success metric, unacceptable mistakes, assumptions, constraints, and first experiment | product contract, stakeholder loss, risk, rollout |
| MLE-02 | Define metric goals, labels, data sources, and the mistake budget | `repo-scan`, `database-reviewer`, `database-migrations`, `postgres-patterns`, `clickhouse-io` | Data and metric contract with entity grain, label timing, label confidence, feature timing, point-in-time joins, split policy, and dataset snapshot | data contract, metric design, leakage, reproducibility |
| MLE-03 | Build a baseline model and scoring path before adding complexity | `tdd-workflow`, `python-testing`, `python-patterns`, `code-reviewer` | Baseline scorer with confusion matrix, calibration notes, latency/cost estimate, known weaknesses, and tests for score shape and determinism | baseline, scoring, testing, serving parity |
| MLE-04 | Generate features from hypotheses about what separates outcomes | `python-patterns`, `pytorch-patterns`, `docker-patterns`, `deployment-patterns` | Feature plan and transform module covering signal source, missing values, outliers, correlations, leakage checks, and train/serve equivalence | feature pipeline, leakage, training, artifacts |
| MLE-05 | Tune thresholds, configs, and model complexity under tradeoffs | `eval-harness`, `ai-regression-testing`, `quality-gate`, `test-coverage` | Threshold/config report comparing precision, recall, F1, AUC, calibration, group slices, latency, cost, complexity, and acceptable error classes | evaluation, threshold, promotion, regression |
| MLE-06 | Run error analysis and turn mistakes into the next experiment | `eval-harness`, `ai-regression-testing`, `mle-reviewer`, `silent-failure-hunter` | Error cluster report for false positives, false negatives, ambiguous labels, stale features, missing signals, and bug traces with lessons captured | error analysis, bug trace, iteration, regression |
| MLE-07 | Package a model artifact for batch or online inference | `api-design`, `backend-patterns`, `security-review`, `security-scan` | Versioned artifact bundle with preprocessing, config, dependency constraints, schema validation, safe loading, and PII-safe logs | artifact, security, inference contract |
| MLE-08 | Ship online serving or batch scoring with feedback capture | `api-design`, `backend-patterns`, `e2e-testing`, `browser-qa`, `accessibility` | Prediction endpoint or batch job with response envelope, timeout, batching, fallback, model version, confidence, feedback logging, and product-flow tests | serving, batch inference, fallback, user workflow |
| MLE-09 | Roll out a model with shadow traffic, canary, A/B test, or rollback | `canary-watch`, `dashboard-builder`, `verification-loop`, `performance-optimizer` | Rollout plan naming traffic split, dashboards, p95 latency, cost, quality guardrails, rollback artifact, and rollback trigger | deployment, canary, rollback |
| MLE-10 | Operate, debug, and refresh a production model after launch | `silent-failure-hunter`, `dashboard-builder`, `mle-reviewer`, `doc-updater`, `github-ops` | Observation ledger and refresh plan with drift checks, delayed-label health, alert owners, runbook updates, retrain criteria, and PR evidence | monitoring, incident response, retraining |
## Iteration Compact

Before touching model code, compress the work into one reviewable artifact. This should be short enough to fit in a PR description and precise enough that another engineer can challenge the tradeoffs.

```text
Goal:
Who cares:
Decision owner:
User or system action changed by the model:
Success metric:
Guardrail metrics:
Mistake budget:
Unacceptable mistakes:
Acceptable mistakes:
Assumptions:
Constraints:
Labels and data snapshot:
Baseline:
Candidate signals:
Threshold or config plan:
Eval slices:
Known risks:
Next experiment:
Rollback or fallback:
```

This compact is the MLE equivalent of a strong SWE design note. It keeps the team from optimizing a metric no one trusts, adding features that do not address the real error mode, or shipping complexity without a rollback.
## Decision Brain

Use this loop whenever the task is ambiguous, high-impact, or metric-heavy:

1. Start from the decision, not the model. Name the action that changes downstream behavior.
2. Name who cares and why. Different stakeholders pay different costs for false positives, false negatives, latency, compute spend, opacity, or missed opportunities.
3. Convert ambiguity into hypotheses. Ask what signal would separate outcomes, what evidence would disprove it, and what simple baseline should be hard to beat.
4. Research prior art or a nearby known problem before inventing a bespoke system.
5. Score choices with `(probability, confidence) x (cost, severity, importance, impact)`, as sketched after this list.
6. Consider adversarial behavior, incentives, selective disclosure, distribution shift, and feedback loops.
7. Prefer the simplest change that reduces the most important mistake. Simplicity is not laziness; it is a way to minimize blunders while preserving iteration speed.
8. Capture the decision, evidence, counterargument, and next reversible step.
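A minimal sketch of the step-5 scoring, with illustrative option names and hand-estimated numbers; the confidence weighting is an assumption, not a prescribed formula:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    probability: float  # estimated chance the key failure mode occurs under this option
    confidence: float   # how much we trust that estimate, 0..1
    cost: float         # blended cost/severity/importance/impact if it does occur


def expected_risk(option: Option) -> float:
    # Inflate risk when confidence in the estimate is low (illustrative heuristic).
    return option.probability * option.cost * (2.0 - option.confidence)


options = [
    Option("move the decision threshold", probability=0.10, confidence=0.8, cost=3.0),
    Option("retrain with a new feature family", probability=0.25, confidence=0.5, cost=5.0),
]
safest_first = sorted(options, key=expected_risk)
```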
## Metric and Mistake Economics

Choose metrics from failure costs, not habit:

- Use a confusion matrix early so the team can discuss concrete false positives and false negatives instead of abstract accuracy.
- Favor precision when the cost of an incorrect positive decision dominates.
- Favor recall when the cost of a missed positive dominates.
- Use F1 only when the precision/recall tradeoff is genuinely balanced and explainable.
- Use AUC or ranking metrics when ordering quality matters more than a single threshold.
- Track latency, throughput, memory, and cost as first-class metrics because they shape feasible model complexity.
- Compare against a baseline and the current production model before celebrating an offline gain.
- Treat real-world feedback signals as delayed labels with bias, lag, and coverage gaps; do not treat them as ground truth without analysis.

Every metric choice should state which mistake it makes cheaper, which mistake it makes more likely, and who absorbs that cost.
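One way to keep that statement honest is to derive the headline metrics and the costed mistake total from the same confusion matrix. A minimal sketch; the per-mistake costs are hypothetical and should come from the Iteration Compact:

```python
def summarize(tp: int, fp: int, fn: int, tn: int, fp_cost: float, fn_cost: float) -> dict[str, float]:
    """Precision/recall/F1 plus the total mistake cost implied by the current threshold."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),  # kept only to show how little it says on its own
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mistake_cost": fp * fp_cost + fn * fn_cost,  # name who absorbs this when you report it
    }


# Illustrative numbers: a missed positive costs ten times a false alarm.
print(summarize(tp=420, fp=80, fn=35, tn=9465, fp_cost=1.0, fn_cost=10.0))
```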
## Data and Feature Hypotheses

Features should come from a theory of separation:

- Text, categorical fields, numeric histories, graph relationships, recency, frequency, and aggregates are candidate signal families, not automatic features.
- For every feature family, state why it should separate outcomes and how it could leak future information.
- For noisy labels, consider adjudication, label confidence, soft targets, or confidence weighting.
- For class imbalance, compare weighted loss, resampling, threshold movement, and calibrated decision rules.
- For missing values, decide whether absence is informative, imputable, or a reason to abstain.
- For outliers, decide whether to clip, bucket, investigate, or preserve them as rare but important signal.
- For correlated features, check whether they are redundant, unstable, or proxies for unavailable future state.
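For the missing-value and outlier bullets, a minimal pure-function sketch (pandas assumed, column names illustrative) that keeps absence visible instead of silently imputing it:

```python
import pandas as pd


def engineer_amount_features(frame: pd.DataFrame, fill_value: float, clip_bounds: tuple[float, float]) -> pd.DataFrame:
    """Return a new frame; never mutate the input, so the transform is reusable at serving time."""
    out = frame.copy()
    # Absence may itself be signal: keep an explicit indicator before imputing.
    out["amount_missing"] = out["amount"].isna().astype(int)
    # fill_value and clip_bounds are statistics computed on the training split only.
    out["amount_filled"] = out["amount"].fillna(fill_value)
    # Clip rather than drop outliers so rare-but-real values stay present and bounded.
    out["amount_clipped"] = out["amount_filled"].clip(*clip_bounds)
    return out
```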
Do not add model complexity until error analysis shows that the baseline is failing for a reason additional signal or capacity can plausibly fix.

## Error Analysis Loop

After each baseline, training run, threshold change, or config change:

1. Split mistakes into false positives, false negatives, abstentions, low-confidence cases, and system failures.
2. Cluster errors by shared traits: language, entity type, source, time, geography, device, sparsity, recency, feature freshness, label source, or model version (see the sketch after this list).
3. Separate model mistakes from data bugs, label ambiguity, product ambiguity, instrumentation gaps, and serving mismatches.
4. Trace each major cluster to one of four moves: better labels, better features, better threshold/config, or better product fallback.
5. Preserve every important mistake as a regression test, eval slice, dashboard panel, or runbook entry.
6. Write the next iteration as a falsifiable experiment, not a vague "improve model" task.
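Steps 1 and 2 usually reduce to a small grouping query over the evaluation output. A minimal pandas sketch, assuming an `eval_df` with illustrative `label`, `pred`, `language`, and `source` columns:

```python
import pandas as pd


def cluster_errors(eval_df: pd.DataFrame) -> pd.DataFrame:
    df = eval_df.copy()
    df["error_type"] = "correct"
    df.loc[(df["pred"] == 1) & (df["label"] == 0), "error_type"] = "false_positive"
    df.loc[(df["pred"] == 0) & (df["label"] == 1), "error_type"] = "false_negative"
    errors = df[df["error_type"] != "correct"]
    # Rank clusters by volume so the next experiment targets the dominant failure mode.
    return (
        errors.groupby(["error_type", "language", "source"])
        .size()
        .rename("count")
        .sort_values(ascending=False)
        .reset_index()
    )
```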
The strongest MLE loop is not train -> metric -> ship. It is mistake -> cluster -> hypothesis -> experiment -> evidence -> simpler system.

## Observation Ledger

Keep a compact decision and evidence trail beside the code, PR, experiment report, or runbook:

```text
Iteration:
Change:
Why this mattered:
Metric movement:
Slice movement:
False positives:
False negatives:
Unexpected errors:
Decision:
Tradeoff accepted:
Lesson captured:
Regression added:
Debt created:
Next iteration:
```

Use the ledger to make model work cumulative. The goal is for each iteration to make the next decision easier, not merely to produce another artifact.

## Core Workflow

### 1. Define the Prediction Contract

Capture the product-level contract before writing model code:

- Prediction target and decision owner
- Input entity, output schema, confidence/calibration fields, and allowed latency
- Batch, online, streaming, or hybrid serving mode
- Fallback behavior when the model, feature store, or dependency is unavailable
- Human review or override path for high-impact decisions
- Privacy, retention, and audit requirements for inputs, predictions, and labels

Do not accept "improve the model" as a requirement. Tie the model to an observable product behavior and a measurable acceptance gate.
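One lightweight way to make the contract testable before any model exists is to pin the response shape in code. A minimal sketch with assumed field names; rename them for the actual capability:

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Prediction:
    entity_id: str
    score: float  # calibrated probability in [0, 1]
    confidence: Literal["low", "medium", "high"]
    model_version: str
    fallback_used: bool  # True when the model or feature store was unavailable


def validate(prediction: Prediction) -> None:
    if not 0.0 <= prediction.score <= 1.0:
        raise ValueError(f"score out of range for {prediction.entity_id}: {prediction.score}")
    if not prediction.model_version:
        raise ValueError("every prediction must carry a model version")
```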
### 2. Lock the Data Contract

Every ML task needs an explicit data contract:

- Entity grain and primary key
- Label definition, label timestamp, and label availability delay
- Feature timestamp, freshness SLA, and point-in-time join rules
- Train, validation, test, and backtest split policy
- Required columns, allowed nulls, ranges, categories, and units
- PII or sensitive fields that must not enter training artifacts or logs
- Dataset version or snapshot ID for reproducibility

Guard against leakage first. If a feature is not available at prediction time, or is joined using future information, remove it or move it to an analysis-only path.
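A minimal point-in-time guard, assuming each training row carries `feature_ts`, `prediction_ts`, and `label_ts` columns (placeholders for whatever the project actually records):

```python
import pandas as pd


def assert_point_in_time(rows: pd.DataFrame) -> None:
    """Fail fast when any row uses information that would not exist at prediction time."""
    future_features = rows["feature_ts"] > rows["prediction_ts"]
    known_outcomes = rows["label_ts"] <= rows["prediction_ts"]
    if future_features.any() or known_outcomes.any():
        raise ValueError(
            f"leakage: {int(future_features.sum())} rows with features from after prediction time, "
            f"{int(known_outcomes.sum())} rows whose outcome was already observable"
        )
```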
### 3. Build a Reproducible Pipeline

Training code should be runnable by another engineer without hidden notebook state:

- Use typed config files or dataclasses for all hyperparameters and paths
- Pin package and model dependencies
- Set random seeds and document any nondeterministic GPU behavior
- Record dataset version, code SHA, config hash, metrics, and artifact URI
- Save preprocessing logic with the model artifact, not separately in a notebook
- Keep train, eval, and inference transformations shared or generated from one source
- Make every step idempotent so retries do not corrupt artifacts or metrics

Prefer immutable values and pure transformation functions. Avoid mutating shared data frames or global config during feature generation.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class TrainingConfig:
    dataset_uri: str
    model_dir: Path
    seed: int
    learning_rate: float
    batch_size: int


def artifact_name(config: TrainingConfig, code_sha: str) -> str:
    config_key = f"{config.dataset_uri}:{config.seed}:{config.learning_rate}:{config.batch_size}"
    config_hash = hashlib.sha256(config_key.encode("utf-8")).hexdigest()[:12]
    return f"{code_sha[:12]}-{config_hash}"
```
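A short usage sketch continuing from the block above; the snapshot URI and hyperparameters are placeholders, and the run record stands in for whatever experiment tracking the project already uses:

```python
import json
import subprocess
from pathlib import Path

config = TrainingConfig(
    dataset_uri="s3://example-bucket/churn/2026-05-01",  # placeholder dataset snapshot
    model_dir=Path("artifacts"),
    seed=42,
    learning_rate=3e-4,
    batch_size=256,
)
code_sha = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
run_id = artifact_name(config, code_sha)

# Record enough to rerun this exact training: data snapshot, seed, code SHA.
config.model_dir.mkdir(parents=True, exist_ok=True)
run_record = {"run_id": run_id, "code_sha": code_sha, "dataset_uri": config.dataset_uri, "seed": config.seed}
(config.model_dir / f"{run_id}.json").write_text(json.dumps(run_record, indent=2))
```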
### 4. Evaluate Before Promotion

Promotion criteria should be declared before training finishes:

- Baseline model and current production model comparison
- Primary metric aligned to product behavior
- Guardrail metrics for latency, calibration, fairness slices, cost, and error concentration
- Slice metrics for important cohorts, geographies, devices, languages, or data sources
- Confidence intervals or repeated-run variance when metrics are noisy
- Failure examples reviewed by a human for high-impact models
- Explicit "do not ship" thresholds

```python
PROMOTION_GATES = {
    "auc": ("min", 0.82),
    "calibration_error": ("max", 0.04),
    "p95_latency_ms": ("max", 80),
}


def assert_promotion_ready(metrics: dict[str, float]) -> None:
    missing = sorted(name for name in PROMOTION_GATES if name not in metrics)
    if missing:
        raise ValueError(f"Model promotion metrics missing required gates: {missing}")

    failures = {
        name: value
        for name, (direction, threshold) in PROMOTION_GATES.items()
        for value in [metrics[name]]
        if (direction == "min" and value < threshold)
        or (direction == "max" and value > threshold)
    }
    if failures:
        raise ValueError(f"Model failed promotion gates: {failures}")
```
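A small usage sketch continuing from the block above, wiring the gate into whatever step publishes the candidate (a CI job, a pipeline task, or a pytest check); the metric values are illustrative:

```python
candidate_metrics = {"auc": 0.845, "calibration_error": 0.031, "p95_latency_ms": 64.0}

# Fails closed: a missing metric or a breached threshold raises before anything is promoted.
assert_promotion_ready(candidate_metrics)
promote_artifact = True  # placeholder for the project's real promotion step
```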
Use offline metrics as gates, not guarantees. When the model changes product behavior, plan shadow evaluation, canary rollout, or A/B testing before full rollout.

### 5. Package for Serving

An ML artifact is production-ready only when the serving contract is testable:

- Model artifact includes version, training data reference, config, and preprocessing
- Input schema rejects invalid, stale, or out-of-range features
- Output schema includes model version and confidence or explanation fields when useful
- Serving path has timeout, batching, resource limits, and fallback behavior
- CPU/GPU requirements are explicit and tested
- Prediction logs avoid PII and include enough identifiers for debugging and label joins
- Integration tests cover missing features, stale features, bad types, empty batches, and fallback path

Never let training-only feature code diverge from serving feature code without a test that proves equivalence.
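A minimal equivalence test sketch; `build_training_features` and `build_serving_features` are hypothetical stand-ins for the project's two code paths (ideally the same module imported twice):

```python
import pandas as pd

from myproject.features import build_training_features  # hypothetical training-path transform
from myproject.serving import build_serving_features    # hypothetical serving-path transform


def test_training_and_serving_features_match() -> None:
    raw = pd.DataFrame(
        [
            {"entity_id": "a1", "amount": 12.5, "events_7d": 3},
            {"entity_id": "b2", "amount": None, "events_7d": 0},
        ]
    )
    offline = build_training_features(raw)
    online = build_serving_features(raw)
    pd.testing.assert_frame_equal(offline, online, check_like=True)
```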
### 6. Operate the Model

Model monitoring needs both system and quality signals:

- Availability, error rate, timeout rate, queue depth, and p50/p95/p99 latency
- Feature null rate, range drift, categorical drift, and freshness drift
- Prediction distribution drift and confidence distribution drift
- Label arrival health and delayed quality metrics
- Business KPI guardrails and rollback triggers
- Per-version dashboards for canaries and rollbacks

Every deployment should have a rollback plan that names the previous artifact, config, data dependency, and traffic-switch mechanism.
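A minimal feature-drift check for the signals above, comparing live traffic to a profile captured at training time; the profile format and thresholds are illustrative, and alerts should route to whatever pager or dashboard the project already uses:

```python
import pandas as pd


def feature_drift_alerts(
    live: pd.DataFrame,
    training_profile: dict[str, dict[str, float]],
    max_null_increase: float = 0.05,
    max_out_of_range: float = 0.01,
) -> list[str]:
    alerts = []
    for column, stats in training_profile.items():
        null_rate = live[column].isna().mean()
        if null_rate - stats["null_rate"] > max_null_increase:
            alerts.append(f"{column}: null rate {null_rate:.2%} vs {stats['null_rate']:.2%} at training")
        out_of_range = ((live[column] < stats["min"]) | (live[column] > stats["max"])).mean()
        if out_of_range > max_out_of_range:
            alerts.append(f"{column}: {out_of_range:.2%} of live values fall outside the training range")
    return alerts
```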
## Review Checklist

- [ ] Prediction contract is explicit and testable
- [ ] Data contract defines entity grain, label timing, feature timing, and snapshot/version
- [ ] Leakage risks were checked against prediction-time availability
- [ ] Training is reproducible from code, config, data version, and seed
- [ ] Metrics compare against baseline and current production model
- [ ] Slice metrics and guardrails are included for high-risk cohorts
- [ ] Promotion gates are automated and fail closed
- [ ] Training and serving transformations are shared or equivalence-tested
- [ ] Model artifact carries version, config, dataset reference, and preprocessing
- [ ] Serving path validates inputs and has timeout, fallback, and rollback behavior
- [ ] Monitoring covers system health, feature drift, prediction drift, and delayed labels
- [ ] Sensitive data is excluded from artifacts, logs, prompts, and examples

## Anti-Patterns

- Notebook state is required to reproduce the model
- Random split leaks future data into validation or test sets (see the sketch after this list)
- Feature joins ignore event time and label availability
- Offline metric improves while important slices regress
- Thresholds are tuned on the test set repeatedly
- Training preprocessing is copied manually into serving code
- Model version is missing from prediction logs
- Monitoring only checks service uptime, not data or prediction quality
- Rollback requires retraining instead of switching to a known-good artifact
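For the random-split anti-pattern, the usual replacement is a cutoff on event time so the holdout only contains data the model could not have seen. A minimal sketch with an illustrative cutoff and column name:

```python
import pandas as pd


def time_based_split(df: pd.DataFrame, cutoff: str) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Everything observed before the cutoff trains; everything at or after it is held out."""
    train = df[df["event_ts"] < cutoff]
    holdout = df[df["event_ts"] >= cutoff]
    return train, holdout


# `events` is the project's labeled event table with a datetime `event_ts` column.
# train_df, valid_df = time_based_split(events, cutoff="2026-03-01")
```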
## Output Expectations

When using this skill, return concrete artifacts: data contract, promotion gates, pipeline steps, test plan, deployment plan, or review findings. Call out unknowns that block production readiness instead of filling them with assumptions.
.agents/skills/mle-workflow/agents/openai.yaml (new file, 7 lines)
@@ -0,0 +1,7 @@
interface:
  display_name: "MLE Workflow"
  short_description: "Production ML workflow and review gates"
  brand_color: "#2563EB"
  default_prompt: "Use $mle-workflow to plan or review a production ML pipeline."
policy:
  allow_implicit_invocation: true
@@ -11,7 +11,7 @@
{
  "name": "ecc",
  "source": "./",
  "description": "The most comprehensive Claude Code plugin — 53 agents, 203 skills, 69 legacy command shims, selective install profiles, and production-ready hooks for TDD, security scanning, code review, and continuous learning",
  "description": "The most comprehensive Claude Code plugin — 54 agents, 204 skills, 69 legacy command shims, selective install profiles, and production-ready hooks for TDD, security scanning, code review, and continuous learning",
  "version": "2.0.0-rc.1",
  "author": {
    "name": "Affaan Mustafa",

@@ -1,7 +1,7 @@
{
  "name": "ecc",
  "version": "2.0.0-rc.1",
  "description": "Battle-tested Claude Code plugin for engineering teams — 53 agents, 203 skills, 69 legacy command shims, production-ready hooks, and selective install workflows evolved through continuous real-world use",
  "description": "Battle-tested Claude Code plugin for engineering teams — 54 agents, 204 skills, 69 legacy command shims, production-ready hooks, and selective install workflows evolved through continuous real-world use",
  "author": {
    "name": "Affaan Mustafa",
    "url": "https://x.com/affaanmustafa"
@@ -1,6 +1,6 @@
# Everything Claude Code (ECC) — Agent Instructions

This is a **production-ready AI coding plugin** providing 53 specialized agents, 203 skills, 69 commands, and automated hook workflows for software development.
This is a **production-ready AI coding plugin** providing 54 specialized agents, 204 skills, 69 commands, and automated hook workflows for software development.

**Version:** 2.0.0-rc.1

@@ -41,6 +41,7 @@ This is a **production-ready AI coding plugin** providing 53 specialized agents,
| rust-reviewer | Rust code review | Rust projects |
| rust-build-resolver | Rust build errors | Rust build failures |
| pytorch-build-resolver | PyTorch runtime/CUDA/training errors | PyTorch build/training failures |
| mle-reviewer | Production ML pipeline review | ML pipelines, evals, serving, monitoring, rollback |
| typescript-reviewer | TypeScript/JavaScript code review | TypeScript/JavaScript projects |

## Agent Orchestration

@@ -145,8 +146,8 @@ Troubleshoot failures: check test isolation → verify mocks → fix implementat

## Project Structure

```
agents/ — 53 specialized subagents
skills/ — 203 workflow skills and domain knowledge
agents/ — 54 specialized subagents
skills/ — 204 workflow skills and domain knowledge
commands/ — 69 slash commands
hooks/ — Trigger-based automations
rules/ — Always-follow guidelines (common + per-language)
README.md (24 changed lines)
@@ -89,7 +89,7 @@ This repo is the raw code only. The guides explain everything.

### v2.0.0-rc.1 — Surface Refresh, Operator Workflows, and ECC 2.0 Alpha (Apr 2026)

- **Dashboard GUI** — New Tkinter-based desktop application (`ecc_dashboard.py` or `npm run dashboard`) with dark/light theme toggle, font customization, and project logo in header and taskbar.
- **Public surface synced to the live repo** — metadata, catalog counts, plugin manifests, and install-facing docs now match the actual OSS surface: 53 agents, 203 skills, and 69 legacy command shims.
- **Public surface synced to the live repo** — metadata, catalog counts, plugin manifests, and install-facing docs now match the actual OSS surface: 54 agents, 204 skills, and 69 legacy command shims.
- **Operator and outbound workflow expansion** — `brand-voice`, `social-graph-ranker`, `connections-optimizer`, `customer-billing-ops`, `ecc-tools-cost-audit`, `google-workspace-ops`, `project-flow-ops`, and `workspace-surface-audit` round out the operator lane.
- **Media and launch tooling** — `manim-video`, `remotion-video-creation`, and upgraded social publishing surfaces make technical explainers and launch content part of the same system.
- **Framework and product surface growth** — `nestjs-patterns`, richer Codex/OpenCode install surfaces, and expanded cross-harness packaging keep the repo usable beyond Claude Code alone.

@@ -218,6 +218,13 @@ npx ecc consult "security reviews" --target claude

It returns matching components, related profiles, and preview/install commands. Use the preview command before installing if you want to inspect the exact file plan.

For production ML/MLOps workflows, keep the install opt-in and component-scoped:

```bash
npx ecc consult "mlops training model deployment" --target claude
npx ecc install --profile minimal --target claude --with capability:machine-learning
```

### Step 1: Install the Plugin (Recommended)

> NOTE: The plugin is convenient, but the OSS installer below is still the most reliable path if your Claude Code build has trouble resolving self-hosted marketplace entries.

@@ -351,7 +358,7 @@ If you stacked methods, clean up in this order:
/plugin list ecc@ecc
```

**That's it!** You now have access to 53 agents, 203 skills, and 69 legacy command shims.
**That's it!** You now have access to 54 agents, 204 skills, and 69 legacy command shims.

### Dashboard GUI

@@ -449,7 +456,7 @@ everything-claude-code/
| |-- plugin.json # Plugin metadata and component paths
| |-- marketplace.json # Marketplace catalog for /plugin marketplace add
|
|-- agents/ # 53 specialized subagents for delegation
|-- agents/ # 54 specialized subagents for delegation
| |-- planner.md # Feature implementation planning
| |-- architect.md # System design decisions
| |-- tdd-guide.md # Test-driven development

@@ -477,6 +484,7 @@ everything-claude-code/
| |-- rust-reviewer.md # Rust code review
| |-- rust-build-resolver.md # Rust build error resolution
| |-- pytorch-build-resolver.md # PyTorch/CUDA training errors
| |-- mle-reviewer.md # Production ML pipeline, eval, serving, and monitoring review
|
|-- skills/ # Workflow definitions and domain knowledge
| |-- coding-standards/ # Language best practices

@@ -538,6 +546,7 @@ everything-claude-code/
| |-- liquid-glass-design/ # iOS 26 Liquid Glass design system (NEW)
| |-- foundation-models-on-device/ # Apple on-device LLM with FoundationModels (NEW)
| |-- swift-concurrency-6-2/ # Swift 6.2 Approachable Concurrency (NEW)
| |-- mle-workflow/ # Production ML data contracts, evals, deployment, monitoring (NEW)
| |-- perl-patterns/ # Modern Perl 5.36+ idioms and best practices (NEW)
| |-- perl-security/ # Perl security patterns, taint mode, safe I/O (NEW)
| |-- perl-testing/ # Perl TDD with Test2::V0, prove, Devel::Cover (NEW)

@@ -975,6 +984,7 @@ Not sure where to start? Use this quick reference. Skills are the canonical work
| Review Python code | `/python-review` | python-reviewer |
| Review TypeScript/JavaScript code | *(invoke `typescript-reviewer` directly)* | typescript-reviewer |
| Audit database queries | *(auto-delegated)* | database-reviewer |
| Review production ML changes | `mle-workflow` skill + `mle-reviewer` agent | mle-reviewer |

### Common Workflows

@@ -1339,9 +1349,9 @@ The configuration is automatically detected from `.opencode/opencode.json`.

| Feature | Claude Code | OpenCode | Status |
|---------|-------------|----------|--------|
| Agents | PASS: 53 agents | PASS: 12 agents | **Claude Code leads** |
| Agents | PASS: 54 agents | PASS: 12 agents | **Claude Code leads** |
| Commands | PASS: 69 commands | PASS: 31 commands | **Claude Code leads** |
| Skills | PASS: 203 skills | PASS: 37 skills | **Claude Code leads** |
| Skills | PASS: 204 skills | PASS: 37 skills | **Claude Code leads** |
| Hooks | PASS: 8 event types | PASS: 11 events | **OpenCode has more!** |
| Rules | PASS: 29 rules | PASS: 13 instructions | **Claude Code leads** |
| MCP Servers | PASS: 14 servers | PASS: Full | **Full parity** |

@@ -1444,9 +1454,9 @@ ECC is the **first plugin to maximize every major AI coding tool**. Here's how e

| Feature | Claude Code | Cursor IDE | Codex CLI | OpenCode |
|---------|------------|------------|-----------|----------|
| **Agents** | 53 | Shared (AGENTS.md) | Shared (AGENTS.md) | 12 |
| **Agents** | 54 | Shared (AGENTS.md) | Shared (AGENTS.md) | 12 |
| **Commands** | 69 | Shared | Instruction-based | 31 |
| **Skills** | 203 | Shared | 10 (native format) | 37 |
| **Skills** | 204 | Shared | 10 (native format) | 37 |
| **Hook Events** | 8 types | 15 types | None yet | 11 types |
| **Hook Scripts** | 20+ scripts | 16 scripts (DRY adapter) | N/A | Plugin hooks |
| **Rules** | 34 (common + lang) | 34 (YAML frontmatter) | Instruction-based | 13 instructions |
@@ -160,7 +160,7 @@ Copy-Item -Recurse rules/typescript "$HOME/.claude/rules/"
/plugin list ecc@ecc
```

**Done!** You can now use 53 agents, 203 skills, and 69 commands.
**Done!** You can now use 54 agents, 204 skills, and 69 commands.

### multi-* commands require additional configuration
agents/mle-reviewer.md (new file, 153 lines)
@@ -0,0 +1,153 @@
---
name: mle-reviewer
description: Production machine-learning engineering reviewer for data contracts, feature pipelines, training reproducibility, offline/online evaluation, model serving, monitoring, and rollback. Use when ML, MLOps, model training, inference, feature store, or evaluation code changes.
tools: ["Read", "Grep", "Glob", "Bash"]
model: sonnet
---

# MLE Reviewer

You are a senior machine-learning engineering reviewer focused on moving model code from "works in a notebook" to production-safe ML systems. Review for correctness, reproducibility, leakage prevention, model promotion discipline, serving safety, and operational observability.

## Start Here

1. Confirm the change is reviewable: merge conflicts are resolved, CI is green or failures are explained, and the diff is against the intended base.
2. Inspect recent changes: `git diff --stat` and `git diff -- '*.py' '*.sql' '*.yaml' '*.yml' '*.json' '*.toml' '*.ipynb'`.
3. Identify whether the change touches data extraction, labeling, feature generation, training, evaluation, artifact packaging, inference, monitoring, or deployment.
4. Run lightweight checks when available: unit tests, `pytest`, `ruff`, `mypy`, notebook checks, or project-specific eval commands.
5. Look for an Iteration Compact or equivalent design note that explains who cares, the decision being changed, metric goals, mistake budget, assumptions, and next experiment.
6. Review the changed files against the production ML checklist below.

Do not rewrite the system unless asked. Report concrete findings with file and line references, ordered by severity.

## Reuse Existing Review Lanes

MLE review should compose existing SWE review surfaces instead of replacing them:

- Use `python-reviewer` for Python style, typing, error handling, dependency hygiene, and unsafe deserialization.
- Use `pytorch-build-resolver` when tensor shape, device placement, gradient, CUDA, DataLoader, or AMP failures block training/inference.
- Use `database-reviewer` for feature tables, label stores, prediction logs, experiment metrics, and point-in-time query performance.
- Use `security-reviewer` for secrets, PII, prompt/data leakage, artifact integrity, unsafe pickle/joblib loading, and supply-chain risk.
- Use `performance-optimizer` for latency, memory, batching, GPU utilization, cold start, and cost per prediction.
- Use `build-error-resolver` for CI, dependency, native extension, CUDA, and environment-specific failures outside PyTorch itself.
- Use `pr-test-analyzer` when the change claims coverage but does not prove leakage, schema drift, serving fallback, or promotion-gate behavior.
- Use `silent-failure-hunter` when pipelines can appear green while skipping data, labels, eval slices, alerts, or artifact publication.
- Use `e2e-runner` for product flows where predictions affect user-visible or business-critical behavior.
- Use `a11y-architect` when prediction explanations, confidence states, or fallback UI need to be accessible.
- Use `doc-updater` when new model contracts, promotion gates, dashboards, or rollback runbooks need durable project documentation.
- Use `documentation-lookup` before relying on evolving ML serving, vector DB, feature store, or eval-framework APIs.

## Critical Review Areas

### Problem Framing and Decision Quality

- The change starts from a user or system decision, not from model architecture preference.
- Stakeholders and failure costs are explicit: false positives, false negatives, latency, compute spend, opacity, and missed opportunities.
- Metric choices follow the mistake budget instead of relying on generic accuracy.
- Assumptions, constraints, and missing requirements are visible enough to challenge.
- The proposed change is the simplest plausible experiment that addresses the dominant error mode.
- Prior art or a nearby known problem was checked before introducing a bespoke approach.
- Adversarial behavior, incentives, selective disclosure, distribution shift, and feedback loops were considered when relevant.

### Metrics, Thresholds, and Error Analysis

- Baseline and current production behavior are compared before model complexity increases.
- Precision, recall, F1, AUC, calibration, latency, cost, and group/slice metrics are used only when they match the decision context.
- Thresholds and configs are treated as product decisions with explicit tradeoffs, not magic constants.
- False positives and false negatives are inspected directly and clustered by shared traits.
- Important mistakes are traced to label quality, missing signal, threshold/config choice, product ambiguity, data bug, or serving mismatch.
- Lessons from errors become regression tests, eval slices, dashboard panels, or runbook entries.

### Data Contract and Leakage

- Entity grain, primary key, label timestamp, feature timestamp, and snapshot/version are explicit.
- Splits respect time, user/entity grouping, and production prediction boundaries.
- Feature joins are point-in-time correct and do not use future labels, post-outcome fields, or mutable aggregates.
- Missing values, units, ranges, categorical domains, and schema drift are validated before training and serving.
- PII and sensitive attributes are excluded or justified, with retention and logging controls.

### Training Reproducibility

- Training is runnable from code, config, dataset version, and seed without notebook state.
- Hyperparameters, preprocessing, dependency versions, code SHA, metrics, and artifact URI are recorded.
- Randomness and GPU nondeterminism are handled deliberately.
- Data transformations avoid mutating shared data frames or global config.
- Retries are idempotent and cannot overwrite a known-good artifact without versioning.

### Evaluation and Promotion

- Metrics compare against a baseline and current production model.
- Promotion gates are declared before selection and fail closed.
- Slice metrics cover important cohorts, traffic sources, geographies, devices, languages, and sparse segments.
- Calibration, latency, cost, fairness, and business guardrails are included when relevant.
- Test data is not repeatedly tuned against.
- Regression tests cover known model, data, and serving failure modes.
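When that last point is unproven, ask for something like the following: a minimal pytest sketch of a pinned regression, where `predict_batch` and the column names are placeholders for the project's own inference entry point:

```python
import pandas as pd

from myproject.inference import predict_batch  # hypothetical entry point


def test_missing_feature_falls_back_instead_of_failing() -> None:
    # Pinned from a past incident: a null feature column must trigger the fallback path,
    # not an exception, and the response must still carry a model version.
    rows = pd.DataFrame([{"entity_id": "a1", "amount": None, "events_7d": 2}])
    result = predict_batch(rows)
    assert bool(result.loc[0, "fallback_used"])
    assert result.loc[0, "model_version"]
```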
### Serving and Deployment

- Training and serving transformations are shared or equivalence-tested.
- Input schema rejects stale, missing, invalid, and out-of-range features.
- Output schema includes model version and confidence or calibration fields when useful.
- Inference path has timeouts, resource limits, batching behavior, and fallback logic.
- Artifact packaging includes preprocessing, config, version, dataset reference, and dependency constraints.
- Rollout plan supports shadow traffic, canary, A/B test, or immediate rollback as appropriate.

### Monitoring and Incident Response

- Monitoring covers service health, feature drift, prediction drift, label arrival, delayed quality, and business guardrails.
- Logs include enough identifiers to join predictions to delayed labels without leaking sensitive data.
- Alerts have thresholds and owners.
- Rollback names the previous artifact, config, data dependency, and traffic switch.
- On-call runbooks include common failure modes: stale features, missing labels, model server overload, schema drift, and bad artifact promotion.

## Common Blockers

- Random train/test split on time-dependent or user-dependent data.
- Feature generation uses fields that are unavailable at prediction time.
- Offline metric improves while key slices regress.
- Training preprocessing was copied into serving code manually.
- Model version is absent from prediction logs.
- Promotion depends on a notebook, manual chart, or local file.
- Monitoring only checks uptime, not data or prediction quality.
- Rollback requires retraining.
- Secrets, credentials, or PII appear in datasets, notebooks, logs, prompts, or artifacts.

## Diagnostic Commands

Use what exists in the project. Do not install new packages without approval.

```bash
pytest
ruff check .
mypy .
python -m pytest tests/ -k "model or feature or eval or inference"
git grep -nE "train_test_split|random_split|fit_transform|predict_proba|model_version|feature_store|artifact"
git grep -nE "customer_id|email|phone|ssn|api_key|secret|token" -- '*.py' '*.sql' '*.ipynb'
```

For notebooks, inspect executed outputs and hidden state. Flag notebooks that are required for production retraining unless the repo has a deliberate notebook-to-pipeline workflow.

## Output Format

```text
[SEVERITY] Issue title
File: path/to/file.py:42
Issue: What is wrong and why it matters for production ML
Fix: Concrete correction or gate to add
```

End with:

```text
Decision: APPROVE | APPROVE WITH WARNINGS | BLOCK
Primary risks: data leakage | irreproducible training | weak eval | unsafe serving | missing monitoring | other
Tests run: commands and outcomes
```

## Approval Criteria

- **APPROVE**: No critical/high MLE risks and relevant tests or eval gates pass.
- **APPROVE WITH WARNINGS**: Medium issues only, with explicit follow-up.
- **BLOCK**: Any plausible leakage, irreproducible promotion, unsafe serving behavior, missing rollback for production deployment, sensitive data exposure, or critical eval gap.

Reference skill: `mle-workflow`.
@@ -1,6 +1,6 @@
# Everything Claude Code (ECC) — Agent Instructions

This is a **production-ready AI coding plugin** providing 53 specialized agents, 203 skills, 69 commands, and automated hook workflows for software development.
This is a **production-ready AI coding plugin** providing 54 specialized agents, 204 skills, 69 commands, and automated hook workflows for software development.

**Version:** 2.0.0-rc.1

@@ -146,8 +146,8 @@

## Project Structure

```
agents/ — 53 specialized subagents
skills/ — 203 workflow skills and domain knowledge
agents/ — 54 specialized subagents
skills/ — 204 workflow skills and domain knowledge
commands/ — 69 slash commands
hooks/ — Trigger-based automations
rules/ — Always-follow guidelines (common + per-language)

@@ -224,7 +224,7 @@ Copy-Item -Recurse rules/typescript "$HOME/.claude/rules/"
/plugin list ecc@ecc
```

**Done!** You can now use 53 agents, 203 skills, and 69 commands.
**Done!** You can now use 54 agents, 204 skills, and 69 commands.

***

@@ -1132,9 +1132,9 @@ opencode

| Feature | Claude Code | OpenCode | Status |
|---------|-------------|----------|--------|
| Agents | PASS: 53 | PASS: 12 | **Claude Code leads** |
| Agents | PASS: 54 | PASS: 12 | **Claude Code leads** |
| Commands | PASS: 69 | PASS: 31 | **Claude Code leads** |
| Skills | PASS: 203 | PASS: 37 | **Claude Code leads** |
| Skills | PASS: 204 | PASS: 37 | **Claude Code leads** |
| Hooks | PASS: 8 event types | PASS: 11 events | **OpenCode has more!** |
| Rules | PASS: 29 | PASS: 13 instructions | **Claude Code leads** |
| MCP Servers | PASS: 14 | PASS: Full | **Full parity** |

@@ -1240,9 +1240,9 @@ ECC is the **first plugin to maximize every major AI coding tool**. Here's how

| Feature | Claude Code | Cursor IDE | Codex CLI | OpenCode |
|---------|------------|------------|-----------|----------|
| **Agents** | 53 | Shared (AGENTS.md) | Shared (AGENTS.md) | 12 |
| **Agents** | 54 | Shared (AGENTS.md) | Shared (AGENTS.md) | 12 |
| **Commands** | 69 | Shared | Instruction-based | 31 |
| **Skills** | 203 | Shared | 10 (native format) | 37 |
| **Skills** | 204 | Shared | 10 (native format) | 37 |
| **Hook Events** | 8 types | 15 types | None yet | 11 types |
| **Hook Scripts** | 20+ scripts | 16 scripts (DRY adapter) | N/A | Plugin hooks |
| **Rules** | 34 (common + lang) | 34 (YAML frontmatter) | Instruction-based | 13 instructions |
@@ -259,6 +259,14 @@
      "devops-infra"
    ]
  },
  {
    "id": "capability:machine-learning",
    "family": "capability",
    "description": "Production machine-learning engineering workflows for data contracts, reproducible training, evaluation, deployment, monitoring, and rollback.",
    "modules": [
      "machine-learning"
    ]
  },
  {
    "id": "capability:supply-chain",
    "family": "capability",

@@ -347,6 +355,14 @@
      "agents-core"
    ]
  },
  {
    "id": "agent:mle-reviewer",
    "family": "agent",
    "description": "Production machine-learning engineering reviewer for ML pipelines, evals, serving, monitoring, and rollback.",
    "modules": [
      "agents-core"
    ]
  },
  {
    "id": "skill:tdd-workflow",
    "family": "skill",

@@ -426,6 +442,14 @@
    "modules": [
      "research-apis"
    ]
  },
  {
    "id": "skill:mle-workflow",
    "family": "skill",
    "description": "Production machine-learning engineering workflow for data contracts, reproducible training, evaluation, deployment, monitoring, and rollback.",
    "modules": [
      "machine-learning"
    ]
  }
]
}

@@ -574,6 +574,32 @@
    "cost": "medium",
    "stability": "stable"
  },
  {
    "id": "machine-learning",
    "kind": "skills",
    "description": "Production machine-learning engineering workflows for data contracts, reproducible training, evaluation, deployment, monitoring, and rollback.",
    "paths": [
      "skills/mle-workflow"
    ],
    "targets": [
      "claude",
      "cursor",
      "antigravity",
      "codex",
      "opencode",
      "codebuddy"
    ],
    "dependencies": [
      "framework-language",
      "workflow-quality",
      "database",
      "devops-infra",
      "security"
    ],
    "defaultInstall": false,
    "cost": "medium",
    "stability": "beta"
  },
  {
    "id": "supply-chain-domain",
    "kind": "skills",

@@ -83,6 +83,7 @@
    "swift-apple",
    "agentic-patterns",
    "devops-infra",
    "machine-learning",
    "supply-chain-domain",
    "document-processing"
  ]

@@ -189,6 +189,7 @@
    "skills/market-research/",
    "skills/mcp-server-patterns/",
    "skills/messages-ops/",
    "skills/mle-workflow/",
    "skills/mysql-patterns/",
    "skills/nanoclaw-repl/",
    "skills/nestjs-patterns/",

@@ -57,6 +57,31 @@ const COMPONENT_ALIASES = Object.freeze({
  'capability:social': ['distribution', 'post', 'posting', 'publish', 'publishing', 'twitter', 'x'],
  'capability:media': ['editing', 'image', 'remotion', 'slides', 'video'],
  'capability:orchestration': ['dmux', 'parallel', 'tmux', 'worktree', 'worktrees'],
  'capability:machine-learning': [
    'data-science',
    'ml',
    'mle',
    'mlops',
    'model',
    'models',
    'pytorch',
    'training',
  ],
  'agent:mle-reviewer': [
    'data-science',
    'ml',
    'mle',
    'mlops',
    'model',
    'models',
    'training',
    'inference',
    'serving',
    'evaluation',
    'evals',
    'model-review',
    'review-training',
  ],
  'framework:nextjs': ['next', 'next.js', 'nextjs'],
  'framework:react': ['react', 'tsx'],
  'framework:django': ['django'],
skills/mle-workflow/SKILL.md (new file, 344 lines)
@@ -0,0 +1,344 @@
---
name: mle-workflow
description: Production machine-learning engineering workflow for data contracts, reproducible training, model evaluation, deployment, monitoring, and rollback. Use when building, reviewing, or hardening ML systems beyond one-off notebooks.
origin: ECC
---

# Machine Learning Engineering Workflow
|
||||
|
||||
Use this skill to turn model work into a production ML system with clear data contracts, repeatable training, measurable quality gates, deployable artifacts, and operational monitoring.
|
||||
|
||||
## When to Activate
|
||||
|
||||
- Planning or reviewing a production ML feature, model refresh, ranking system, recommender, classifier, embedding workflow, or forecasting pipeline
|
||||
- Converting notebook code into a reusable training, evaluation, batch inference, or online inference pipeline
|
||||
- Designing model promotion criteria, offline/online evals, experiment tracking, or rollback paths
|
||||
- Debugging failures caused by data drift, label leakage, stale features, artifact mismatch, or inconsistent training and serving logic
|
||||
- Adding model monitoring, canary rollout, shadow traffic, or post-deploy quality checks
|
||||
|
||||
## Scope Calibration
|
||||
|
||||
Use only the lanes that fit the system in front of you. This skill is useful for ranking, search, recommendations, classifiers, forecasting, embeddings, LLM workflows, anomaly detection, and batch analytics, but it should not force one architecture onto all of them.
|
||||
|
||||
- Do not assume every model has supervised labels, online serving, a feature store, PyTorch, GPUs, human review, A/B tests, or real-time feedback.
|
||||
- Do not add heavyweight MLOps machinery when a data contract, baseline, eval script, and rollback note would make the change reviewable.
|
||||
- Do make assumptions explicit when the project lacks labels, delayed outcomes, slice definitions, production traffic, or monitoring ownership.
|
||||
- Treat examples as interchangeable scaffolds. Replace metrics, serving mode, data stores, and rollout mechanics with the project-native equivalents.
|
||||
|
||||
## Related Skills
|
||||
|
||||
- `python-patterns` and `python-testing` for Python implementation and pytest coverage
|
||||
- `pytorch-patterns` for deep learning models, data loaders, device handling, and training loops
|
||||
- `eval-harness` and `ai-regression-testing` for promotion gates and agent-assisted regression checks
|
||||
- `database-migrations`, `postgres-patterns`, and `clickhouse-io` for data storage and analytics surfaces
|
||||
- `deployment-patterns`, `docker-patterns`, and `security-review` for serving, secrets, containers, and production hardening
|
||||
|
||||

## Reuse the SWE Surface

Do not treat MLE as separate from software engineering. Most ECC SWE workflows apply directly to ML systems, often with stricter failure modes:

| SWE surface | MLE use |
|-------------|---------|
| `product-capability` / `architecture-decision-records` | Turn model work into explicit product contracts and record irreversible data, model, and rollout choices |
| `repo-scan` / `codebase-onboarding` / `code-tour` | Find existing training, feature, serving, eval, and monitoring paths before introducing a parallel ML stack |
| `plan` / `feature-dev` | Scope model changes as product capabilities with data, eval, serving, and rollback phases |
| `tdd-workflow` / `python-testing` | Test feature transforms, split logic, metric calculations, artifact loading, and inference schemas before implementation |
| `code-reviewer` / `mle-reviewer` | Review code quality plus ML-specific leakage, reproducibility, promotion, and monitoring risks |
| `build-fix` / `pr-test-analyzer` | Diagnose broken CI, flaky evals, missing fixtures, and environment-specific model or dependency failures |
| `quality-gate` / `test-coverage` | Require automated evidence for transforms, metrics, inference contracts, promotion gates, and rollback behavior |
| `eval-harness` / `verification-loop` | Turn offline metrics, slice checks, latency budgets, and rollback drills into repeatable gates |
| `ai-regression-testing` | Preserve every production bug as a regression: missing feature, stale label, bad artifact, schema drift, or serving mismatch |
| `api-design` / `backend-patterns` | Design prediction APIs, batch jobs, idempotent retraining endpoints, and response envelopes |
| `database-migrations` / `postgres-patterns` / `clickhouse-io` | Version labels, feature snapshots, prediction logs, experiment metrics, and drift analytics |
| `deployment-patterns` / `docker-patterns` | Package reproducible training and serving images with health checks, resource limits, and rollback |
| `canary-watch` / `dashboard-builder` | Make rollout health visible with model-version, slice, drift, latency, cost, and delayed-label dashboards |
| `security-review` / `security-scan` | Check model artifacts, notebooks, prompts, datasets, and logs for secrets, PII, unsafe deserialization, and supply-chain risk |
| `e2e-testing` / `browser-qa` / `accessibility` | Test critical product flows that consume predictions, including explainability and fallback UI states |
| `benchmark` / `performance-optimizer` | Measure throughput, p95 latency, memory, GPU utilization, and cost per prediction or retrain |
| `cost-aware-llm-pipeline` / `token-budget-advisor` | Route LLM/embedding workloads by quality, latency, and budget instead of defaulting to the largest model |
| `documentation-lookup` / `search-first` | Verify current library behavior for model serving, feature stores, vector DBs, and eval tooling before coding |
| `git-workflow` / `github-ops` / `opensource-pipeline` | Package MLE changes for review with crisp scope, generated artifacts excluded, and reproducible test evidence |
| `strategic-compact` / `dmux-workflows` | Split long ML work into parallel tracks: data contract, eval harness, serving path, monitoring, and docs |

## Ten MLE Task Simulations

Use these simulations as coverage checks when planning or reviewing MLE work. A strong MLE workflow should reduce each task to explicit contracts, reusable SWE surfaces, automated evidence, and a reviewable artifact.

| ID | Common MLE task | Streamlined ECC path | Required output | Pipeline lanes covered |
|----|-----------------|----------------------|-----------------|------------------------|
| MLE-01 | Frame an ambiguous prediction, ranking, recommender, classifier, embedding, or forecast capability | `product-capability`, `plan`, `architecture-decision-records`, `mle-workflow` | Iteration Compact naming who cares, decision owner, success metric, unacceptable mistakes, assumptions, constraints, and first experiment | product contract, stakeholder loss, risk, rollout |
| MLE-02 | Define metric goals, labels, data sources, and the mistake budget | `repo-scan`, `database-reviewer`, `database-migrations`, `postgres-patterns`, `clickhouse-io` | Data and metric contract with entity grain, label timing, label confidence, feature timing, point-in-time joins, split policy, and dataset snapshot | data contract, metric design, leakage, reproducibility |
| MLE-03 | Build a baseline model and scoring path before adding complexity | `tdd-workflow`, `python-testing`, `python-patterns`, `code-reviewer` | Baseline scorer with confusion matrix, calibration notes, latency/cost estimate, known weaknesses, and tests for score shape and determinism | baseline, scoring, testing, serving parity |
| MLE-04 | Generate features from hypotheses about what separates outcomes | `python-patterns`, `pytorch-patterns`, `docker-patterns`, `deployment-patterns` | Feature plan and transform module covering signal source, missing values, outliers, correlations, leakage checks, and train/serve equivalence | feature pipeline, leakage, training, artifacts |
| MLE-05 | Tune thresholds, configs, and model complexity under tradeoffs | `eval-harness`, `ai-regression-testing`, `quality-gate`, `test-coverage` | Threshold/config report comparing precision, recall, F1, AUC, calibration, group slices, latency, cost, complexity, and acceptable error classes | evaluation, threshold, promotion, regression |
| MLE-06 | Run error analysis and turn mistakes into the next experiment | `eval-harness`, `ai-regression-testing`, `mle-reviewer`, `silent-failure-hunter` | Error cluster report for false positives, false negatives, ambiguous labels, stale features, missing signals, and bug traces with lessons captured | error analysis, bug trace, iteration, regression |
| MLE-07 | Package a model artifact for batch or online inference | `api-design`, `backend-patterns`, `security-review`, `security-scan` | Versioned artifact bundle with preprocessing, config, dependency constraints, schema validation, safe loading, and PII-safe logs | artifact, security, inference contract |
| MLE-08 | Ship online serving or batch scoring with feedback capture | `api-design`, `backend-patterns`, `e2e-testing`, `browser-qa`, `accessibility` | Prediction endpoint or batch job with response envelope, timeout, batching, fallback, model version, confidence, feedback logging, and product-flow tests | serving, batch inference, fallback, user workflow |
| MLE-09 | Roll out a model with shadow traffic, canary, A/B test, or rollback | `canary-watch`, `dashboard-builder`, `verification-loop`, `performance-optimizer` | Rollout plan naming traffic split, dashboards, p95 latency, cost, quality guardrails, rollback artifact, and rollback trigger | deployment, canary, rollback |
| MLE-10 | Operate, debug, and refresh a production model after launch | `silent-failure-hunter`, `dashboard-builder`, `mle-reviewer`, `doc-updater`, `github-ops` | Observation ledger and refresh plan with drift checks, delayed-label health, alert owners, runbook updates, retrain criteria, and PR evidence | monitoring, incident response, retraining |

## Iteration Compact

Before touching model code, compress the work into one reviewable artifact. This should be short enough to fit in a PR description and precise enough that another engineer can challenge the tradeoffs.

```text
Goal:
Who cares:
Decision owner:
User or system action changed by the model:
Success metric:
Guardrail metrics:
Mistake budget:
Unacceptable mistakes:
Acceptable mistakes:
Assumptions:
Constraints:
Labels and data snapshot:
Baseline:
Candidate signals:
Threshold or config plan:
Eval slices:
Known risks:
Next experiment:
Rollback or fallback:
```

This compact is the MLE equivalent of a strong SWE design note. It keeps the team from optimizing a metric no one trusts, adding features that do not address the real error mode, or shipping complexity without a rollback.

## Decision Brain

Use this loop whenever the task is ambiguous, high-impact, or metric-heavy:

1. Start from the decision, not the model. Name the action that changes downstream behavior.
2. Name who cares and why. Different stakeholders pay different costs for false positives, false negatives, latency, compute spend, opacity, or missed opportunities.
3. Convert ambiguity into hypotheses. Ask what signal would separate outcomes, what evidence would disprove it, and what simple baseline should be hard to beat.
4. Research prior art or a nearby known problem before inventing a bespoke system.
5. Score choices with `(probability, confidence) x (cost, severity, importance, impact)` (see the sketch after this list).
6. Consider adversarial behavior, incentives, selective disclosure, distribution shift, and feedback loops.
7. Prefer the simplest change that reduces the most important mistake. Simplicity is not laziness; it is a way to minimize blunders while preserving iteration speed.
8. Capture the decision, evidence, counterargument, and next reversible step.
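
A minimal sketch of the step-5 scoring rule, assuming hypothetical option names and hand-assigned 0-1 weights; replace the factors and scale with the project's own risk language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OptionScore:
    """One candidate change scored as expected upside minus expected downside."""

    name: str
    probability: float  # chance the change works as intended (0-1)
    confidence: float   # trust in that probability estimate (0-1)
    cost: float         # build and operating cost, normalized to 0-1
    severity: float     # how bad the worst credible failure is (0-1)
    importance: float   # how much the affected decision matters (0-1)
    impact: float       # expected metric movement if it works (0-1)

    def expected_value(self) -> float:
        # Discount optimistic estimates by confidence, then weigh upside
        # (importance x impact) against downside (cost x severity).
        upside = self.probability * self.confidence * self.importance * self.impact
        downside = self.cost * self.severity
        return upside - downside


def rank_options(options: list[OptionScore]) -> list[OptionScore]:
    """Order candidate changes by expected value, highest first."""
    return sorted(options, key=lambda option: option.expected_value(), reverse=True)
```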

## Metric and Mistake Economics

Choose metrics from failure costs, not habit:

- Use a confusion matrix early so the team can discuss concrete false positives and false negatives instead of abstract accuracy.
- Favor precision when the cost of an incorrect positive decision dominates.
- Favor recall when the cost of a missed positive dominates.
- Use F1 only when the precision/recall tradeoff is genuinely balanced and explainable.
- Use AUC or ranking metrics when ordering quality matters more than a single threshold.
- Track latency, throughput, memory, and cost as first-class metrics because they shape feasible model complexity.
- Compare against a baseline and the current production model before celebrating an offline gain.
- Treat real-world feedback signals as delayed labels with bias, lag, and coverage gaps; do not treat them as ground truth without analysis.

Every metric choice should state which mistake it makes cheaper, which mistake it makes more likely, and who absorbs that cost.
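
A minimal sketch of picking a decision threshold from failure costs rather than accuracy, assuming hypothetical per-mistake costs and an offline scored validation set; swap in the project's own cost model.

```python
def expected_cost(
    scores: list[float],
    labels: list[int],
    threshold: float,
    fp_cost: float,
    fn_cost: float,
) -> float:
    """Total mistake cost at a threshold: false positives pay fp_cost, false negatives pay fn_cost."""
    false_positives = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    false_negatives = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return false_positives * fp_cost + false_negatives * fn_cost


def pick_threshold(scores: list[float], labels: list[int], fp_cost: float, fn_cost: float) -> float:
    """Scan candidate thresholds and keep the one with the lowest expected cost."""
    candidates = sorted(set(scores)) + [max(scores) + 1.0]  # last option means "never predict positive"
    return min(candidates, key=lambda t: expected_cost(scores, labels, t, fp_cost, fn_cost))
```

When `fn_cost` dominates, the selected threshold drops and recall rises; when `fp_cost` dominates, it moves the other way, which is the tradeoff the bullets above describe.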

## Data and Feature Hypotheses

Features should come from a theory of separation:

- Text, categorical fields, numeric histories, graph relationships, recency, frequency, and aggregates are candidate signal families, not automatic features.
- For every feature family, state why it should separate outcomes and how it could leak future information.
- For noisy labels, consider adjudication, label confidence, soft targets, or confidence weighting.
- For class imbalance, compare weighted loss, resampling, threshold movement, and calibrated decision rules.
- For missing values, decide whether absence is informative, imputable, or a reason to abstain.
- For outliers, decide whether to clip, bucket, investigate, or preserve them as rare but important signal.
- For correlated features, check whether they are redundant, unstable, or proxies for unavailable future state.

Do not add model complexity until error analysis shows that the baseline is failing for a reason additional signal or capacity can plausibly fix.
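
A minimal sketch of a recency/frequency feature family with an explicit leakage guard, assuming a hypothetical `(entity_id, event_time)` event log; only events strictly before the prediction timestamp may contribute.

```python
from datetime import datetime, timedelta


def recency_frequency(
    events: list[tuple[str, datetime]],
    entity_id: str,
    as_of: datetime,
    window: timedelta = timedelta(days=30),
) -> dict[str, float]:
    """Frequency and recency for one entity, computed only from events before `as_of`."""
    history = [t for eid, t in events if eid == entity_id and t < as_of]  # leakage guard
    recent = [t for t in history if t >= as_of - window]
    days_since_last = (as_of - max(history)).days if history else float("inf")
    return {
        "events_last_30d": float(len(recent)),
        "days_since_last_event": float(days_since_last),
    }
```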

## Error Analysis Loop

After each baseline, training run, threshold change, or config change:

1. Split mistakes into false positives, false negatives, abstentions, low-confidence cases, and system failures.
2. Cluster errors by shared traits: language, entity type, source, time, geography, device, sparsity, recency, feature freshness, label source, or model version.
3. Separate model mistakes from data bugs, label ambiguity, product ambiguity, instrumentation gaps, and serving mismatches.
4. Trace each major cluster to one of four moves: better labels, better features, better threshold/config, or better product fallback.
5. Preserve every important mistake as a regression test, eval slice, dashboard panel, or runbook entry.
6. Write the next iteration as a falsifiable experiment, not a vague "improve model" task.

The strongest MLE loop is not train -> metric -> ship. It is mistake -> cluster -> hypothesis -> experiment -> evidence -> simpler system.
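
A minimal sketch of the clustering step, assuming each mistake is a dict with a `kind` field and a few trait fields; the real trait list should come from the slices and metadata the project already logs.

```python
from collections import Counter


def cluster_mistakes(mistakes: list[dict], trait: str) -> list[tuple[str, int]]:
    """Count mistakes per trait value so the largest error clusters surface first."""
    return Counter(str(m.get(trait, "unknown")) for m in mistakes).most_common()


# Example: cluster false negatives by source to see where recall is failing.
mistakes = [
    {"kind": "false_negative", "source": "mobile"},
    {"kind": "false_negative", "source": "mobile"},
    {"kind": "false_positive", "source": "web"},
]
false_negatives = [m for m in mistakes if m["kind"] == "false_negative"]
print(cluster_mistakes(false_negatives, "source"))  # [('mobile', 2)]
```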

## Observation Ledger

Keep a compact decision and evidence trail beside the code, PR, experiment report, or runbook:

```text
Iteration:
Change:
Why this mattered:
Metric movement:
Slice movement:
False positives:
False negatives:
Unexpected errors:
Decision:
Tradeoff accepted:
Lesson captured:
Regression added:
Debt created:
Next iteration:
```

Use the ledger to make model work cumulative. The goal is for each iteration to make the next decision easier, not merely to produce another artifact.

## Core Workflow

### 1. Define the Prediction Contract

Capture the product-level contract before writing model code:

- Prediction target and decision owner
- Input entity, output schema, confidence/calibration fields, and allowed latency
- Batch, online, streaming, or hybrid serving mode
- Fallback behavior when the model, feature store, or dependency is unavailable
- Human review or override path for high-impact decisions
- Privacy, retention, and audit requirements for inputs, predictions, and labels

Do not accept "improve the model" as a requirement. Tie the model to an observable product behavior and a measurable acceptance gate.
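
A minimal sketch of the contract as typed request/response schemas, assuming hypothetical field names; the point is that model version, confidence, latency budget, and fallback state are part of the contract, not afterthoughts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictionRequest:
    entity_id: str
    features_version: str     # which feature definition/snapshot produced the inputs


@dataclass(frozen=True)
class PredictionResponse:
    entity_id: str
    score: float              # calibrated score the decision owner acts on
    model_version: str        # required for debugging, canaries, and rollback
    confidence: float         # or a calibration band, when the product uses one
    served_by_fallback: bool  # True when the model path failed and a default was used


LATENCY_BUDGET_MS = 80        # illustrative allowed end-to-end latency for the online path
```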

### 2. Lock the Data Contract

Every ML task needs an explicit data contract:

- Entity grain and primary key
- Label definition, label timestamp, and label availability delay
- Feature timestamp, freshness SLA, and point-in-time join rules
- Train, validation, test, and backtest split policy
- Required columns, allowed nulls, ranges, categories, and units
- PII or sensitive fields that must not enter training artifacts or logs
- Dataset version or snapshot ID for reproducibility

Guard against leakage first. If a feature is not available at prediction time, or is joined using future information, remove it or move it to an analysis-only path.
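
A minimal sketch of enforcing part of that contract before training, assuming hypothetical column names; the timestamp comparison is the leakage check, and the snapshot ID ties the run to one dataset version.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DataContract:
    entity_key: str
    label_column: str
    required_columns: tuple[str, ...]
    snapshot_id: str


def validate_row(row: dict, contract: DataContract) -> list[str]:
    """Return contract violations for one training row (an empty list means valid)."""
    problems = [column for column in contract.required_columns if column not in row]
    feature_time = row.get("feature_timestamp")
    label_time = row.get("label_timestamp")
    if not isinstance(feature_time, datetime) or not isinstance(label_time, datetime):
        problems.append("missing feature_timestamp or label_timestamp")
    elif feature_time >= label_time:
        # Point-in-time rule: features must have been observable before the label event.
        problems.append("feature_timestamp is not strictly before label_timestamp")
    return problems
```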

### 3. Build a Reproducible Pipeline

Training code should be runnable by another engineer without hidden notebook state:

- Use typed config files or dataclasses for all hyperparameters and paths
- Pin package and model dependencies
- Set random seeds and document any nondeterministic GPU behavior
- Record dataset version, code SHA, config hash, metrics, and artifact URI
- Save preprocessing logic with the model artifact, not separately in a notebook
- Keep train, eval, and inference transformations shared or generated from one source
- Make every step idempotent so retries do not corrupt artifacts or metrics

Prefer immutable values and pure transformation functions. Avoid mutating shared data frames or global config during feature generation.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class TrainingConfig:
    dataset_uri: str
    model_dir: Path
    seed: int
    learning_rate: float
    batch_size: int


def artifact_name(config: TrainingConfig, code_sha: str) -> str:
    config_key = f"{config.dataset_uri}:{config.seed}:{config.learning_rate}:{config.batch_size}"
    config_hash = hashlib.sha256(config_key.encode("utf-8")).hexdigest()[:12]
    return f"{code_sha[:12]}-{config_hash}"
```
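
The run-record bullet above can be as simple as a JSON manifest written next to the artifact. A minimal sketch, assuming no experiment tracker is in place; swap in the project's tracker if one exists.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass(frozen=True)
class RunManifest:
    dataset_uri: str
    code_sha: str
    config_hash: str
    metrics: dict
    artifact_uri: str


def write_manifest(manifest: RunManifest, model_dir: Path) -> Path:
    """Persist run lineage beside the artifact so the training run can be reproduced later."""
    model_dir.mkdir(parents=True, exist_ok=True)
    payload = asdict(manifest) | {"recorded_at": datetime.now(timezone.utc).isoformat()}
    path = model_dir / "run_manifest.json"
    path.write_text(json.dumps(payload, indent=2, sort_keys=True))
    return path
```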

### 4. Evaluate Before Promotion

Promotion criteria should be declared before training finishes:

- Baseline model and current production model comparison
- Primary metric aligned to product behavior
- Guardrail metrics for latency, calibration, fairness slices, cost, and error concentration
- Slice metrics for important cohorts, geographies, devices, languages, or data sources
- Confidence intervals or repeated-run variance when metrics are noisy
- Failure examples reviewed by a human for high-impact models
- Explicit "do not ship" thresholds

```python
PROMOTION_GATES = {
    "auc": ("min", 0.82),
    "calibration_error": ("max", 0.04),
    "p95_latency_ms": ("max", 80),
}


def assert_promotion_ready(metrics: dict[str, float]) -> None:
    missing = sorted(name for name in PROMOTION_GATES if name not in metrics)
    if missing:
        raise ValueError(f"Model promotion metrics missing required gates: {missing}")

    failures = {
        name: value
        for name, (direction, threshold) in PROMOTION_GATES.items()
        for value in [metrics[name]]
        if (direction == "min" and value < threshold)
        or (direction == "max" and value > threshold)
    }
    if failures:
        raise ValueError(f"Model failed promotion gates: {failures}")
```

Use offline metrics as gates, not guarantees. When the model changes product behavior, plan shadow evaluation, canary rollout, or A/B testing before full rollout.
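
A minimal sketch of a slice guardrail for the promotion step, assuming per-slice metrics for the candidate and the production model are computed elsewhere; the gate fails when any named slice regresses beyond a tolerance.

```python
def slice_regressions(
    candidate: dict[str, float],
    production: dict[str, float],
    tolerance: float = 0.01,
) -> dict[str, float]:
    """Slices where the candidate metric drops more than `tolerance` below production."""
    return {
        slice_name: round(candidate[slice_name] - production[slice_name], 4)
        for slice_name in production
        if slice_name in candidate and candidate[slice_name] < production[slice_name] - tolerance
    }


def assert_no_slice_regressions(candidate: dict[str, float], production: dict[str, float]) -> None:
    """Fail closed when important cohorts get worse, even if the aggregate metric improves."""
    regressions = slice_regressions(candidate, production)
    if regressions:
        raise ValueError(f"Candidate regresses on slices: {regressions}")
```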

### 5. Package for Serving

An ML artifact is production-ready only when the serving contract is testable:

- Model artifact includes version, training data reference, config, and preprocessing
- Input schema rejects invalid, stale, or out-of-range features
- Output schema includes model version and confidence or explanation fields when useful
- Serving path has timeout, batching, resource limits, and fallback behavior
- CPU/GPU requirements are explicit and tested
- Prediction logs avoid PII and include enough identifiers for debugging and label joins
- Integration tests cover missing features, stale features, bad types, empty batches, and fallback path

Never let training-only feature code diverge from serving feature code without a test that proves equivalence.
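
A minimal sketch of serving-side validation with a fallback, assuming hypothetical feature names and freshness rules and a `model` object with a `predict` method; the real schema should mirror the training transform, as noted above.

```python
from datetime import datetime, timedelta

MAX_FEATURE_AGE = timedelta(hours=6)
FALLBACK_SCORE = 0.0  # conservative default when inputs cannot be trusted


def validate_features(features: dict, now: datetime) -> list[str]:
    """Reject missing, stale, or out-of-range features before they reach the model."""
    problems = []
    if "events_last_30d" not in features:
        problems.append("missing events_last_30d")
    elif not 0 <= features["events_last_30d"] <= 10_000:
        problems.append("events_last_30d out of range")
    feature_time = features.get("feature_timestamp")
    if not isinstance(feature_time, datetime) or now - feature_time > MAX_FEATURE_AGE:
        problems.append("features are stale or undated")
    return problems


def score_or_fallback(features: dict, now: datetime, model) -> tuple[float, bool]:
    """Return (score, served_by_fallback); bad inputs never reach the model."""
    if validate_features(features, now):
        return FALLBACK_SCORE, True
    return float(model.predict(features)), False
```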

### 6. Operate the Model

Model monitoring needs both system and quality signals:

- Availability, error rate, timeout rate, queue depth, and p50/p95/p99 latency
- Feature null rate, range drift, categorical drift, and freshness drift
- Prediction distribution drift and confidence distribution drift
- Label arrival health and delayed quality metrics
- Business KPI guardrails and rollback triggers
- Per-version dashboards for canaries and rollbacks

Every deployment should have a rollback plan that names the previous artifact, config, data dependency, and traffic-switch mechanism.
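
A minimal sketch of a population-stability-style drift check on one numeric feature, assuming a reference sample saved at training time; treat the result as a trigger for investigation and possible rollback, not as proof of a problem.

```python
import math


def population_stability_index(reference: list[float], live: list[float], bins: int = 10) -> float:
    """Compare live traffic to the training-time distribution of one feature."""
    lo, hi = min(reference), max(reference)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]  # fixed reference bin edges

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for value in values:
            counts[sum(1 for edge in edges if value > edge)] += 1
        return [max(count / len(values), 1e-6) for count in counts]  # smooth to avoid log(0)

    ref_p, live_p = proportions(reference), proportions(live)
    return sum((lp - rp) * math.log(lp / rp) for rp, lp in zip(ref_p, live_p))


# Common rule of thumb: investigate above roughly 0.1; treat 0.25+ as a rollout risk.
```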

## Review Checklist

- [ ] Prediction contract is explicit and testable
- [ ] Data contract defines entity grain, label timing, feature timing, and snapshot/version
- [ ] Leakage risks were checked against prediction-time availability
- [ ] Training is reproducible from code, config, data version, and seed
- [ ] Metrics compare against baseline and current production model
- [ ] Slice metrics and guardrails are included for high-risk cohorts
- [ ] Promotion gates are automated and fail closed
- [ ] Training and serving transformations are shared or equivalence-tested
- [ ] Model artifact carries version, config, dataset reference, and preprocessing
- [ ] Serving path validates inputs and has timeout, fallback, and rollback behavior
- [ ] Monitoring covers system health, feature drift, prediction drift, and delayed labels
- [ ] Sensitive data is excluded from artifacts, logs, prompts, and examples

## Anti-Patterns

- Notebook state is required to reproduce the model
- Random split leaks future data into validation or test sets
- Feature joins ignore event time and label availability
- Offline metric improves while important slices regress
- Thresholds are tuned on the test set repeatedly
- Training preprocessing is copied manually into serving code
- Model version is missing from prediction logs
- Monitoring only checks service uptime, not data or prediction quality
- Rollback requires retraining instead of switching to a known-good artifact

## Output Expectations

When using this skill, return concrete artifacts: data contract, promotion gates, pipeline steps, test plan, deployment plan, or review findings. Call out unknowns that block production readiness instead of filling them with assumptions.
@@ -77,6 +77,10 @@ function run() {
    assert.ok(skillDirs.length > 0, 'Expected at least one .agents/skills entry');
  })) passed++; else failed++;

  if (test('Codex skill surface includes the MLE workflow', () => {
    assert.ok(skillDirs.includes('mle-workflow'), 'Expected .agents/skills/mle-workflow');
  })) passed++; else failed++;

  if (test('SKILL.md frontmatter matches Codex validator expectations', () => {
    for (const skillDir of skillDirs) {
      const frontmatter = parseFrontmatter(skillDir);
223
tests/ci/mle-workflow-coverage.test.js
Normal file
@@ -0,0 +1,223 @@
const assert = require('assert');
const fs = require('fs');
const path = require('path');

const REPO_ROOT = path.resolve(__dirname, '..', '..');
const CANONICAL_SKILL = path.join(REPO_ROOT, 'skills', 'mle-workflow', 'SKILL.md');
const CODEX_SKILL = path.join(REPO_ROOT, '.agents', 'skills', 'mle-workflow', 'SKILL.md');

const EXPECTED_TASKS = [
  'MLE-01',
  'MLE-02',
  'MLE-03',
  'MLE-04',
  'MLE-05',
  'MLE-06',
  'MLE-07',
  'MLE-08',
  'MLE-09',
  'MLE-10',
];

const PIPELINE_LANES = [
  'product contract',
  'stakeholder loss',
  'data contract',
  'metric design',
  'leakage',
  'feature pipeline',
  'baseline',
  'scoring',
  'serving parity',
  'training',
  'artifacts',
  'evaluation',
  'threshold',
  'promotion',
  'error analysis',
  'bug trace',
  'iteration',
  'inference contract',
  'serving',
  'batch inference',
  'deployment',
  'canary',
  'rollback',
  'monitoring',
  'incident response',
  'retraining',
  'security',
  'cost',
];

const SWE_SURFACES = [
  'product-capability',
  'architecture-decision-records',
  'repo-scan',
  'database-reviewer',
  'tdd-workflow',
  'python-testing',
  'python-patterns',
  'pytorch-patterns',
  'docker-patterns',
  'deployment-patterns',
  'eval-harness',
  'quality-gate',
  'api-design',
  'security-review',
  'e2e-testing',
  'browser-qa',
  'build-fix',
  'pr-test-analyzer',
  'canary-watch',
  'dashboard-builder',
  'verification-loop',
  'performance-optimizer',
  'silent-failure-hunter',
  'doc-updater',
  'github-ops',
];

const JUDGMENT_PRIMITIVES = [
  'Iteration Compact',
  'Who cares',
  'Decision owner',
  'Mistake budget',
  'Unacceptable mistakes',
  'Acceptable mistakes',
  'Decision Brain',
  'adversarial behavior',
  'selective disclosure',
  '(probability, confidence) x (cost, severity, importance, impact)',
  'Metric and Mistake Economics',
  'confusion matrix',
  'false positives',
  'false negatives',
  'precision',
  'recall',
  'F1',
  'AUC',
  'latency',
  'cost',
  'Data and Feature Hypotheses',
  'label confidence',
  'class imbalance',
  'missing values',
  'outliers',
  'correlated features',
  'Error Analysis Loop',
  'Observation Ledger',
  'Lesson captured',
  'Regression added',
  'Next iteration',
];

const FORBIDDEN_DOMAIN_EXAMPLES = [
  'reddit',
  'subreddit',
  'moderation',
  'moderator',
];

const SCOPE_CALIBRATION_PHRASES = [
  'Use only the lanes that fit the system in front of you',
  'Do not assume every model has supervised labels',
  'Do not add heavyweight MLOps machinery',
  'Replace metrics, serving mode, data stores, and rollout mechanics',
];

function stripFrontmatter(content) {
  return content.replace(/^---\r?\n[\s\S]*?\r?\n---(?:\r?\n|$)/, '');
}

function readSkill(filePath) {
  return fs.readFileSync(filePath, 'utf8');
}

function extractSimulationRows(content) {
  return content
    .split('\n')
    .filter(line => /^\| MLE-\d{2} \|/.test(line));
}

function test(name, fn) {
  try {
    fn();
    console.log(`  ✓ ${name}`);
    return true;
  } catch (error) {
    console.log(`  ✗ ${name}`);
    console.log(`    Error: ${error.message}`);
    return false;
  }
}

function run() {
  console.log('\n=== Testing MLE workflow coverage ===\n');

  let passed = 0;
  let failed = 0;

  const canonical = readSkill(CANONICAL_SKILL);
  const codex = readSkill(CODEX_SKILL);
  const canonicalRows = extractSimulationRows(canonical);

  if (test('canonical and Codex MLE workflow bodies stay in sync', () => {
    assert.strictEqual(stripFrontmatter(codex), stripFrontmatter(canonical));
  })) passed++; else failed++;

  if (test('frontmatter stripping tolerates CRLF and EOF delimiters', () => {
    assert.strictEqual(stripFrontmatter('---\r\nname: mle\r\n---\r\n# Body'), '# Body');
    assert.strictEqual(stripFrontmatter('---\nname: mle\n---'), '');
  })) passed++; else failed++;

  if (test('MLE workflow simulates ten common MLE tasks', () => {
    assert.strictEqual(canonicalRows.length, 10, 'Expected exactly ten MLE simulation rows');
    for (const taskId of EXPECTED_TASKS) {
      assert.ok(canonicalRows.some(row => row.includes(`| ${taskId} |`)), `Missing ${taskId}`);
    }
  })) passed++; else failed++;

  if (test('simulations cover the full production ML pipeline', () => {
    const normalized = canonicalRows.join('\n').toLowerCase();
    for (const lane of PIPELINE_LANES) {
      assert.ok(normalized.includes(lane), `Missing pipeline lane: ${lane}`);
    }
  })) passed++; else failed++;

  if (test('simulations reuse the existing SWE workflow surface', () => {
    for (const surface of SWE_SURFACES) {
      assert.ok(canonical.includes(`\`${surface}\``), `Missing SWE surface: ${surface}`);
    }
  })) passed++; else failed++;

  if (test('workflow captures MLE judgment primitives beyond a checklist', () => {
    for (const primitive of JUDGMENT_PRIMITIVES) {
      assert.ok(canonical.includes(primitive), `Missing judgment primitive: ${primitive}`);
    }
  })) passed++; else failed++;

  if (test('workflow calibrates scope instead of forcing one ML architecture', () => {
    for (const phrase of SCOPE_CALIBRATION_PHRASES) {
      assert.ok(canonical.includes(phrase), `Missing scope calibration phrase: ${phrase}`);
    }
  })) passed++; else failed++;

  if (test('promotion gate example reports missing metrics explicitly', () => {
    assert.ok(canonical.includes('missing = sorted(name for name in PROMOTION_GATES if name not in metrics)'));
    assert.ok(canonical.includes('Model promotion metrics missing required gates'));
  })) passed++; else failed++;

  if (test('workflow stays general and avoids narrow domain examples', () => {
    const normalized = canonical.toLowerCase();
    for (const forbidden of FORBIDDEN_DOMAIN_EXAMPLES) {
      assert.ok(!normalized.includes(forbidden), `Found narrow domain example: ${forbidden}`);
    }
  })) passed++; else failed++;

  console.log(`\nPassed: ${passed}`);
  console.log(`Failed: ${failed}`);
  process.exit(failed > 0 ? 1 : 0);
}

run();
@@ -98,6 +98,12 @@ function runTests() {
      'Should include lang:c');
    assert.ok(components.some(component => component.id === 'capability:security'),
      'Should include capability:security');
    assert.ok(components.some(component => component.id === 'capability:machine-learning'),
      'Should include capability:machine-learning');
    assert.ok(components.some(component => component.id === 'agent:mle-reviewer'),
      'Should include agent:mle-reviewer');
    assert.ok(components.some(component => component.id === 'skill:mle-workflow'),
      'Should include skill:mle-workflow');
  })) passed++; else failed++;

  if (test('gets install component details and validates component IDs', () => {
@@ -271,6 +277,30 @@ function runTests() {
    );
  })) passed++; else failed++;

  if (test('resolves machine-learning component with workflow dependencies', () => {
    const plan = resolveInstallPlan({
      includeComponentIds: ['capability:machine-learning'],
      target: 'claude',
      projectRoot: '/workspace/ml-app',
    });

    assert.ok(plan.selectedModuleIds.includes('machine-learning'),
      'Should include machine-learning module');
    assert.ok(plan.selectedModuleIds.includes('framework-language'),
      'Should include Python and framework-language support');
    assert.ok(plan.selectedModuleIds.includes('workflow-quality'),
      'Should include eval and verification workflows');
    assert.ok(plan.selectedModuleIds.includes('database'),
      'Should include database/data persistence support');
    assert.ok(plan.selectedModuleIds.includes('devops-infra'),
      'Should include deployment and container support');
    assert.ok(plan.selectedModuleIds.includes('security'),
      'Should include security through machine-learning dependencies');
    assert.ok(plan.operations.some(operation => (
      operation.sourceRelativePath === 'skills/mle-workflow'
    )), 'Should install the MLE workflow skill');
  })) passed++; else failed++;

  if (test('resolves explicit modules with dependency expansion', () => {
    const plan = resolveInstallPlan({ moduleIds: ['security'] });
    assert.ok(plan.selectedModuleIds.includes('security'), 'Should include requested module');
@@ -83,6 +83,28 @@ function runTests() {
    assert.match(result.stdout, /npx ecc plan --profile minimal --target claude --with capability:security/);
  })) passed++; else failed++;

  if (test('recommends machine-learning component and reviewer agent', () => {
    const result = run(['mlops', 'training', 'model', 'deployment', '--json']);

    assert.strictEqual(result.status, 0, result.stderr);
    const payload = parseJson(result.stdout);
    assert.strictEqual(payload.matches[0].componentId, 'capability:machine-learning');
    assert.ok(payload.matches[0].installCommand.includes('--with capability:machine-learning'));
    assert.ok(payload.matches.some(match => match.componentId === 'agent:mle-reviewer'));
    assert.ok(!payload.profiles.some(profile => profile.id === 'mle'));
  })) passed++; else failed++;

  if (test('matches tokenized model review queries without making review a generic alias', () => {
    const result = run(['model', 'review', '--json']);

    assert.strictEqual(result.status, 0, result.stderr);
    const payload = parseJson(result.stdout);
    const reviewer = payload.matches.find(match => match.componentId === 'agent:mle-reviewer');
    assert.ok(reviewer, 'Should include agent:mle-reviewer');
    assert.ok(reviewer.reasons.includes('matched "model"'));
    assert.ok(!reviewer.reasons.includes('matched "review"'));
  })) passed++; else failed++;

  if (test('works from outside the ECC repository', () => {
    const projectDir = fs.mkdtempSync(path.join(os.tmpdir(), 'ecc-consult-project-'));
    try {