chore: restructure skills repo with new agents and skill bundles
- Add new skills: deep-dive, docs-rag, meta-creator, ppt-maker, sdlc
- Add agent configs: g-assistent, meta-creator, sdlc with prompt files
- Add reference docs for custom agents and skills specification
- Add utility scripts: install-agents.sh, orchestrate.py, puml2svg.sh
- Update README and commit-message skill config
- Remove deprecated skills: codereview, python, testing, typescript
- Add .gitignore
# Kiro CLI Custom Agents — Configuration Reference

> Source: https://kiro.dev/docs/cli/custom-agents/configuration-reference/
> Updated: 2026-04-14

---

## Quick Start

The recommended way to create an agent configuration is the `/agent generate` command inside a Kiro session, which generates the config with AI assistance.

---

## File Locations

### Local agents (project level)

```
<project>/.kiro/agents/<name>.json
```

Available only when Kiro CLI runs in that directory or one of its subdirectories.

### Global agents (user level)

```
~/.kiro/agents/<name>.json
```

Available from any directory.

### Precedence

When a local and a global agent share the same name, **the local one takes precedence** (and a warning is printed).

---

## Field Overview

| Field | Description |
|-------|-------------|
| `name` | Agent name (optional; defaults to the file name) |
| `description` | Agent description |
| `prompt` | System prompt (inline text or a `file://` URI) |
| `mcpServers` | MCP servers the agent can access |
| `tools` | List of available tools |
| `toolAliases` | Tool name remapping |
| `allowedTools` | Tools usable without confirmation |
| `toolsSettings` | Per-tool configuration |
| `resources` | Local resources the agent can access |
| `hooks` | Lifecycle hook commands |
| `includeMcpJson` | Whether to include MCP servers defined in mcp.json |
| `model` | ID of the model to use |
| `keyboardShortcut` | Shortcut for switching to the agent quickly |
| `welcomeMessage` | Greeting shown when switching to the agent |

---

## Field Details

### `name`

The agent's identifier, used for display and identification.

```json
{ "name": "aws-expert" }
```

---

### `description`

A human-readable description that helps distinguish agents from one another.

```json
{ "description": "An agent specialized for AWS infrastructure tasks" }
```

---

### `prompt`

Acts like a system prompt, giving the agent high-level context. Accepts inline text or a `file://` URI.

**Inline:**

```json
{ "prompt": "You are an expert AWS infrastructure specialist" }
```

**File reference:**

```json
{ "prompt": "file://./prompts/aws-expert.md" }
```

**Path resolution rules:**

- Relative paths resolve against the directory containing the agent config file:
  - `"file://./prompt.md"` → same directory
  - `"file://../shared/prompt.md"` → parent directory
- Absolute paths are used as-is:
  - `"file:///home/user/prompts/agent.md"`

---

### `mcpServers`

Defines the MCP servers the agent can access.

```json
{
  "mcpServers": {
    "fetch": {
      "command": "fetch3.1",
      "args": []
    },
    "git": {
      "command": "git-mcp",
      "args": [],
      "env": { "GIT_CONFIG_GLOBAL": "/dev/null" },
      "timeout": 120000
    }
  }
}
```

**Fields:**

- `command` (required): command that launches the MCP server
- `args` (optional): command arguments
- `env` (optional): environment variables
- `timeout` (optional): per-request timeout in milliseconds; defaults to `120000`
- `oauth` (optional): OAuth configuration for HTTP MCP servers
  - `redirectUri`: custom redirect URI
  - `oauthScopes`: array of requested OAuth scopes

**OAuth example:**

```json
{
  "mcpServers": {
    "github": {
      "type": "http",
      "url": "https://api.github.com/mcp",
      "oauth": {
        "redirectUri": "127.0.0.1:8080",
        "oauthScopes": ["repo", "user"]
      }
    }
  }
}
```

---

### `tools`

The list of tools the agent may use.

```json
{
  "tools": ["read", "write", "shell", "@git", "@rust-analyzer/check_code"]
}
```

**Reference forms:**

- Built-in tool: `"read"`, `"shell"`
- All tools of an MCP server: `"@server_name"`
- A specific MCP server tool: `"@server_name/tool_name"`
- All tools: `"*"`
- All built-in tools: `"@builtin"`

---

### `toolAliases`

Renames tools, either to resolve naming conflicts or to create more intuitive names.

```json
{
  "toolAliases": {
    "@github-mcp/get_issues": "github_issues",
    "@gitlab-mcp/get_issues": "gitlab_issues",
    "@aws-cloud-formation/deploy_stack_with_parameters": "deploy_cf"
  }
}
```

---

### `allowedTools`

Tools the agent may use without asking for confirmation. Supports exact matches and wildcards.

```json
{
  "allowedTools": [
    "read",
    "@git/git_status",
    "@server/read_*",
    "@fetch"
  ]
}
```

**Matching rules:**

| Pattern | Meaning |
|---------|---------|
| `"read"` | Exact match on a built-in tool |
| `"@server_name/tool_name"` | Exact match on an MCP tool |
| `"@server_name"` | Every tool of that server |
| `"@server/read_*"` | Prefix wildcard |
| `"@server/*_get"` | Suffix wildcard |
| `"@git-*/*"` | Wildcard on the server name |
| `"?ead"` | `?` matches a single character |

> **Note:** `allowedTools` does not accept `"*"` to allow every tool.
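The matching rules above are shell-style glob patterns. As an illustration only (the actual Kiro CLI matcher may differ), the table can be sketched with Python's `fnmatch`:

```python
from fnmatch import fnmatchcase

def is_allowed(tool: str, allowed: list[str]) -> bool:
    """tool is a built-in name ("read") or "@server/tool" for MCP tools."""
    for pattern in allowed:
        if pattern == "*":
            continue  # per the note above, "*" is not supported here
        if "/" not in pattern and pattern.startswith("@"):
            # bare "@server_name" allows every tool on that server
            if tool == pattern or tool.startswith(pattern + "/"):
                return True
        if fnmatchcase(tool, pattern):
            return True
    return False

print(is_allowed("@server/read_file", ["@server/read_*"]))  # True (prefix wildcard)
print(is_allowed("@git/git_status", ["@git"]))              # True (whole server)
print(is_allowed("read", ["?ead"]))                         # True (single-char wildcard)
print(is_allowed("@git/git_push", ["@git/git_status"]))     # False
```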

---

### `toolsSettings`

Per-tool configuration.

```json
{
  "toolsSettings": {
    "write": {
      "allowedPaths": ["src/**", "tests/**"]
    },
    "shell": {
      "allowedCommands": ["git status", "git fetch"],
      "deniedCommands": ["git commit .*", "git push .*"],
      "autoAllowReadonly": true
    },
    "@git/git_status": {
      "git_user": "$GIT_USER"
    }
  }
}
```

---

### `resources`

Local resources the agent can access. Three types are supported.

#### File resources (`file://`)

Loaded directly into context at startup.

```json
{
  "resources": [
    "file://README.md",
    "file://docs/**/*.md"
  ]
}
```

#### Skill resources (`skill://`)

Only the metadata (name/description) is loaded at startup; the full content is loaded on demand, keeping the context lean.

Skill files must start with YAML frontmatter:

```markdown
---
name: dynamodb-data-modeling
description: Guide for DynamoDB data modeling best practices.
---

# DynamoDB Data Modeling
...
```

```json
{
  "resources": [
    "skill://.kiro/skills/**/SKILL.md"
  ]
}
```

#### Knowledge base resources (`knowledgeBase`)

Supports indexed retrieval over large document sets.

```json
{
  "resources": [
    {
      "type": "knowledgeBase",
      "source": "file://./docs",
      "name": "ProjectDocs",
      "description": "Project documentation and guides",
      "indexType": "best",
      "autoUpdate": true
    }
  ]
}
```

| Field | Required | Description |
|-------|----------|-------------|
| `type` | Yes | Always `"knowledgeBase"` |
| `source` | Yes | Path to index, with a `file://` prefix |
| `name` | Yes | Display name |
| `description` | No | Description of the content |
| `indexType` | No | `"best"` (default, higher quality) or `"fast"` |
| `autoUpdate` | No | Re-index when the agent starts; defaults to `false` |

---

### `hooks`

Runs commands at specific points in the agent lifecycle.

```json
{
  "hooks": {
    "agentSpawn": [
      { "command": "git status" }
    ],
    "userPromptSubmit": [
      { "command": "ls -la" }
    ],
    "preToolUse": [
      {
        "matcher": "execute_bash",
        "command": "{ echo \"$(date) - Bash:\"; cat; } >> /tmp/audit.log"
      }
    ],
    "postToolUse": [
      {
        "matcher": "fs_write",
        "command": "cargo fmt --all"
      }
    ],
    "stop": [
      { "command": "npm test" }
    ]
  }
}
```

**Trigger points:**

| Hook | Fires |
|------|-------|
| `agentSpawn` | When the agent initializes |
| `userPromptSubmit` | When the user submits a message |
| `preToolUse` | Before a tool runs (can block the call) |
| `postToolUse` | After a tool runs |
| `stop` | When the assistant finishes responding |

Each hook entry takes:

- `command` (required): the command to run
- `matcher` (optional): a tool-name pattern for `preToolUse`/`postToolUse`, using internal tool names (e.g. `fs_read`, `fs_write`, `execute_bash`, `use_aws`)

---

### `includeMcpJson`

Whether to include the MCP servers defined in `~/.kiro/settings/mcp.json` (global) and `<cwd>/.kiro/settings/mcp.json` (workspace).

```json
{ "includeMcpJson": true }
```

---

### `model`

The ID of the model this agent uses. When unset or unavailable, Kiro falls back to the default model.

```json
{ "model": "claude-sonnet-4" }
```

Run the `/model` command to list the available models.

---

### `keyboardShortcut`

A keyboard shortcut for switching to this agent quickly.

```json
{ "keyboardShortcut": "ctrl+a" }
```

**Format:** `[modifier+]key`
**Modifiers:** `ctrl`, `shift`
**Keys:** `a-z`, `0-9`

- If the agent is not active: switches to it
- If the agent is already active: switches back to the previous agent
- If several agents claim the same shortcut, the shortcut is disabled and a warning is printed

---

### `welcomeMessage`

The greeting shown when switching to this agent.

```json
{ "welcomeMessage": "What would you like to build today?" }
```

---

## Complete Example

```json
{
  "name": "aws-rust-agent",
  "description": "Specialized agent for AWS and Rust development",
  "prompt": "file://./prompts/aws-rust-expert.md",
  "mcpServers": {
    "fetch": { "command": "fetch-server", "args": [] },
    "git": { "command": "git-mcp", "args": [] }
  },
  "tools": ["read", "write", "shell", "aws", "@git", "@fetch/fetch_url"],
  "toolAliases": {
    "@git/git_status": "status",
    "@fetch/fetch_url": "get"
  },
  "allowedTools": ["read", "@git/git_status"],
  "toolsSettings": {
    "write": { "allowedPaths": ["src/**", "tests/**", "Cargo.toml"] },
    "aws": { "allowedServices": ["s3", "lambda"], "autoAllowReadonly": true }
  },
  "resources": [
    "file://README.md",
    "file://docs/**/*.md"
  ],
  "hooks": {
    "agentSpawn": [{ "command": "git status" }],
    "postToolUse": [{ "matcher": "fs_write", "command": "cargo fmt --all" }]
  },
  "model": "claude-sonnet-4",
  "keyboardShortcut": "ctrl+shift+r",
  "welcomeMessage": "Ready to help with AWS and Rust development!"
}
```

---

## Best Practices

### Local vs. global agents

| Local agents | Global agents |
|--------------|---------------|
| Project-specific configuration | Agents shared across projects |
| Need access to project files/tools | Personal productivity tooling |
| Shared with the team via version control | Everyday tools and workflows |

### Security

- Review `allowedTools` carefully; prefer exact matches over wildcards
- Configure `toolsSettings` for sensitive operations (e.g. restrict `allowedPaths`)
- With write tools (`write`, `shell`) enabled, the agent has the same filesystem permissions as the current user, including read/write access to everything under `~/.kiro`
- Use `preToolUse` hooks to audit or block sensitive operations
- Test agents thoroughly in a safe environment before sharing them

### Organization

- Use descriptive names
- State the agent's purpose in `description`
- Keep prompts in separate files
- Put local agents under version control along with the project

---

## Related Documentation

- [Creating custom agents](https://kiro.dev/docs/cli/custom-agents/creating/)
- [Built-in tools reference](https://kiro.dev/docs/cli/reference/built-in-tools/)
- [Hooks documentation](https://kiro.dev/docs/cli/hooks)
- [Agent examples](https://kiro.dev/docs/cli/custom-agents/examples/)
> ## Documentation Index
> Fetch the complete documentation index at: https://agentskills.io/llms.txt
> Use this file to discover all available pages before exploring further.

# Specification

> The complete format specification for Agent Skills.

## Directory structure

A skill is a directory containing, at minimum, a `SKILL.md` file:

```
skill-name/
├── SKILL.md        # Required: metadata + instructions
├── scripts/        # Optional: executable code
├── references/     # Optional: documentation
├── assets/         # Optional: templates, resources
└── ...             # Any additional files or directories
```

## `SKILL.md` format

The `SKILL.md` file must contain YAML frontmatter followed by Markdown content.

### Frontmatter

| Field | Required | Constraints |
| --------------- | -------- | ------------------------------------------------------------------------------------------------------------------ |
| `name` | Yes | Max 64 characters. Lowercase letters, numbers, and hyphens only. Must not start or end with a hyphen. |
| `description` | Yes | Max 1024 characters. Non-empty. Describes what the skill does and when to use it. |
| `license` | No | License name or reference to a bundled license file. |
| `compatibility` | No | Max 500 characters. Indicates environment requirements (intended product, system packages, network access, etc.). |
| `metadata` | No | Arbitrary key-value mapping for additional metadata. |
| `allowed-tools` | No | Space-delimited list of pre-approved tools the skill may use. (Experimental) |

<Card>
  **Minimal example:**

```markdown SKILL.md
---
name: skill-name
description: A description of what this skill does and when to use it.
---
```

  **Example with optional fields:**

```markdown SKILL.md
---
name: pdf-processing
description: Extract PDF text, fill forms, merge files. Use when handling PDFs.
license: Apache-2.0
metadata:
  author: example-org
  version: "1.0"
---
```
</Card>

#### `name` field

The required `name` field:

* Must be 1-64 characters
* May only contain lowercase alphanumeric characters (`a-z`, `0-9`) and hyphens (`-`)
* Must not start or end with a hyphen (`-`)
* Must not contain consecutive hyphens (`--`)
* Must match the parent directory name

<Card>
  **Valid examples:**

```yaml
name: pdf-processing
```

```yaml
name: data-analysis
```

```yaml
name: code-review
```

  **Invalid examples:**

```yaml
name: PDF-Processing # uppercase not allowed
```

```yaml
name: -pdf # cannot start with hyphen
```

```yaml
name: pdf--processing # consecutive hyphens not allowed
```
</Card>
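These rules can be checked mechanically. A quick validator sketch for the constraints above (ours, not part of the spec or the `skills-ref` tooling):

```python
import re

# One or more lowercase alphanumeric runs joined by single hyphens; this
# encodes the no-leading/trailing-hyphen and no-consecutive-hyphen rules.
NAME_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")

def valid_skill_name(name: str) -> bool:
    return len(name) <= 64 and bool(NAME_RE.match(name))

print(valid_skill_name("pdf-processing"))   # True
print(valid_skill_name("PDF-Processing"))   # False: uppercase
print(valid_skill_name("-pdf"))             # False: leading hyphen
print(valid_skill_name("pdf--processing"))  # False: consecutive hyphens
```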

#### `description` field

The required `description` field:

* Must be 1-1024 characters
* Should describe both what the skill does and when to use it
* Should include specific keywords that help agents identify relevant tasks

<Card>
  **Good example:**

```yaml
description: Extracts text and tables from PDF files, fills PDF forms, and merges multiple PDFs. Use when working with PDF documents or when the user mentions PDFs, forms, or document extraction.
```

  **Poor example:**

```yaml
description: Helps with PDFs.
```
</Card>

#### `license` field

The optional `license` field:

* Specifies the license applied to the skill
* We recommend keeping it short (either the name of a license or the name of a bundled license file)

<Card>
  **Example:**

```yaml
license: Proprietary. LICENSE.txt has complete terms
```
</Card>

#### `compatibility` field

The optional `compatibility` field:

* Must be 1-500 characters if provided
* Should only be included if your skill has specific environment requirements
* Can indicate intended product, required system packages, network access needs, etc.

<Card>
  **Examples:**

```yaml
compatibility: Designed for Claude Code (or similar products)
```

```yaml
compatibility: Requires git, docker, jq, and access to the internet
```

```yaml
compatibility: Requires Python 3.14+ and uv
```
</Card>

<Note>
  Most skills do not need the `compatibility` field.
</Note>

#### `metadata` field

The optional `metadata` field:

* A map from string keys to string values
* Clients can use this to store additional properties not defined by the Agent Skills spec
* We recommend making your key names reasonably unique to avoid accidental conflicts

<Card>
  **Example:**

```yaml
metadata:
  author: example-org
  version: "1.0"
```
</Card>

#### `allowed-tools` field

The optional `allowed-tools` field:

* A space-delimited list of tools that are pre-approved to run
* Experimental. Support for this field may vary between agent implementations

<Card>
  **Example:**

```yaml
allowed-tools: Bash(git:*) Bash(jq:*) Read
```
</Card>

### Body content

The Markdown body after the frontmatter contains the skill instructions. There are no format restrictions. Write whatever helps agents perform the task effectively.

Recommended sections:

* Step-by-step instructions
* Examples of inputs and outputs
* Common edge cases

Note that the agent will load this entire file once it's decided to activate a skill. Consider splitting longer `SKILL.md` content into referenced files.

## Optional directories

### `scripts/`

Contains executable code that agents can run. Scripts should:

* Be self-contained or clearly document dependencies
* Include helpful error messages
* Handle edge cases gracefully

Supported languages depend on the agent implementation. Common options include Python, Bash, and JavaScript.

### `references/`

Contains additional documentation that agents can read when needed:

* `REFERENCE.md` - Detailed technical reference
* `FORMS.md` - Form templates or structured data formats
* Domain-specific files (`finance.md`, `legal.md`, etc.)

Keep individual [reference files](#file-references) focused. Agents load these on demand, so smaller files mean less use of context.

### `assets/`

Contains static resources:

* Templates (document templates, configuration templates)
* Images (diagrams, examples)
* Data files (lookup tables, schemas)

## Progressive disclosure

Skills should be structured for efficient use of context:

1. **Metadata** (~100 tokens): The `name` and `description` fields are loaded at startup for all skills
2. **Instructions** (< 5000 tokens recommended): The full `SKILL.md` body is loaded when the skill is activated
3. **Resources** (as needed): Files (e.g. those in `scripts/`, `references/`, or `assets/`) are loaded only when required

Keep your main `SKILL.md` under 500 lines. Move detailed reference material to separate files.
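As a rough illustration of step 1, a client can read just the frontmatter to build its startup index without ever loading the body. This sketch assumes flat `key: value` frontmatter; nested keys such as `metadata` would need a real YAML parser:

```python
from pathlib import Path

def skill_metadata(skill_md: Path) -> dict:
    """Return top-level frontmatter fields without loading the Markdown body."""
    lines = skill_md.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0] != "---":
        raise ValueError("SKILL.md must start with YAML frontmatter")
    meta = {}
    for line in lines[1:]:
        if line == "---":
            break  # end of frontmatter; the body is never parsed
        key, _, value = line.partition(":")
        if value:
            meta[key.strip()] = value.strip()
    return meta
```

Run at startup over every installed skill, this yields only the small `name`/`description` records needed for skill selection.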

## File references

When referencing other files in your skill, use relative paths from the skill root:

```markdown SKILL.md
See [the reference guide](references/REFERENCE.md) for details.

Run the extraction script:
scripts/extract.py
```

Keep file references one level deep from `SKILL.md`. Avoid deeply nested reference chains.

## Validation

Use the [skills-ref](https://github.com/agentskills/agentskills/tree/main/skills-ref) reference library to validate your skills:

```bash
skills-ref validate ./my-skill
```

This checks that your `SKILL.md` frontmatter is valid and follows all naming conventions.

Built with [Mintlify](https://mintlify.com).
> ## Documentation Index
> Fetch the complete documentation index at: https://agentskills.io/llms.txt
> Use this file to discover all available pages before exploring further.

# Evaluating skill output quality

> How to test whether your skill produces good outputs using eval-driven iteration.

You wrote a skill, tried it on a prompt, and it seemed to work. But does it work reliably — across varied prompts, in edge cases, better than no skill at all? Running structured evaluations (evals) answers these questions and gives you a feedback loop for improving the skill systematically.

## Designing test cases

A test case has three parts:

* **Prompt**: a realistic user message — the kind of thing someone would actually type.
* **Expected output**: a human-readable description of what success looks like.
* **Input files** (optional): files the skill needs to work with.

Store test cases in `evals/evals.json` inside your skill directory:

```json evals/evals.json
{
  "skill_name": "csv-analyzer",
  "evals": [
    {
      "id": 1,
      "prompt": "I have a CSV of monthly sales data in data/sales_2025.csv. Can you find the top 3 months by revenue and make a bar chart?",
      "expected_output": "A bar chart image showing the top 3 months by revenue, with labeled axes and values.",
      "files": ["evals/files/sales_2025.csv"]
    },
    {
      "id": 2,
      "prompt": "there's a csv in my downloads called customers.csv, some rows have missing emails — can you clean it up and tell me how many were missing?",
      "expected_output": "A cleaned CSV with missing emails handled, plus a count of how many were missing.",
      "files": ["evals/files/customers.csv"]
    }
  ]
}
```

**Tips for writing good test prompts:**

* **Start with 2-3 test cases.** Don't over-invest before you've seen your first round of results. You can expand the set later.
* **Vary the prompts.** Use different phrasings, levels of detail, and formality. Some prompts should be casual ("hey can you clean up this csv"), others precise ("Parse the CSV at data/input.csv, drop rows where column B is null, and write the result to data/output.csv").
* **Cover edge cases.** Include at least one prompt that tests a boundary condition — a malformed input, an unusual request, or a case where the skill's instructions might be ambiguous.
* **Use realistic context.** Real users mention file paths, column names, and personal context. Prompts like "process this data" are too vague to test anything useful.

Don't worry about defining specific pass/fail checks yet — just the prompts and expected outputs. You'll add detailed checks (called assertions) after you see what the first run produces.
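Once `evals.json` exists, a small helper (hypothetical, not part of any official tooling) can load it and catch incomplete test cases before you spend a run on them:

```python
import json
from pathlib import Path

def load_evals(skill_dir: str) -> list[dict]:
    """Load evals/evals.json and check each case has a prompt and expected output."""
    data = json.loads((Path(skill_dir) / "evals" / "evals.json").read_text())
    for case in data["evals"]:
        # "files" is optional; the other two parts are not
        assert case.get("prompt"), f"eval {case['id']}: missing prompt"
        assert case.get("expected_output"), f"eval {case['id']}: missing expected_output"
    return data["evals"]
```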

## Running evals

The core pattern is to run each test case twice: once **with the skill** and once **without it** (or with a previous version). This gives you a baseline to compare against.

### Workspace structure

Organize eval results in a workspace directory alongside your skill directory. Each pass through the full eval loop gets its own `iteration-N/` directory. Within that, each test case gets an eval directory with `with_skill/` and `without_skill/` subdirectories:

```
csv-analyzer/
├── SKILL.md
└── evals/
    └── evals.json
csv-analyzer-workspace/
└── iteration-1/
    ├── eval-top-months-chart/
    │   ├── with_skill/
    │   │   ├── outputs/        # Files produced by the run
    │   │   ├── timing.json     # Tokens and duration
    │   │   └── grading.json    # Assertion results
    │   └── without_skill/
    │       ├── outputs/
    │       ├── timing.json
    │       └── grading.json
    ├── eval-clean-missing-emails/
    │   ├── with_skill/
    │   │   ├── outputs/
    │   │   ├── timing.json
    │   │   └── grading.json
    │   └── without_skill/
    │       ├── outputs/
    │       ├── timing.json
    │       └── grading.json
    └── benchmark.json          # Aggregated statistics
```

The main file you author by hand is `evals/evals.json`. The other JSON files (`grading.json`, `timing.json`, `benchmark.json`) are produced during the eval process — by the agent, by scripts, or by you.
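Creating this layout by hand gets tedious as the test set grows. A throwaway scaffolding helper (directory names follow the tree above, which is a convention rather than a required format):

```python
from pathlib import Path

def scaffold(workspace: str, iteration: int, eval_slugs: list[str],
             baseline: str = "without_skill") -> Path:
    """Create iteration-N/eval-<slug>/{with_skill,<baseline>}/outputs/ dirs."""
    root = Path(workspace) / f"iteration-{iteration}"
    for slug in eval_slugs:
        for config in ("with_skill", baseline):
            (root / f"eval-{slug}" / config / "outputs").mkdir(
                parents=True, exist_ok=True)
    return root

scaffold("csv-analyzer-workspace", 1,
         ["top-months-chart", "clean-missing-emails"])
```

Pass `baseline="old_skill"` when comparing against a snapshot of a previous skill version instead of a no-skill run.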

### Spawning runs

Each eval run should start with a clean context — no leftover state from previous runs or from the skill development process. This ensures the agent follows only what the `SKILL.md` tells it. In environments that support subagents (Claude Code, for example), this isolation comes naturally: each child task starts fresh. Without subagents, use a separate session for each run.

For each run, provide:

* The skill path (or no skill for the baseline)
* The test prompt
* Any input files
* The output directory

Here's an example of the instructions you'd give the agent for a single with-skill run:

```
Execute this task:
- Skill path: /path/to/csv-analyzer
- Task: I have a CSV of monthly sales data in data/sales_2025.csv.
  Can you find the top 3 months by revenue and make a bar chart?
- Input files: evals/files/sales_2025.csv
- Save outputs to: csv-analyzer-workspace/iteration-1/eval-top-months-chart/with_skill/outputs/
```

For the baseline, use the same prompt but without the skill path, saving to `without_skill/outputs/`.

When improving an existing skill, use the previous version as your baseline. Snapshot it before editing (`cp -r <skill-path> <workspace>/skill-snapshot/`), point the baseline run at the snapshot, and save to `old_skill/outputs/` instead of `without_skill/`.

### Capturing timing data

Timing data lets you compare how much time and tokens the skill costs relative to the baseline — a skill that dramatically improves output quality but triples token usage is a different trade-off than one that's both better and cheaper. When each run completes, record the token count and duration:

```json timing.json
{
  "total_tokens": 84852,
  "duration_ms": 23332
}
```

<Tip>
  In Claude Code, when a subagent task finishes, the [task completion notification](https://platform.claude.com/docs/en/agent-sdk/typescript#sdk-task-notification-message) includes `total_tokens` and `duration_ms`. Save these values immediately — they aren't persisted anywhere else.
</Tip>

## Writing assertions

Assertions are verifiable statements about what the output should contain or achieve. Add them after you see your first round of outputs — you often don't know what "good" looks like until the skill has run.

Good assertions:

* `"The output file is valid JSON"` — programmatically verifiable.
* `"The bar chart has labeled axes"` — specific and observable.
* `"The report includes at least 3 recommendations"` — countable.

Weak assertions:

* `"The output is good"` — too vague to grade.
* `"The output uses exactly the phrase 'Total Revenue: $X'"` — too brittle; correct output with different wording would fail.

Not everything needs an assertion. Some qualities — writing style, visual design, whether the output "feels right" — are hard to decompose into pass/fail checks. These are better caught during [human review](#reviewing-results-with-a-human). Reserve assertions for things that can be checked objectively.

Add assertions to each test case in `evals/evals.json`:

```json evals/evals.json
{
  "skill_name": "csv-analyzer",
  "evals": [
    {
      "id": 1,
      "prompt": "I have a CSV of monthly sales data in data/sales_2025.csv. Can you find the top 3 months by revenue and make a bar chart?",
      "expected_output": "A bar chart image showing the top 3 months by revenue, with labeled axes and values.",
      "files": ["evals/files/sales_2025.csv"],
      "assertions": [
        "The output includes a bar chart image file",
        "The chart shows exactly 3 months",
        "Both axes are labeled",
        "The chart title or caption mentions revenue"
      ]
    }
  ]
}
```

## Grading outputs

Grading means evaluating each assertion against the actual outputs and recording **PASS** or **FAIL** with specific evidence. The evidence should quote or reference the output, not just state an opinion.

The simplest approach is to give the outputs and assertions to an LLM and ask it to evaluate each one. For assertions that can be checked by code (valid JSON, correct row count, file exists with expected dimensions), use a verification script — scripts are more reliable than LLM judgment for mechanical checks and reusable across iterations.

```json grading.json
{
  "assertion_results": [
    {
      "text": "The output includes a bar chart image file",
      "passed": true,
      "evidence": "Found chart.png (45KB) in outputs directory"
    },
    {
      "text": "The chart shows exactly 3 months",
      "passed": true,
      "evidence": "Chart displays bars for March, July, and November"
    },
    {
      "text": "Both axes are labeled",
      "passed": false,
      "evidence": "Y-axis is labeled 'Revenue ($)' but X-axis has no label"
    },
    {
      "text": "The chart title or caption mentions revenue",
      "passed": true,
      "evidence": "Chart title reads 'Top 3 Months by Revenue'"
    }
  ],
  "summary": {
    "passed": 3,
    "failed": 1,
    "total": 4,
    "pass_rate": 0.75
  }
}
```
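The `summary` block is mechanical, so it is better computed than written by hand. A sketch that derives it from the `assertion_results` list:

```python
def summarize(assertion_results: list[dict]) -> dict:
    """Build the summary block of grading.json from graded assertions."""
    passed = sum(1 for a in assertion_results if a["passed"])
    total = len(assertion_results)
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        "pass_rate": passed / total if total else 0.0,
    }

print(summarize([{"passed": True}, {"passed": True},
                 {"passed": False}, {"passed": True}]))
# → {'passed': 3, 'failed': 1, 'total': 4, 'pass_rate': 0.75}
```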
### Grading principles
|
||||
|
||||
* **Require concrete evidence for a PASS.** Don't give the benefit of the doubt. If an assertion says "includes a summary" and the output has a section titled "Summary" with one vague sentence, that's a FAIL — the label is there but the substance isn't.
|
||||
* **Review the assertions themselves, not just the results.** While grading, notice when assertions are too easy (always pass regardless of skill quality), too hard (always fail even when the output is good), or unverifiable (can't be checked from the output alone). Fix these for the next iteration.
|
||||
|
||||
<Tip>
|
||||
For comparing two skill versions, try **blind comparison**: present both outputs to an LLM judge without revealing which came from which version. The judge scores holistic qualities — organization, formatting, usability, polish — on its own rubric, free from bias about which version "should" be better. This complements assertion grading: two outputs might both pass all assertions but differ significantly in overall quality.
|
||||
</Tip>
|
||||
|
||||
## Aggregating results

Once every run in the iteration is graded, compute summary statistics per configuration and save them to `benchmark.json` alongside the eval directories (e.g., `csv-analyzer-workspace/iteration-1/benchmark.json`):

```json benchmark.json
{
  "run_summary": {
    "with_skill": {
      "pass_rate": { "mean": 0.83, "stddev": 0.06 },
      "time_seconds": { "mean": 45.0, "stddev": 12.0 },
      "tokens": { "mean": 3800, "stddev": 400 }
    },
    "without_skill": {
      "pass_rate": { "mean": 0.33, "stddev": 0.10 },
      "time_seconds": { "mean": 32.0, "stddev": 8.0 },
      "tokens": { "mean": 2100, "stddev": 300 }
    },
    "delta": {
      "pass_rate": 0.50,
      "time_seconds": 13.0,
      "tokens": 1700
    }
  }
}
```

The `delta` tells you what the skill costs (more time, more tokens) and what it buys (higher pass rate). A skill that adds 13 seconds but improves pass rate by 50 percentage points is probably worth it. A skill that doubles token usage for a 2-point improvement might not be.
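
The aggregation itself is plain descriptive statistics. A sketch using Python's standard library, with made-up per-run values; the helper name and input shape are assumptions, not a prescribed format:

```python
import statistics

def summarize(runs):
    """Aggregate per-run metrics into mean/stddev for each metric."""
    out = {}
    for metric in ("pass_rate", "time_seconds", "tokens"):
        values = [r[metric] for r in runs]
        out[metric] = {
            "mean": round(statistics.mean(values), 4),
            "stddev": round(statistics.stdev(values), 4) if len(values) > 1 else 0.0,
        }
    return out

with_skill = summarize([
    {"pass_rate": 0.75, "time_seconds": 41.0, "tokens": 3500},
    {"pass_rate": 0.90, "time_seconds": 49.0, "tokens": 4100},
])
without_skill = summarize([
    {"pass_rate": 0.25, "time_seconds": 28.0, "tokens": 1900},
    {"pass_rate": 0.40, "time_seconds": 36.0, "tokens": 2300},
])
delta = {m: round(with_skill[m]["mean"] - without_skill[m]["mean"], 4)
         for m in with_skill}
# delta["pass_rate"] → 0.5, delta["time_seconds"] → 13.0, delta["tokens"] → 1700.0
```
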
<Note>
Standard deviation (`stddev`) is only meaningful with multiple runs per eval. In early iterations with just 2-3 test cases and single runs, focus on the raw pass counts and the delta — the statistical measures become useful as you expand the test set and run each eval multiple times.
</Note>

## Analyzing patterns

Aggregate statistics can hide important patterns. After computing the benchmarks:

* **Remove or replace assertions that always pass in both configurations.** These don't tell you anything useful — the model handles them fine without the skill. They inflate the with-skill pass rate without reflecting actual skill value.
* **Investigate assertions that always fail in both configurations.** Either the assertion is broken (asking for something the model can't do), the test case is too hard, or the assertion is checking for the wrong thing. Fix these before the next iteration.
* **Study assertions that pass with the skill but fail without.** This is where the skill is clearly adding value. Understand *why* — which instructions or scripts made the difference?
* **Tighten instructions when results are inconsistent across runs.** If the same eval passes sometimes and fails others (reflected as high `stddev` in the benchmark), the eval may be flaky (sensitive to model randomness), or the skill's instructions may be ambiguous enough that the model interprets them differently each time. Add examples or more specific guidance to reduce ambiguity.
* **Check time and token outliers.** If one eval takes 3x longer than the others, read its execution transcript (the full log of what the model did during the run) to find the bottleneck.

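
One way to surface these patterns is to bucket each assertion by its behavior across the two configurations. A sketch, assuming grades have been collected as per-assertion lists of booleans (the input shape is an assumption, not a prescribed format):

```python
def classify_assertions(with_skill, without_skill):
    """Bucket assertions by how they behave across the two configurations.

    Both inputs map assertion text -> list of pass/fail booleans across runs.
    """
    buckets = {"always_pass": [], "always_fail": [], "skill_wins": [], "mixed": []}
    for text in with_skill:
        w, wo = with_skill[text], without_skill[text]
        if all(w) and all(wo):
            buckets["always_pass"].append(text)   # candidate for removal
        elif not any(w) and not any(wo):
            buckets["always_fail"].append(text)   # broken, too hard, or wrong check
        elif all(w) and not any(wo):
            buckets["skill_wins"].append(text)    # where the skill adds clear value
        else:
            buckets["mixed"].append(text)         # inconsistent: check for flakiness
    return buckets

buckets = classify_assertions(
    with_skill={"has title": [True, True], "axes labeled": [True, False]},
    without_skill={"has title": [True, True], "axes labeled": [False, False]},
)
# "has title" lands in always_pass; "axes labeled" lands in mixed
```
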
## Reviewing results with a human

Assertion grading and pattern analysis catch a lot, but they only check what you thought to write assertions for. A human reviewer brings a fresh perspective — catching issues you didn't anticipate, noticing when the output is technically correct but misses the point, or spotting problems that are hard to express as pass/fail checks. For each test case, review the actual outputs alongside the grades.

Record specific feedback for each test case and save it in the workspace (e.g., as a `feedback.json` alongside the eval directories):

```json feedback.json
{
  "eval-top-months-chart": "The chart is missing axis labels and the months are in alphabetical order instead of chronological.",
  "eval-clean-missing-emails": ""
}
```

"The chart is missing axis labels" is actionable; "looks bad" is not. Empty feedback means the output looked fine — that test case passed your review. During the [iteration step](#iterating-on-the-skill), focus your improvements on the test cases where you had specific complaints.

## Iterating on the skill

After grading and reviewing, you have three sources of signal:

* **Failed assertions** point to specific gaps — a missing step, an unclear instruction, or a case the skill doesn't handle.
* **Human feedback** points to broader quality issues — the approach was wrong, the output was poorly structured, or the skill produced a technically correct but unhelpful result.
* **Execution transcripts** reveal *why* things went wrong. If the agent ignored an instruction, the instruction may be ambiguous. If the agent spent time on unproductive steps, those instructions may need to be simplified or removed.

The most effective way to turn these signals into skill improvements is to give all three — along with the current `SKILL.md` — to an LLM and ask it to propose changes. The LLM can synthesize patterns across failed assertions, reviewer complaints, and transcript behavior that would be tedious to connect manually. When prompting the LLM, include these guidelines:

* **Generalize from feedback.** The skill will be used across many different prompts, not just the test cases. Fixes should address underlying issues broadly rather than adding narrow patches for specific examples.
* **Keep the skill lean.** Fewer, better instructions often outperform exhaustive rules. If transcripts show wasted work (unnecessary validation, unneeded intermediate outputs), remove those instructions. If pass rates plateau despite adding more rules, the skill may be over-constrained — try removing instructions and see if results hold or improve.
* **Explain the why.** Reasoning-based instructions ("Do X because Y tends to cause Z") work better than rigid directives ("ALWAYS do X, NEVER do Y"). Models follow instructions more reliably when they understand the purpose.
* **Bundle repeated work.** If every test run independently wrote a similar helper script (a chart builder, a data parser), that's a signal to bundle the script into the skill's `scripts/` directory. See [Using scripts](/skill-creation/using-scripts) for how to do this.

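
A sketch of assembling such an improvement prompt from the three signals; the function name, section headings, and example values are illustrative, not a fixed format:

```python
def build_improvement_prompt(skill_md, failed_assertions, feedback, transcript_notes):
    """Combine the three eval signals plus the current SKILL.md into one prompt."""
    sections = [
        "Propose improvements to the skill below. Generalize from the feedback "
        "rather than patching specific examples, keep the skill lean, explain "
        "the why behind each instruction, and bundle repeated helper work into "
        "scripts/.",
        "## Current SKILL.md\n" + skill_md,
        "## Failed assertions\n" + "\n".join(f"- {a}" for a in failed_assertions),
        # Empty feedback means the output passed human review; skip it.
        "## Human feedback\n" + "\n".join(
            f"- {name}: {note}" for name, note in feedback.items() if note
        ),
        "## Transcript observations\n" + "\n".join(f"- {n}" for n in transcript_notes),
    ]
    return "\n\n".join(sections)

prompt = build_improvement_prompt(
    skill_md="# CSV Analyzer\n...",
    failed_assertions=["Both axes are labeled"],
    feedback={
        "eval-top-months-chart": "months sorted alphabetically, not chronologically",
        "eval-clean-missing-emails": "",
    },
    transcript_notes=["agent rewrote the same chart helper in every run"],
)
```
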
### The loop

1. Give the eval signals and current `SKILL.md` to an LLM and ask it to propose improvements.
2. Review and apply the changes.
3. Rerun all test cases in a new `iteration-<N+1>/` directory.
4. Grade and aggregate the new results.
5. Review with a human. Repeat.

Stop when you're satisfied with the results, feedback is consistently empty, or you're no longer seeing meaningful improvement between iterations.
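
The "no meaningful improvement" check can be made explicit. A sketch, with an arbitrary threshold you would tune to your own tolerance:

```python
def should_stop(pass_rates, min_gain=0.02):
    """Stop iterating once the gain between iterations drops below min_gain.

    pass_rates is the mean pass rate from each iteration's benchmark, in order.
    """
    if len(pass_rates) < 2:
        return False  # need at least two iterations to measure a trend
    return pass_rates[-1] - pass_rates[-2] < min_gain

# 0.80 -> 0.81 is below the 2-point threshold, so stop
should_stop([0.60, 0.80, 0.81])
```
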

<Tip>
The [`skill-creator`](https://github.com/anthropics/skills/tree/main/skills/skill-creator) Skill automates much of this workflow — running evals, grading assertions, aggregating benchmarks, and presenting results for human review.
</Tip>