chore: restructure skills repo with new agents and skill bundles

- Add new skills: deep-dive, docs-rag, meta-creator, ppt-maker, sdlc
- Add agent configs: g-assistent, meta-creator, sdlc with prompt files
- Add reference docs for custom agents and skills specification
- Add utility scripts: install-agents.sh, orchestrate.py, puml2svg.sh
- Update README and commit-message skill config
- Remove deprecated skills: codereview, python, testing, typescript
- Add .gitignore
Team
2026-04-18 13:07:46 +08:00
parent 72f16d26b8
commit c0d14c6ac1
74 changed files with 5726 additions and 324 deletions
@@ -0,0 +1,77 @@
# meta-creator
A Kiro agent skill for creating and iteratively improving agent skills and custom agents.
## Architecture
![Architecture](assets/meta-creator-architecture.svg)
## Workflow
![Workflow](assets/meta-creator-workflow.svg)
## What It Does
- Creates new `SKILL.md` files with proper frontmatter and instructions
- Creates `evals/evals.json` with at least 3 eval cases
- Creates or updates Kiro custom agent configs (`.kiro/agents/<name>.json`)
- Runs eval-driven iteration: analyzes failures and improves skills
## When to Use
Trigger phrases: "create a skill", "make a skill", "new skill", "update skill", "improve skill", "create an agent", "new agent", "update agent", "创建skill", "创建技能", "新建skill", "更新skill", "优化skill", "创建agent", "新建agent", "更新agent"
## Workflow Steps
1. **Gather requirements** — what the skill does, example tasks, environment needs
2. **Create `SKILL.md`** — frontmatter (`name`, `description`) + step-by-step instructions
3. **Create `evals/evals.json`** — happy path, variation, and edge case
4. **Iterate** — if eval results are provided, fix instruction gaps and update assertions
5. **Create agent** (optional) — `.kiro/agents/<name>.json` with prompt, tools, and skill references
## Outputs
| File | Description |
|------|-------------|
| `skills/<name>/SKILL.md` | Skill instructions |
| `skills/<name>/evals/evals.json` | Eval cases |
| `.kiro/agents/<name>.json` | Agent config (only if requested) |
| `.kiro/agents/prompts/<name>.md` | Agent prompt file |
## File Structure
```
skills/meta-creator/
├── SKILL.md
├── README.md # this file
├── assets/
│   ├── architecture.puml
│   ├── workflow.puml
│   ├── meta-creator-architecture.svg
│   └── meta-creator-workflow.svg
├── evals/
│ └── evals.json
└── references/
├── skills-Specification.md
├── skills-eval.md
├── custom-agents-configuration-reference.md
└── kiro-cli-chat-configuration.md
```
## Example Prompts
```
Create a skill that generates SQL queries from natural language descriptions.
```
```
Update the commit-message skill to also support Angular commit conventions.
```
```
Create a new agent called "db-helper" that uses the sql-gen skill.
```
## Evals
```bash
python scripts/run_evals.py meta-creator
```
@@ -0,0 +1,232 @@
---
name: meta-creator
description: Creates and iteratively improves agent skills and custom agents. Use when a user wants to create a new skill, update an existing skill, create a new agent, or run eval-driven iteration. Triggers on phrases like "create a skill", "make a skill", "new skill", "update skill", "improve skill", "create an agent", "new agent", "update agent", "创建skill", "创建技能", "新建skill", "更新skill", "优化skill", "创建agent", "新建agent", "更新agent".
metadata:
author: common-skills
version: "1.0"
---
# Meta Creator
Create or update agent skills and custom agents. Skills conform to the [Agent Skills specification](references/skills-Specification.md). Agents conform to the [Kiro custom agent configuration](references/custom-agents-configuration-reference.md). For eval-driven iteration, follow the [eval methodology](references/skills-eval.md). For Kiro CLI configuration scopes, file paths, and conflict resolution rules, refer to the [Kiro CLI Chat configuration](references/kiro-cli-chat-configuration.md).
## References
- [skills-Specification.md](references/skills-Specification.md) — SKILL.md format, frontmatter rules, directory structure
- [skills-eval.md](references/skills-eval.md) — eval design, grading, iteration methodology
- [custom-agents-configuration-reference.md](references/custom-agents-configuration-reference.md) — Kiro agent JSON config fields
- [kiro-cli-chat-configuration.md](references/kiro-cli-chat-configuration.md) — Kiro CLI configuration scopes (global/project/agent), file paths, and conflict resolution priority
## Inputs
The user will provide one of:
- A description of what the new skill should do
- An existing skill directory to update or improve
- Eval results / feedback to incorporate into an existing skill
## Workflow
### 1. Gather Requirements
Ask the user (or infer from context):
- What does the skill do? When should it activate?
- What are 2-3 concrete example tasks it should handle?
- Any environment requirements (tools, packages, network)?
### 2. Create or Update `SKILL.md`
**Frontmatter rules:**
- `name`: lowercase, hyphens only, matches directory name, max 64 chars
- `description`: describes what it does AND when to use it; include trigger phrases; max 1024 chars
- Add `compatibility` only if the skill has real environment requirements
- Add `metadata` (author, version) for team skills
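These frontmatter rules are mechanical enough to check in code. A minimal sketch (the function name and error messages are illustrative, not part of any validator API):

```python
import re

def validate_frontmatter(name: str, description: str, directory: str) -> list:
    """Check SKILL.md frontmatter against the rules above; return a list of problems."""
    problems = []
    if not (1 <= len(name) <= 64):
        problems.append("name must be 1-64 characters")
    # lowercase alphanumerics and single hyphens; no leading/trailing/consecutive hyphens
    if not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", name):
        problems.append("name must be lowercase letters, digits, and single hyphens")
    if name != directory:
        problems.append("name must match the directory name")
    if not (1 <= len(description) <= 1024):
        problems.append("description must be 1-1024 characters")
    return problems
```

An empty list means the frontmatter passes the checks above.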
**Body content:**
- Write clear step-by-step instructions the agent will follow
- Include concrete examples of inputs and expected outputs
- Cover the 2-3 most important edge cases
- Keep under 500 lines; move detailed reference material to `references/`
### 3. Create `evals/evals.json`
Write at least 3 eval cases covering:
- A typical happy-path use case
- A variation with different phrasing or context
- An edge case (unusual input, boundary condition, or ambiguous request)
Each eval case must have:
- `id`: integer
- `prompt`: realistic user message (not "process this data" — use specific context)
- `expected_output`: human-readable description of what success looks like
Add `assertions` after the first eval run reveals what "good" looks like.
Format:
```json
{
"skill_name": "<name>",
"evals": [
{
"id": 1,
"prompt": "...",
"expected_output": "..."
}
]
}
```
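A quick structural check of this format can be sketched in Python (the function name and message strings are illustrative, not part of any Kiro tooling):

```python
import json

def check_evals(text: str) -> list:
    """Validate an evals.json document against the minimum requirements above."""
    problems = []
    data = json.loads(text)
    if not data.get("skill_name"):
        problems.append("missing skill_name")
    evals = data.get("evals", [])
    if len(evals) < 3:
        problems.append("need at least 3 eval cases")
    for case in evals:
        for key in ("id", "prompt", "expected_output"):
            if key not in case:
                problems.append(f"eval {case.get('id', '?')} missing {key}")
        if not isinstance(case.get("id"), int):
            problems.append("id must be an integer")
    return problems
```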
### 4. Create or Update `README.md` and Diagrams
After creating or updating a skill, create (or update) `skills/<name>/README.md` and generate two PlantUML diagrams:
**Architecture diagram** (`assets/architecture.puml`) — static component view:
- Show the skill's files and their roles (SKILL.md, references/, assets/, evals/)
- Show external dependencies (tools, APIs, databases, other files the skill reads/writes)
- Use `package` blocks to group related components; use `component`, `database`, `actor`
**Workflow diagram** (`assets/workflow.puml`) — dynamic sequence view:
- Show the interaction between the user, the skill, and any external systems step by step
- Use `participant` / `actor` and sequence arrows (`->`, `-->`)
- Include branching (`alt`/`opt`) for key decision points
**Convert to SVG:**
```bash
bash scripts/puml2svg.sh <name>
```
This requires Java and Graphviz. The PlantUML jar is resolved automatically from the VS Code extension; override with `PLANTUML_JAR=/path/to/plantuml.jar`.
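For illustration, the invocation the script effectively assembles can be sketched as follows (jar discovery is omitted; the default jar name here is an assumption, and the real script may differ):

```python
import os

def plantuml_command(puml_path: str) -> list:
    """Build the java command line for rendering a .puml file to SVG.
    PLANTUML_JAR overrides jar discovery, mirroring the script's env override."""
    jar = os.environ.get("PLANTUML_JAR", "plantuml.jar")
    return ["java", "-jar", jar, "-tsvg", puml_path]
```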
**README structure:**
```markdown
# <skill-name>
One-line description.
## Architecture
![Architecture](assets/<name>-architecture.svg)
## Workflow
![Workflow](assets/<name>-workflow.svg)
## When to Use
...
## How It Works
...
## File Structure
...
## Evals
\`\`\`bash
python scripts/run_evals.py <name>
\`\`\`
```
### 5. Iterative Improvement (if eval results are provided)
When the user provides eval results, grading output, or human feedback:
1. Identify which assertions failed and why (read execution transcripts if available)
2. Distinguish between:
- **Instruction gaps**: the skill didn't tell the agent to do something it should
- **Ambiguous instructions**: the agent interpreted instructions inconsistently
- **Wrong assertions**: the assertion was too strict, too vague, or checking the wrong thing
3. Propose targeted changes to `SKILL.md`:
- Generalize fixes — don't patch for a single test case
- Remove instructions that caused wasted work
- Add reasoning ("Do X because Y") rather than rigid directives
4. Update `evals/evals.json` to fix broken assertions and add new cases for uncovered scenarios
### 6. Create or Update a Custom Agent (if requested)
When the user wants a new or updated Kiro agent (`.kiro/agents/<name>.json`):
**Required fields:**
- `name`: descriptive, matches the filename (without `.json`)
- `description`: what the agent does and when to use it
- `prompt`: concise system prompt; delegate detail to skill resources where possible
- `tools`: only include tools the agent actually needs
- `allowedTools`: read-only tools are safe to auto-allow; tools that write files or run commands should require confirmation (omit from `allowedTools`)
**Help/greeting response:** The agent's prompt file MUST include instructions to respond to greetings and help requests (e.g., "hi", "hello", "help", "你好", "帮助", "?") with a structured introduction covering:
- What the agent does (one-line summary)
- Key capabilities (bullet list)
- How the agent works step-by-step (execution flow)
- 2-3 concrete example prompts
Example prompt section to include:
```
When the user sends a greeting or help request (e.g., "hi", "hello", "help", "你好", "帮助", "?"), respond with:
---
👋 **<Agent Name>** — <one-line description>
**功能:**
- <capability 1>
- <capability 2>
**执行步骤:**
1. <step 1>
2. <step 2>
3. <step 3>
**使用示例:**
- `<example prompt 1>`
- `<example prompt 2>`
---
```
**Resources:**
- Use `skill://` for skills (lazy-loads, saves context)
- Use `file://` only for small reference docs needed at startup
**Output location:** `.kiro/agents/<name>.json`
**Prompt file:** Extract the prompt to `file://prompts/<name>.md` (relative to `.kiro/agents/`) and reference it as `"prompt": "file://prompts/<name>.md"` to keep the JSON clean.
**Skill install path:** Skills are installed under `.kiro/skills/<name>/`. Reference them as `skill://.kiro/skills/**/SKILL.md` (or a specific path). The `skill://` protocol loads only name/description metadata at startup and fetches full content on demand.
### 7. Post-Creation: Agent Setup (after creating a new skill)
After successfully creating a new skill, ask the user:
> "Do you want a dedicated agent to invoke this skill? If not, it will be available to the `g-assistent` agent by default."
- If **yes**: proceed with Step 6 to create a `.kiro/agents/<name>.json` for the skill.
- If **no**: inform the user that `g-assistent` will route to this skill automatically based on its `description` trigger phrases.
### 8. Post-Agent Checkpoint: Update install-agents.sh
After creating or updating any agent, check whether `scripts/install-agents.sh` needs updating:
1. Read `scripts/install-agents.sh` (if it exists in the repo root).
2. Check if the script handles:
- Any `file://prompts/<name>.md` references — the script must copy prompt files to the target `prompts/` directory
- Any new skill references that require special handling
3. If a gap is found, update `scripts/install-agents.sh` and tell the user what changed.
4. If no changes are needed, briefly confirm: "install-agents.sh is up to date."
## Output
- `skills/<name>/SKILL.md` — the skill file
- `skills/<name>/evals/evals.json` — eval cases
- `skills/<name>/README.md` — documentation with architecture and workflow diagrams
- `skills/<name>/assets/architecture.puml` + `<name>-architecture.svg` — static component diagram
- `skills/<name>/assets/workflow.puml` + `<name>-workflow.svg` — dynamic sequence diagram
- `.kiro/agents/<name>.json` — the agent config (only if user requests a dedicated agent)
- `.kiro/agents/prompts/<name>.md` — the agent prompt file (extracted from JSON)
If creating a new skill, also suggest the directory structure needed (scripts/, references/, assets/) based on the skill's requirements.
## Quality Checklist
Before finishing, verify:
- [ ] `name` matches the directory name exactly
- [ ] `description` includes both what it does and when to activate (trigger phrases)
- [ ] Body instructions are actionable, not vague
- [ ] At least 3 eval cases with varied prompts
- [ ] No eval prompt is too generic (e.g., "test this skill")
- [ ] SKILL.md is under 500 lines
- [ ] `README.md` exists with Architecture and Workflow sections
- [ ] `assets/architecture.puml` and `assets/workflow.puml` exist and SVGs are generated
- [ ] Agent prompt includes a greeting/help response with capabilities and example prompts (for new agents)
@@ -0,0 +1,32 @@
@startuml meta-creator-architecture
skinparam componentStyle rectangle
skinparam defaultFontName Arial
skinparam backgroundColor #FAFAFA
package "meta-creator Skill" {
component "SKILL.md\n(instructions)" as SKILL
component "references/\nskills-Specification.md" as SPEC
component "references/\nskills-eval.md" as EVAL_REF
component "references/\ncustom-agents-configuration-reference.md" as AGENT_REF
component "references/\nkiro-cli-chat-configuration.md" as CLI_REF
component "evals/evals.json" as EVALS
}
package "Outputs" {
component "skills/<name>/SKILL.md" as OUT_SKILL
component "skills/<name>/evals/evals.json" as OUT_EVALS
component ".kiro/agents/<name>.json" as OUT_AGENT #lightblue
component ".kiro/agents/prompts/<name>.md" as OUT_PROMPT #lightblue
}
SKILL --> SPEC : skill format rules
SKILL --> EVAL_REF : eval methodology
SKILL --> AGENT_REF : agent config schema
SKILL --> CLI_REF : config scopes & paths
SKILL --> OUT_SKILL : creates
SKILL --> OUT_EVALS : creates
SKILL --> OUT_AGENT : creates (optional)
SKILL --> OUT_PROMPT : creates (optional)
note right of OUT_AGENT : only if user\nrequests an agent
@enduml
@@ -0,0 +1,30 @@
@startuml meta-creator-workflow
skinparam defaultFontName Arial
skinparam backgroundColor #FAFAFA
actor Developer
participant "meta-creator\nSkill" as SKILL
participant "File System" as FS
Developer -> SKILL : "create a skill: <description>"
SKILL -> Developer : clarifying questions\n(purpose, examples, env)
Developer -> SKILL : answers
SKILL -> FS : write skills/<name>/SKILL.md
SKILL -> FS : write skills/<name>/evals/evals.json
SKILL --> Developer : skill created
opt eval results provided
Developer -> SKILL : eval failures / feedback
SKILL -> SKILL : identify gaps vs wrong assertions
SKILL -> FS : update SKILL.md
SKILL -> FS : update evals.json
SKILL --> Developer : improved skill
end
opt agent requested
SKILL -> FS : write .kiro/agents/<name>.json
SKILL -> FS : write .kiro/agents/prompts/<name>.md
SKILL --> Developer : agent ready
end
@enduml
@@ -0,0 +1,30 @@
{
"skill_name": "meta-creator",
"evals": [
{
"id": 1,
"prompt": "Create a new skill called 'csv-analyzer' that helps agents analyze CSV files: find summary statistics, detect missing values, and produce a short report.",
"expected_output": "A skills/csv-analyzer/SKILL.md with valid frontmatter (name matches directory, description explains what it does and when to use it), clear step-by-step instructions, and a skills/csv-analyzer/evals/evals.json with at least 3 eval cases covering typical use, varied phrasing, and an edge case (e.g. malformed CSV)."
},
{
"id": 2,
"prompt": "我想创建一个skill,帮助agent做代码审查,重点检查安全漏洞,比如SQL注入、XSS、硬编码密钥。",
"expected_output": "A skills/security-review/SKILL.md with Chinese-friendly trigger phrases in the description, security-focused review checklist in the body, and evals/evals.json with at least 3 cases including SQL injection, XSS, and hardcoded secrets scenarios."
},
{
"id": 3,
"prompt": "Here are the eval results for my 'doc-writer' skill. Assertion 'output includes a usage example' failed in 2 out of 3 cases. The agent wrote correct docs but skipped examples. How should I update the skill?",
"expected_output": "A targeted update to the doc-writer SKILL.md adding an explicit instruction to always include a usage example with reasoning. Does NOT add unrelated instructions or over-constrain the skill."
},
{
"id": 4,
"prompt": "Create a Kiro agent called 'db-expert' that specializes in database tasks. It should use a sql-helper skill and only have read access to files by default.",
"expected_output": "A .kiro/agents/db-expert.json with name 'db-expert', a description mentioning database tasks, tools including 'read' but not 'write' in allowedTools, and resources referencing the sql-helper skill via skill:// URI."
},
{
"id": 5,
"prompt": "帮我创建一个agent,名字叫 code-reviewer,调用 codereview 这个skill,只允许读文件,不能写。",
"expected_output": "A .kiro/agents/code-reviewer.json with name 'code-reviewer', read in allowedTools but write absent from allowedTools, and skill://skills/codereview/SKILL.md in resources."
}
]
}
@@ -0,0 +1,480 @@
# Kiro CLI Custom Agents — Configuration Reference
> Source: https://kiro.dev/docs/cli/custom-agents/configuration-reference/
> Updated: 2026-04-14
---
## Quick Start
The recommended way to create an agent config is the `/agent generate` command inside a Kiro session, which generates the configuration with AI assistance.
---
## File Locations
### Local agents (project-level)
```
<project>/.kiro/agents/<name>.json
```
Available only when running Kiro CLI in that directory or one of its subdirectories.
### Global agents (user-level)
```
~/.kiro/agents/<name>.json
```
Available from any directory.
### Precedence
When a local and a global agent share a name, **local takes precedence over global** (with a warning printed).
---
## Field Overview
| Field | Description |
|------|------|
| `name` | Agent name (optional; defaults to the filename) |
| `description` | Agent description |
| `prompt` | System prompt (inline text or a `file://` URI) |
| `mcpServers` | MCP servers the agent can access |
| `tools` | List of available tools |
| `toolAliases` | Tool name remapping |
| `allowedTools` | Tools usable without confirmation |
| `toolsSettings` | Per-tool configuration |
| `resources` | Local resources the agent can access |
| `hooks` | Lifecycle hook commands |
| `includeMcpJson` | Whether to include MCP servers from mcp.json |
| `model` | Model ID to use |
| `keyboardShortcut` | Shortcut for quick agent switching |
| `welcomeMessage` | Message shown when switching to this agent |
---
## Field Details
### `name`
The agent's identifying name, used for display and identification.
```json
{ "name": "aws-expert" }
```
---
### `description`
A human-readable description that helps distinguish between agents.
```json
{ "description": "An agent specialized for AWS infrastructure tasks" }
```
---
### `prompt`
Similar to a system prompt; provides high-level context for the agent. Supports inline text or a `file://` URI.
**Inline:**
```json
{ "prompt": "You are an expert AWS infrastructure specialist" }
```
**File reference:**
```json
{ "prompt": "file://./prompts/aws-expert.md" }
```
**Path resolution rules:**
- Relative paths: resolved against the directory containing the agent config file
  - `"file://./prompt.md"` → same directory
  - `"file://../shared/prompt.md"` → parent directory
- Absolute paths: used as-is
  - `"file:///home/user/prompts/agent.md"`
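These resolution rules can be sketched in Python (the helper name is illustrative; the CLI's actual implementation may differ in details):

```python
import pathlib

def resolve_prompt_uri(uri: str, agent_config_dir: str) -> pathlib.Path:
    """Resolve a file:// prompt URI relative to the agent config directory."""
    path = uri.removeprefix("file://")
    if path.startswith("/"):
        return pathlib.Path(path)  # absolute: used as-is
    # relative: resolved against the directory containing the agent JSON
    return (pathlib.Path(agent_config_dir) / path).resolve()
```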
---
### `mcpServers`
Defines the MCP servers the agent can access.
```json
{
"mcpServers": {
"fetch": {
"command": "fetch3.1",
"args": []
},
"git": {
"command": "git-mcp",
"args": [],
"env": { "GIT_CONFIG_GLOBAL": "/dev/null" },
"timeout": 120000
}
}
}
```
**Fields:**
- `command` (required): the command that launches the MCP server
- `args` (optional): command arguments
- `env` (optional): environment variables
- `timeout` (optional): per-request timeout in milliseconds; defaults to `120000`
- `oauth` (optional): OAuth configuration for HTTP-type MCP servers
  - `redirectUri`: custom redirect URI
  - `oauthScopes`: array of requested OAuth scopes
**OAuth example:**
```json
{
"mcpServers": {
"github": {
"type": "http",
"url": "https://api.github.com/mcp",
"oauth": {
"redirectUri": "127.0.0.1:8080",
"oauthScopes": ["repo", "user"]
}
}
}
}
```
---
### `tools`
The list of tools the agent may use.
```json
{
"tools": ["read", "write", "shell", "@git", "@rust-analyzer/check_code"]
}
```
**Reference forms:**
- Built-in tools: `"read"`, `"shell"`
- All tools from an MCP server: `"@server_name"`
- A specific tool from an MCP server: `"@server_name/tool_name"`
- All tools: `"*"`
- All built-in tools: `"@builtin"`
---
### `toolAliases`
Renames tools to resolve naming conflicts or to create more intuitive names.
```json
{
"toolAliases": {
"@github-mcp/get_issues": "github_issues",
"@gitlab-mcp/get_issues": "gitlab_issues",
"@aws-cloud-formation/deploy_stack_with_parameters": "deploy_cf"
}
}
```
---
### `allowedTools`
Tools the agent may use without user confirmation. Supports exact matches and wildcards.
```json
{
"allowedTools": [
"read",
"@git/git_status",
"@server/read_*",
"@fetch"
]
}
```
**Matching:**
| Pattern | Meaning |
|------|------|
| `"read"` | Exact match for a built-in tool |
| `"@server_name/tool_name"` | Exact match for an MCP tool |
| `"@server_name"` | All tools from that server |
| `"@server/read_*"` | Prefix wildcard |
| `"@server/*_get"` | Suffix wildcard |
| `"@git-*/*"` | Server-name wildcard |
| `"?ead"` | `?` matches a single character |
> **Note:** `allowedTools` does not support `"*"` as a catch-all for every tool.
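The matching rules map closely onto shell-style globbing. A sketch using Python's `fnmatch`, assuming its semantics approximate the CLI's matcher:

```python
from fnmatch import fnmatchcase

def is_allowed(tool: str, allowed: list) -> bool:
    """Return True if `tool` matches any allowedTools pattern."""
    for pattern in allowed:
        if pattern == "*":
            continue  # a bare "*" is not supported, per the note above
        if pattern.startswith("@") and "/" not in pattern and tool.startswith(pattern + "/"):
            return True  # "@server_name" covers every tool on that server
        if fnmatchcase(tool, pattern):
            return True
    return False
```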
---
### `toolsSettings`
Dedicated configuration for specific tools.
```json
{
"toolsSettings": {
"write": {
"allowedPaths": ["src/**", "tests/**"]
},
"shell": {
"allowedCommands": ["git status", "git fetch"],
"deniedCommands": ["git commit .*", "git push .*"],
"autoAllowReadonly": true
},
"@git/git_status": {
"git_user": "$GIT_USER"
}
}
}
```
---
### `resources`
Local resources the agent can access. Three types are supported.
#### File resources (`file://`)
Loaded directly into context at startup.
```json
{
"resources": [
"file://README.md",
"file://docs/**/*.md"
]
}
```
#### Skill resources (`skill://`)
Only metadata (name/description) is loaded at startup; full content is loaded on demand, keeping context lean.
Skill files must begin with YAML frontmatter:
```markdown
---
name: dynamodb-data-modeling
description: Guide for DynamoDB data modeling best practices.
---
# DynamoDB Data Modeling
...
```
```json
{
"resources": [
"skill://.kiro/skills/**/SKILL.md"
]
}
```
#### Knowledge base resources (`knowledgeBase`)
Supports indexed retrieval over large document sets.
```json
{
"resources": [
{
"type": "knowledgeBase",
"source": "file://./docs",
"name": "ProjectDocs",
"description": "Project documentation and guides",
"indexType": "best",
"autoUpdate": true
}
]
}
```
| Field | Required | Description |
|------|------|------|
| `type` | Yes | Must be `"knowledgeBase"` |
| `source` | Yes | Path to index, with a `file://` prefix |
| `name` | Yes | Display name |
| `description` | No | Description of the content |
| `indexType` | No | `"best"` (default, higher quality) or `"fast"` |
| `autoUpdate` | No | Re-index when the agent starts; defaults to `false` |
---
### `hooks`
Executes commands at specific points in the agent lifecycle.
```json
{
"hooks": {
"agentSpawn": [
{ "command": "git status" }
],
"userPromptSubmit": [
{ "command": "ls -la" }
],
"preToolUse": [
{
"matcher": "execute_bash",
"command": "{ echo \"$(date) - Bash:\"; cat; } >> /tmp/audit.log"
}
],
"postToolUse": [
{
"matcher": "fs_write",
"command": "cargo fmt --all"
}
],
"stop": [
{ "command": "npm test" }
]
}
}
```
**Trigger points:**
| Hook | Fires |
|------|----------|
| `agentSpawn` | When the agent initializes |
| `userPromptSubmit` | When the user submits a message |
| `preToolUse` | Before a tool runs (can block execution) |
| `postToolUse` | After a tool runs |
| `stop` | When the assistant finishes its response |
Each hook entry:
- `command` (required): the command to execute
- `matcher` (optional): tool-name pattern for `preToolUse`/`postToolUse`, using internal tool names (e.g. `fs_read`, `fs_write`, `execute_bash`, `use_aws`)
---
### `includeMcpJson`
Whether to include MCP servers defined in `~/.kiro/settings/mcp.json` (global) and `<cwd>/.kiro/settings/mcp.json` (workspace).
```json
{ "includeMcpJson": true }
```
---
### `model`
The model ID this agent uses. Falls back to the default model when unspecified or unavailable.
```json
{ "model": "claude-sonnet-4" }
```
Run the `/model` command to list the available models.
---
### `keyboardShortcut`
A keyboard shortcut for switching quickly to this agent.
```json
{ "keyboardShortcut": "ctrl+a" }
```
**Format:** `[modifier+]key`
**Modifiers:** `ctrl`, `shift`
**Keys:** `a-z`, `0-9`
- Not currently on this agent: switches to it
- Already on this agent: switches back to the previous agent
- If multiple agents claim the same shortcut, the shortcut is disabled and a warning is printed
---
### `welcomeMessage`
The welcome message shown when switching to this agent.
```json
{ "welcomeMessage": "What would you like to build today?" }
```
---
## Full Example
```json
{
"name": "aws-rust-agent",
"description": "Specialized agent for AWS and Rust development",
"prompt": "file://./prompts/aws-rust-expert.md",
"mcpServers": {
"fetch": { "command": "fetch-server", "args": [] },
"git": { "command": "git-mcp", "args": [] }
},
"tools": ["read", "write", "shell", "aws", "@git", "@fetch/fetch_url"],
"toolAliases": {
"@git/git_status": "status",
"@fetch/fetch_url": "get"
},
"allowedTools": ["read", "@git/git_status"],
"toolsSettings": {
"write": { "allowedPaths": ["src/**", "tests/**", "Cargo.toml"] },
"aws": { "allowedServices": ["s3", "lambda"], "autoAllowReadonly": true }
},
"resources": [
"file://README.md",
"file://docs/**/*.md"
],
"hooks": {
"agentSpawn": [{ "command": "git status" }],
"postToolUse": [{ "matcher": "fs_write", "command": "cargo fmt --all" }]
},
"model": "claude-sonnet-4",
"keyboardShortcut": "ctrl+shift+r",
"welcomeMessage": "Ready to help with AWS and Rust development!"
}
```
---
## Best Practices
### Local vs Global Agents
| Local agents | Global agents |
|-----------|-----------|
| Project-specific configuration | General-purpose agents across projects |
| Need access to project files/tools | Personal productivity tools |
| Shared with the team via version control | Everyday tools and workflows |
### Security
- Review `allowedTools` carefully; prefer exact matches over wildcards
- Configure `toolsSettings` for sensitive operations (e.g. restrict `allowedPaths`)
- With write tools enabled (`write`, `shell`), the agent has the same filesystem permissions as the current user, including everything under `~/.kiro`
- Use `preToolUse` hooks to audit or block sensitive operations
- Test thoroughly in a safe environment before sharing an agent
### Organization
- Use descriptive names
- Explain the purpose in `description`
- Maintain prompt files separately
- Check local agents into version control with the project
---
## Related Documentation
- [Creating custom agents](https://kiro.dev/docs/cli/custom-agents/creating/)
- [Built-in tools reference](https://kiro.dev/docs/cli/reference/built-in-tools/)
- [Hooks documentation](https://kiro.dev/docs/cli/hooks)
- [Agent examples](https://kiro.dev/docs/cli/custom-agents/examples/)
@@ -0,0 +1,54 @@
# Kiro CLI Chat — Configuration Reference
> Source: https://kiro.dev/docs/cli/chat/configuration/
> Page updated: December 10, 2025
---
## Configuration File Paths
Kiro CLI configuration can be set at three scopes:
1. **Global** — applies across all projects: `~/.kiro/`
2. **Project** — specific to a project: `<project-root>/.kiro/`
3. **Agent** — defined in the agent config file: `<user-home | project-root>/.kiro/agents/`
| Configuration | Global Scope | Project Scope |
|---|---|---|
| MCP servers | `~/.kiro/settings/mcp.json` | `.kiro/settings/mcp.json` |
| Prompts | `~/.kiro/prompts` | `.kiro/prompts` |
| Custom agents | `~/.kiro/agents` | `.kiro/agents` |
| Steering | `~/.kiro/steering` | `.kiro/steering` |
| Settings | `~/.kiro/settings/cli.json` | *(N/A)* |
---
## What Can Be Configured at Each Scope
| Configuration | User Scope | Project Scope | Agent Scope |
|---|---|---|---|
| MCP servers | Yes | Yes | Yes |
| Prompts | Yes | Yes | No |
| Custom agents | Yes | Yes | N/A |
| Steering | Yes | Yes | Yes |
| Settings | Yes | N/A | N/A |
---
## Resolving Configuration Conflicts
Conflicts are resolved by selecting the configuration closest to where you are interacting with Kiro CLI.
- If MCP config exists in both global and project `mcp.json`, the project-level config wins when working in that project folder.
- If a custom agent is defined at both global and project scope, the project-level configuration takes precedence.
Priority order:
| Configuration | Priority |
|---|---|
| MCP servers | Agent > Project > Global |
| Prompts | Project > Global |
| Custom agents | Project > Global |
| Steering | Project > Global |
> **Note:** MCP servers can be configured in three scopes and are handled differently due to the `includeMcpJson` agent setting. See [MCP server loading priority](https://kiro.dev/docs/cli/mcp/#mcp-server-loading-priority).
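The priority table can be expressed as a first-match lookup. A sketch (the scope and kind names are illustrative):

```python
# Priority order per configuration kind, highest first, mirroring the table above.
PRIORITY = {
    "mcp_servers":   ["agent", "project", "global"],
    "prompts":       ["project", "global"],
    "custom_agents": ["project", "global"],
    "steering":      ["project", "global"],
}

def resolve(kind: str, configs: dict):
    """Pick the winning config: the first scope in priority order that defines one."""
    for scope in PRIORITY[kind]:
        if scope in configs:
            return configs[scope]
    return None
```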
@@ -0,0 +1,275 @@
> ## Documentation Index
> Fetch the complete documentation index at: https://agentskills.io/llms.txt
> Use this file to discover all available pages before exploring further.
# Specification
> The complete format specification for Agent Skills.
## Directory structure
A skill is a directory containing, at minimum, a `SKILL.md` file:
```
skill-name/
├── SKILL.md # Required: metadata + instructions
├── scripts/ # Optional: executable code
├── references/ # Optional: documentation
├── assets/ # Optional: templates, resources
└── ... # Any additional files or directories
```
## `SKILL.md` format
The `SKILL.md` file must contain YAML frontmatter followed by Markdown content.
### Frontmatter
| Field | Required | Constraints |
| --------------- | -------- | ----------------------------------------------------------------------------------------------------------------- |
| `name` | Yes | Max 64 characters. Lowercase letters, numbers, and hyphens only. Must not start or end with a hyphen. |
| `description` | Yes | Max 1024 characters. Non-empty. Describes what the skill does and when to use it. |
| `license` | No | License name or reference to a bundled license file. |
| `compatibility` | No | Max 500 characters. Indicates environment requirements (intended product, system packages, network access, etc.). |
| `metadata` | No | Arbitrary key-value mapping for additional metadata. |
| `allowed-tools` | No | Space-delimited list of pre-approved tools the skill may use. (Experimental) |
<Card>
**Minimal example:**
```markdown SKILL.md theme={null}
---
name: skill-name
description: A description of what this skill does and when to use it.
---
```
**Example with optional fields:**
```markdown SKILL.md theme={null}
---
name: pdf-processing
description: Extract PDF text, fill forms, merge files. Use when handling PDFs.
license: Apache-2.0
metadata:
author: example-org
version: "1.0"
---
```
</Card>
#### `name` field
The required `name` field:
* Must be 1-64 characters
* May only contain unicode lowercase alphanumeric characters (`a-z`) and hyphens (`-`)
* Must not start or end with a hyphen (`-`)
* Must not contain consecutive hyphens (`--`)
* Must match the parent directory name
<Card>
**Valid examples:**
```yaml theme={null}
name: pdf-processing
```
```yaml theme={null}
name: data-analysis
```
```yaml theme={null}
name: code-review
```
**Invalid examples:**
```yaml theme={null}
name: PDF-Processing # uppercase not allowed
```
```yaml theme={null}
name: -pdf # cannot start with hyphen
```
```yaml theme={null}
name: pdf--processing # consecutive hyphens not allowed
```
</Card>
#### `description` field
The required `description` field:
* Must be 1-1024 characters
* Should describe both what the skill does and when to use it
* Should include specific keywords that help agents identify relevant tasks
<Card>
**Good example:**
```yaml theme={null}
description: Extracts text and tables from PDF files, fills PDF forms, and merges multiple PDFs. Use when working with PDF documents or when the user mentions PDFs, forms, or document extraction.
```
**Poor example:**
```yaml theme={null}
description: Helps with PDFs.
```
</Card>
#### `license` field
The optional `license` field:
* Specifies the license applied to the skill
* We recommend keeping it short (either the name of a license or the name of a bundled license file)
<Card>
**Example:**
```yaml theme={null}
license: Proprietary. LICENSE.txt has complete terms
```
</Card>
#### `compatibility` field
The optional `compatibility` field:
* Must be 1-500 characters if provided
* Should only be included if your skill has specific environment requirements
* Can indicate intended product, required system packages, network access needs, etc.
<Card>
**Examples:**
```yaml theme={null}
compatibility: Designed for Claude Code (or similar products)
```
```yaml theme={null}
compatibility: Requires git, docker, jq, and access to the internet
```
```yaml theme={null}
compatibility: Requires Python 3.14+ and uv
```
</Card>
<Note>
Most skills do not need the `compatibility` field.
</Note>
#### `metadata` field
The optional `metadata` field:
* A map from string keys to string values
* Clients can use this to store additional properties not defined by the Agent Skills spec
* We recommend making your key names reasonably unique to avoid accidental conflicts
<Card>
**Example:**
```yaml theme={null}
metadata:
author: example-org
version: "1.0"
```
</Card>
#### `allowed-tools` field
The optional `allowed-tools` field:
* A space-delimited list of tools that are pre-approved to run
* Experimental. Support for this field may vary between agent implementations
<Card>
**Example:**
```yaml theme={null}
allowed-tools: Bash(git:*) Bash(jq:*) Read
```
</Card>
### Body content
The Markdown body after the frontmatter contains the skill instructions. There are no format restrictions. Write whatever helps agents perform the task effectively.
Recommended sections:
* Step-by-step instructions
* Examples of inputs and outputs
* Common edge cases
Note that the agent will load this entire file once it's decided to activate a skill. Consider splitting longer `SKILL.md` content into referenced files.
## Optional directories
### `scripts/`
Contains executable code that agents can run. Scripts should:
* Be self-contained or clearly document dependencies
* Include helpful error messages
* Handle edge cases gracefully
Supported languages depend on the agent implementation. Common options include Python, Bash, and JavaScript.
### `references/`
Contains additional documentation that agents can read when needed:
* `REFERENCE.md` - Detailed technical reference
* `FORMS.md` - Form templates or structured data formats
* Domain-specific files (`finance.md`, `legal.md`, etc.)
Keep individual [reference files](#file-references) focused. Agents load these on demand, so smaller files mean less use of context.
### `assets/`
Contains static resources:
* Templates (document templates, configuration templates)
* Images (diagrams, examples)
* Data files (lookup tables, schemas)
## Progressive disclosure
Skills should be structured for efficient use of context:
1. **Metadata** (\~100 tokens): The `name` and `description` fields are loaded at startup for all skills
2. **Instructions** (\< 5000 tokens recommended): The full `SKILL.md` body is loaded when the skill is activated
3. **Resources** (as needed): Files (e.g. those in `scripts/`, `references/`, or `assets/`) are loaded only when required
Keep your main `SKILL.md` under 500 lines. Move detailed reference material to separate files.
## File references
When referencing other files in your skill, use relative paths from the skill root:
```markdown SKILL.md theme={null}
See [the reference guide](references/REFERENCE.md) for details.
Run the extraction script:
scripts/extract.py
```
Keep file references one level deep from `SKILL.md`. Avoid deeply nested reference chains.
## Validation
Use the [skills-ref](https://github.com/agentskills/agentskills/tree/main/skills-ref) reference library to validate your skills:
```bash theme={null}
skills-ref validate ./my-skill
```
This checks that your `SKILL.md` frontmatter is valid and follows all naming conventions.
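To see the kind of checks involved, here is a naive sketch of frontmatter validation. It assumes `---`-delimited frontmatter, simple `key: value` lines, and a lowercase-hyphen rule for `name` — treat `skills-ref` and the spec as the authoritative rules, not this sketch:

```python theme={null}
import re

def check_frontmatter(text):
    """Return a list of problems with a SKILL.md's frontmatter (sketch)."""
    problems = []
    if not text.startswith("---\n"):
        return ["missing frontmatter delimiter"]
    # Everything between the opening and closing '---' lines.
    body = text[4:].split("\n---", 1)[0]
    fields = dict(line.split(":", 1) for line in body.splitlines() if ":" in line)
    fields = {k.strip(): v.strip() for k, v in fields.items()}
    if "name" not in fields:
        problems.append("missing required field: name")
    elif not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", fields["name"]):
        problems.append(f"name not in lowercase-hyphen form: {fields['name']!r}")
    if not fields.get("description"):
        problems.append("missing required field: description")
    return problems
```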
Built with [Mintlify](https://mintlify.com).

# Evaluating skill output quality
> How to test whether your skill produces good outputs using eval-driven iteration.
You wrote a skill, tried it on a prompt, and it seemed to work. But does it work reliably — across varied prompts, in edge cases, better than no skill at all? Running structured evaluations (evals) answers these questions and gives you a feedback loop for improving the skill systematically.
## Designing test cases
A test case has three parts:
* **Prompt**: a realistic user message — the kind of thing someone would actually type.
* **Expected output**: a human-readable description of what success looks like.
* **Input files** (optional): files the skill needs to work with.
Store test cases in `evals/evals.json` inside your skill directory:
```json evals/evals.json theme={null}
{
"skill_name": "csv-analyzer",
"evals": [
{
"id": 1,
"prompt": "I have a CSV of monthly sales data in data/sales_2025.csv. Can you find the top 3 months by revenue and make a bar chart?",
"expected_output": "A bar chart image showing the top 3 months by revenue, with labeled axes and values.",
"files": ["evals/files/sales_2025.csv"]
},
{
"id": 2,
"prompt": "there's a csv in my downloads called customers.csv, some rows have missing emails — can you clean it up and tell me how many were missing?",
"expected_output": "A cleaned CSV with missing emails handled, plus a count of how many were missing.",
"files": ["evals/files/customers.csv"]
}
]
}
```
**Tips for writing good test prompts:**
* **Start with 2-3 test cases.** Don't over-invest before you've seen your first round of results. You can expand the set later.
* **Vary the prompts.** Use different phrasings, levels of detail, and formality. Some prompts should be casual ("hey can you clean up this csv"), others precise ("Parse the CSV at data/input.csv, drop rows where column B is null, and write the result to data/output.csv").
* **Cover edge cases.** Include at least one prompt that tests a boundary condition — a malformed input, an unusual request, or a case where the skill's instructions might be ambiguous.
* **Use realistic context.** Real users mention file paths, column names, and personal context. Prompts like "process this data" are too vague to test anything useful.
Don't worry about defining specific pass/fail checks yet — just the prompts and expected outputs. You'll add detailed checks (called assertions) after you see what the first run produces.
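Before running anything, it's worth sanity-checking the file mechanically. A sketch (field names follow the example above, not a formal schema):

```python theme={null}
def check_evals(data):
    """Return a list of problems with an evals.json payload (sketch)."""
    problems = []
    if not data.get("skill_name"):
        problems.append("missing skill_name")
    evals = data.get("evals", [])
    if len(evals) < 2:
        problems.append("fewer than 2 eval cases: add a variation or edge case")
    seen = set()
    for e in evals:
        eid = e.get("id")
        if eid in seen:
            problems.append(f"duplicate eval id: {eid}")
        seen.add(eid)
        for field in ("prompt", "expected_output"):
            if not e.get(field):
                problems.append(f"eval {eid}: missing {field}")
    return problems
```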
## Running evals
The core pattern is to run each test case twice: once **with the skill** and once **without it** (or with a previous version). This gives you a baseline to compare against.
### Workspace structure
Organize eval results in a workspace directory alongside your skill directory. Each pass through the full eval loop gets its own `iteration-N/` directory. Within that, each test case gets an eval directory with `with_skill/` and `without_skill/` subdirectories:
```
csv-analyzer/
├── SKILL.md
└── evals/
└── evals.json
csv-analyzer-workspace/
└── iteration-1/
├── eval-top-months-chart/
│ ├── with_skill/
│ │ ├── outputs/ # Files produced by the run
│ │ ├── timing.json # Tokens and duration
│ │ └── grading.json # Assertion results
│ └── without_skill/
│ ├── outputs/
│ ├── timing.json
│ └── grading.json
├── eval-clean-missing-emails/
│ ├── with_skill/
│ │ ├── outputs/
│ │ ├── timing.json
│ │ └── grading.json
│ └── without_skill/
│ ├── outputs/
│ ├── timing.json
│ └── grading.json
└── benchmark.json # Aggregated statistics
```
The main file you author by hand is `evals/evals.json`. The other JSON files (`grading.json`, `timing.json`, `benchmark.json`) are produced during the eval process — by the agent, by scripts, or by you.
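The directory tree above is mechanical to create, so a small helper can scaffold each iteration — a sketch, assuming the layout shown (adjust names if your harness differs):

```python theme={null}
from pathlib import Path

def scaffold_iteration(workspace, iteration, eval_names):
    """Create the iteration-N tree shown above; returns the outputs dirs."""
    root = Path(workspace) / f"iteration-{iteration}"
    created = []
    for name in eval_names:
        for config in ("with_skill", "without_skill"):
            d = root / name / config / "outputs"
            d.mkdir(parents=True, exist_ok=True)
            created.append(d)
    return created
```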
### Spawning runs
Each eval run should start with a clean context — no leftover state from previous runs or from the skill development process. This ensures the agent follows only what the `SKILL.md` tells it. In environments that support subagents (Claude Code, for example), this isolation comes naturally: each child task starts fresh. Without subagents, use a separate session for each run.
For each run, provide:
* The skill path (or no skill for the baseline)
* The test prompt
* Any input files
* The output directory
Here's an example of the instructions you'd give the agent for a single with-skill run:
```
Execute this task:
- Skill path: /path/to/csv-analyzer
- Task: I have a CSV of monthly sales data in data/sales_2025.csv.
Can you find the top 3 months by revenue and make a bar chart?
- Input files: evals/files/sales_2025.csv
- Save outputs to: csv-analyzer-workspace/iteration-1/eval-top-months-chart/with_skill/outputs/
```
For the baseline, use the same prompt but without the skill path, saving to `without_skill/outputs/`.
When improving an existing skill, use the previous version as your baseline. Snapshot it before editing (`cp -r <skill-path> <workspace>/skill-snapshot/`), point the baseline run at the snapshot, and save to `old_skill/outputs/` instead of `without_skill/`.
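Since the with-skill and baseline instructions differ only in the skill path, a small formatter keeps the two variants consistent — a sketch mirroring the instruction block above:

```python theme={null}
def run_instructions(prompt, out_dir, skill_path=None, input_files=None):
    """Format a per-run instruction block; skill_path=None is the baseline."""
    lines = ["Execute this task:"]
    if skill_path:
        lines.append(f"- Skill path: {skill_path}")
    lines.append(f"- Task: {prompt}")
    if input_files:
        lines.append(f"- Input files: {', '.join(input_files)}")
    lines.append(f"- Save outputs to: {out_dir}")
    return "\n".join(lines)
```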
### Capturing timing data
Timing data lets you compare how much time and tokens the skill costs relative to the baseline — a skill that dramatically improves output quality but triples token usage is a different trade-off than one that's both better and cheaper. When each run completes, record the token count and duration:
```json timing.json theme={null}
{
"total_tokens": 84852,
"duration_ms": 23332
}
```
<Tip>
In Claude Code, when a subagent task finishes, the [task completion notification](https://platform.claude.com/docs/en/agent-sdk/typescript#sdk-task-notification-message) includes `total_tokens` and `duration_ms`. Save these values immediately — they aren't persisted anywhere else.
</Tip>
## Writing assertions
Assertions are verifiable statements about what the output should contain or achieve. Add them after you see your first round of outputs — you often don't know what "good" looks like until the skill has run.
Good assertions:
* `"The output file is valid JSON"` — programmatically verifiable.
* `"The bar chart has labeled axes"` — specific and observable.
* `"The report includes at least 3 recommendations"` — countable.
Weak assertions:
* `"The output is good"` — too vague to grade.
* `"The output uses exactly the phrase 'Total Revenue: $X'"` — too brittle; correct output with different wording would fail.
Not everything needs an assertion. Some qualities — writing style, visual design, whether the output "feels right" — are hard to decompose into pass/fail checks. These are better caught during [human review](#reviewing-results-with-a-human). Reserve assertions for things that can be checked objectively.
Add assertions to each test case in `evals/evals.json`:
```json evals/evals.json highlight={9-14} theme={null}
{
"skill_name": "csv-analyzer",
"evals": [
{
"id": 1,
"prompt": "I have a CSV of monthly sales data in data/sales_2025.csv. Can you find the top 3 months by revenue and make a bar chart?",
"expected_output": "A bar chart image showing the top 3 months by revenue, with labeled axes and values.",
"files": ["evals/files/sales_2025.csv"],
"assertions": [
"The output includes a bar chart image file",
"The chart shows exactly 3 months",
"Both axes are labeled",
"The chart title or caption mentions revenue"
]
}
]
}
```
## Grading outputs
Grading means evaluating each assertion against the actual outputs and recording **PASS** or **FAIL** with specific evidence. The evidence should quote or reference the output, not just state an opinion.
The simplest approach is to give the outputs and assertions to an LLM and ask it to evaluate each one. For assertions that can be checked by code (valid JSON, correct row count, file exists with expected dimensions), use a verification script — scripts are more reliable than LLM judgment for mechanical checks and reusable across iterations.
```json grading.json theme={null}
{
"assertion_results": [
{
"text": "The output includes a bar chart image file",
"passed": true,
"evidence": "Found chart.png (45KB) in outputs directory"
},
{
"text": "The chart shows exactly 3 months",
"passed": true,
"evidence": "Chart displays bars for March, July, and November"
},
{
"text": "Both axes are labeled",
"passed": false,
"evidence": "Y-axis is labeled 'Revenue ($)' but X-axis has no label"
},
{
"text": "The chart title or caption mentions revenue",
"passed": true,
"evidence": "Chart title reads 'Top 3 Months by Revenue'"
}
],
"summary": {
"passed": 3,
"failed": 1,
"total": 4,
"pass_rate": 0.75
}
}
```
### Grading principles
* **Require concrete evidence for a PASS.** Don't give the benefit of the doubt. If an assertion says "includes a summary" and the output has a section titled "Summary" with one vague sentence, that's a FAIL — the label is there but the substance isn't.
* **Review the assertions themselves, not just the results.** While grading, notice when assertions are too easy (always pass regardless of skill quality), too hard (always fail even when the output is good), or unverifiable (can't be checked from the output alone). Fix these for the next iteration.
<Tip>
For comparing two skill versions, try **blind comparison**: present both outputs to an LLM judge without revealing which came from which version. The judge scores holistic qualities — organization, formatting, usability, polish — on its own rubric, free from bias about which version "should" be better. This complements assertion grading: two outputs might both pass all assertions but differ significantly in overall quality.
</Tip>
## Aggregating results
Once every run in the iteration is graded, compute summary statistics per configuration and save them to `benchmark.json` alongside the eval directories (e.g., `csv-analyzer-workspace/iteration-1/benchmark.json`):
```json benchmark.json theme={null}
{
"run_summary": {
"with_skill": {
"pass_rate": { "mean": 0.83, "stddev": 0.06 },
"time_seconds": { "mean": 45.0, "stddev": 12.0 },
"tokens": { "mean": 3800, "stddev": 400 }
},
"without_skill": {
"pass_rate": { "mean": 0.33, "stddev": 0.10 },
"time_seconds": { "mean": 32.0, "stddev": 8.0 },
"tokens": { "mean": 2100, "stddev": 300 }
},
"delta": {
"pass_rate": 0.50,
"time_seconds": 13.0,
"tokens": 1700
}
}
}
```
The `delta` tells you what the skill costs (more time, more tokens) and what it buys (higher pass rate). A skill that adds 13 seconds but improves pass rate by 50 percentage points is probably worth it. A skill that doubles token usage for a 2-point improvement might not be.
<Note>
Standard deviation (`stddev`) is only meaningful with multiple runs per eval. In early iterations with just 2-3 test cases and single runs, focus on the raw pass counts and the delta — the statistical measures become useful as you expand the test set and run each eval multiple times.
</Note>
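The per-configuration statistics and the delta reduce to a few lines of stdlib code — a sketch producing the `benchmark.json` blocks shown above:

```python theme={null}
from statistics import mean, pstdev

def summarize(pass_rates, times, tokens):
    """Compute one per-configuration block of benchmark.json."""
    def stats(xs):
        return {"mean": round(mean(xs), 2), "stddev": round(pstdev(xs), 2)}
    return {"pass_rate": stats(pass_rates),
            "time_seconds": stats(times),
            "tokens": stats(tokens)}

def delta(with_skill, without_skill):
    """What the skill buys (pass rate) and costs (time, tokens)."""
    return {k: round(with_skill[k]["mean"] - without_skill[k]["mean"], 2)
            for k in ("pass_rate", "time_seconds", "tokens")}
```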
## Analyzing patterns
Aggregate statistics can hide important patterns. After computing the benchmarks:
* **Remove or replace assertions that always pass in both configurations.** These don't tell you anything useful — the model handles them fine without the skill. They inflate the with-skill pass rate without reflecting actual skill value.
* **Investigate assertions that always fail in both configurations.** Either the assertion is broken (asking for something the model can't do), the test case is too hard, or the assertion is checking for the wrong thing. Fix these before the next iteration.
* **Study assertions that pass with the skill but fail without.** This is where the skill is clearly adding value. Understand *why* — which instructions or scripts made the difference?
* **Tighten instructions when results are inconsistent across runs.** If the same eval passes sometimes and fails others (reflected as high `stddev` in the benchmark), the eval may be flaky (sensitive to model randomness), or the skill's instructions may be ambiguous enough that the model interprets them differently each time. Add examples or more specific guidance to reduce ambiguity.
* **Check time and token outliers.** If one eval takes 3x longer than the others, read its execution transcript (the full log of what the model did during the run) to find the bottleneck.
## Reviewing results with a human
Assertion grading and pattern analysis catch a lot, but they only check what you thought to write assertions for. A human reviewer brings a fresh perspective — catching issues you didn't anticipate, noticing when the output is technically correct but misses the point, or spotting problems that are hard to express as pass/fail checks. For each test case, review the actual outputs alongside the grades.
Record specific feedback for each test case and save it in the workspace (e.g., as a `feedback.json` alongside the eval directories):
```json feedback.json theme={null}
{
"eval-top-months-chart": "The chart is missing axis labels and the months are in alphabetical order instead of chronological.",
"eval-clean-missing-emails": ""
}
```
"The chart is missing axis labels" is actionable; "looks bad" is not. Empty feedback means the output looked fine — that test case passed your review. During the [iteration step](#iterating-on-the-skill), focus your improvements on the test cases where you had specific complaints.
## Iterating on the skill
After grading and reviewing, you have three sources of signal:
* **Failed assertions** point to specific gaps — a missing step, an unclear instruction, or a case the skill doesn't handle.
* **Human feedback** points to broader quality issues — the approach was wrong, the output was poorly structured, or the skill produced a technically correct but unhelpful result.
* **Execution transcripts** reveal *why* things went wrong. If the agent ignored an instruction, the instruction may be ambiguous. If the agent spent time on unproductive steps, those instructions may need to be simplified or removed.
The most effective way to turn these signals into skill improvements is to give all three — along with the current `SKILL.md` — to an LLM and ask it to propose changes. The LLM can synthesize patterns across failed assertions, reviewer complaints, and transcript behavior that would be tedious to connect manually. When prompting the LLM, include these guidelines:
* **Generalize from feedback.** The skill will be used across many different prompts, not just the test cases. Fixes should address underlying issues broadly rather than adding narrow patches for specific examples.
* **Keep the skill lean.** Fewer, better instructions often outperform exhaustive rules. If transcripts show wasted work (unnecessary validation, unneeded intermediate outputs), remove those instructions. If pass rates plateau despite adding more rules, the skill may be over-constrained — try removing instructions and see if results hold or improve.
* **Explain the why.** Reasoning-based instructions ("Do X because Y tends to cause Z") work better than rigid directives ("ALWAYS do X, NEVER do Y"). Models follow instructions more reliably when they understand the purpose.
* **Bundle repeated work.** If every test run independently wrote a similar helper script (a chart builder, a data parser), that's a signal to bundle the script into the skill's `scripts/` directory. See [Using scripts](/skill-creation/using-scripts) for how to do this.
### The loop
1. Give the eval signals and current `SKILL.md` to an LLM and ask it to propose improvements.
2. Review and apply the changes.
3. Rerun all test cases in a new `iteration-<N+1>/` directory.
4. Grade and aggregate the new results.
5. Review with a human. Repeat.
Stop when you're satisfied with the results, feedback is consistently empty, or you're no longer seeing meaningful improvement between iterations.
<Tip>
The [`skill-creator`](https://github.com/anthropics/skills/tree/main/skills/skill-creator) Skill automates much of this workflow — running evals, grading assertions, aggregating benchmarks, and presenting results for human review.
</Tip>