Agent Skills Basics: Package Research Procedures into Reusable, Tested Skills
Turn 'rewrite a prompt every time' into 'call a documented, scripted, tested skill,' so an agent runs research procedures discoverably, composably, and reliably.
Start Here
What you should be able to do
Package a reusable research step into a skill: name, when_to_use, inputs, steps, artifacts.
Understand skill selection: the agent matches the task description to the best skill instead of relying on a hard-coded call.
Compose skills into a gated pipeline, and know that overall reliability is the product of per-step pass rates.
Know skills need tests: cases plus a pass rate, treating skills like code.
Understand progressive disclosure: load a skill body only when triggered to save context.
Learning Path
Learning path: define -> select -> compose -> test -> disclose
Read skills along this path: write the procedure as a contract, select by description, compose a gated pipeline, write tests for each step, then manage context with progressive disclosure.
Step 1
Define
Write the procedure as name + when_to_use + steps + run.
Step 2
Select
Pick the best skill by task-description similarity.
Step 3
Compose
Compose skills by dependency into a pipeline.
Step 4
Test
Write cases per skill and track the pass rate.
Step 5
Disclose
Load only triggered skill bodies to save context.
01 / Intuition
Core Intuition
A prompt is a one-off verbal instruction; a skill is a written, rerunnable, shareable procedure manual — the latter is like a lab SOP.
The core of a skill is when_to_use: a clear trigger lets the agent pick the right skill in the right situation instead of stuffing everything into one big prompt.
Skills compose: clean -> estimate -> table -> robustness, where each step is an independent, testable skill, and the chain is the research pipeline.
The more skills you have, the more context matters: describe all skills with a light index and load the full body only when a skill is triggered.
02 / Math
Model skills as contract, selection, composition, and reliability
01 / Skill contract
A skill is an executable unit with metadata: name, trigger, inputs, steps, and artifacts.
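One way to make the contract concrete is a small record type. This is a minimal sketch, not a standard schema: the `Skill` class, its field types, and the `_run_event_study` placeholder are assumptions introduced for illustration, with field names taken from the prose above.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Skill:
    """Hypothetical skill contract; field names follow the prose above."""
    name: str
    when_to_use: str                  # trigger description used for selection
    inputs: list[str]                 # state keys the skill requires
    steps: list[str]                  # human-readable procedure
    artifacts: list[str]              # state keys the skill promises to produce
    run: Callable[[dict], dict]       # executable part of the skill

def _run_event_study(state: dict) -> dict:
    # Placeholder body: illustrative numbers, not real estimates.
    return {**state, "coefs": [0.01, 0.04, 0.12]}

event_study = Skill(
    name="event_study",
    when_to_use="panel staggered treatment event time relative dummies",
    inputs=["clean"],
    steps=["build relative-time dummies", "estimate", "plot coefficients"],
    artifacts=["coefs"],
    run=_run_event_study,
)

out = event_study.run({"clean": True})
# Check the contract: every promised artifact should be present after the run.
missing = [k for k in event_study.artifacts if k not in out]
print(event_study.name, "missing artifacts:", missing)
```

Keeping `inputs` and `artifacts` as explicit lists lets a runner verify the contract mechanically before and after each call.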
02 / Skill selection
The agent selects the best skill by similarity between the task and each skill description — like tool discovery, but higher level.
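Selection can be written as an argmax. The notation here is an assumption introduced for illustration: $t$ is the task description, $\mathcal{S}$ the set of registered skills, $d_s$ the description of skill $s$, and $\operatorname{sim}$ any similarity score (word overlap, embedding cosine, and so on).

```latex
\[
  s^{*} \;=\; \arg\max_{s \in \mathcal{S}} \; \operatorname{sim}(t,\, d_s)
\]
```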
03 / Compose a pipeline
Skills compose by dependency order; one skill's artifact is the next skill's input.
04 / Pre-gate
Each skill checks preconditions before running; if unmet, it returns or errors instead of proceeding broken.
05 / Pipeline reliability
If skills pass independently, pipeline reliability is the product of per-step pass rates — which is why you test each one.
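Under the independence assumption stated above, the product rule can be written out as follows. The symbols are assumptions for illustration: $p_i$ is the pass rate of step $i$, $n$ the number of steps, and $R$ the pipeline reliability.

```latex
\[
  R \;=\; \prod_{i=1}^{n} p_i ,
  \qquad
  \text{and if every step has the same rate } p:\quad R = p^{\,n}.
\]
```

If step failures are correlated, the product is only an approximation of the true pipeline reliability.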
06 / Context economy
Progressive disclosure: context cost = index cost + the body cost of only the triggered skills.
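The context cost can be stated in the same style. The notation is assumed for illustration: $\mathcal{S}$ is the full skill library, $T \subseteq \mathcal{S}$ the triggered skills, and $c_{\text{index}}(s)$, $c_{\text{body}}(s)$ the token costs of a skill's index entry and full body.

```latex
\[
  C_{\text{progressive}}
  \;=\; \sum_{s \in \mathcal{S}} c_{\text{index}}(s)
  \;+\; \sum_{s \in T} c_{\text{body}}(s),
  \qquad
  C_{\text{load-all}}
  \;=\; \sum_{s \in \mathcal{S}} \bigl( c_{\text{index}}(s) + c_{\text{body}}(s) \bigr).
\]
```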
03 / Code
Code cases: define, select, compose, and test skills
Use plain Python to simulate the beginner logic of skills: write a procedure as a contract, select by description, compose a pipeline, compute reliability, do progressive disclosure, and write tests.
Case 1: select a skill automatically from the task description
Do not hard-code which skill to call. Let the agent select by overlap between task and skill descriptions.
REGISTRY = {
    "clean_panel": "balance a panel drop duplicates handle missing values",
    "event_study": "panel staggered treatment event time relative dummies",
    "make_table": "render regression results into a publication table",
}

def select(task, registry):
    # Score each skill by the number of words its description shares with the task.
    tok = set(task.lower().split())
    scored = {name: len(tok & set(desc.split())) for name, desc in registry.items()}
    return max(scored, key=scored.get), scored

task = "estimate an event study for a staggered treatment in panel data"
best, scores = select(task, REGISTRY)
print("scores:", scores)
print("selected skill:", best)
Expected output
scores: {'clean_panel': 2, 'event_study': 4, 'make_table': 1}
selected skill: event_study
How to read this code
- The most overlapping skill is selected — an argmax over similarity.
- Real systems use embedding similarity, but the idea is the same: good descriptions enable good selection.
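As a step between raw word overlap and learned embeddings, the same selection can be sketched with cosine similarity over term-count vectors. This is a stand-in for embedding similarity, not an embedding model; the `STOPWORDS` list and the `vectorize`, `cosine`, and `select_cosine` helpers are assumptions introduced for illustration.

```python
import math
from collections import Counter

# Illustrative stopword list, not exhaustive.
STOPWORDS = {"a", "an", "the", "for", "in", "into", "of"}

def vectorize(text):
    """Term-count vector with stopwords removed."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

REGISTRY = {
    "clean_panel": "balance a panel drop duplicates handle missing values",
    "event_study": "panel staggered treatment event time relative dummies",
    "make_table": "render regression results into a publication table",
}

def select_cosine(task, registry):
    t = vectorize(task)
    scored = {name: cosine(t, vectorize(desc)) for name, desc in registry.items()}
    return max(scored, key=scored.get), scored

task = "estimate an event study for a staggered treatment in panel data"
best, scores = select_cosine(task, REGISTRY)
print("selected skill:", best)
print({name: round(s, 2) for name, s in scores.items()})
```

With the stopword "a" removed, make_table no longer picks up a spurious match from a filler word, which is one reason description wording matters for selection.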
Case 2: compose skills into a gated pipeline
clean -> estimate -> table, each an independent skill, where a failed pre-gate raises instead of proceeding.
def clean_panel(state):
    state["clean"] = True
    return state

def event_study(state):
    # Gate: refuse to estimate on an uncleaned panel.
    assert state.get("clean"), "needs a clean panel first"
    state["coefs"] = [0.01, 0.04, 0.12]
    return state

def make_table(state):
    # Gate: refuse to render a table without estimates.
    assert "coefs" in state, "needs estimates first"
    state["table"] = "outputs/event_study.tex"
    return state

pipeline = [clean_panel, event_study, make_table]
state = {"panel": "firm-year.csv"}
for skill in pipeline:
    state = skill(state)
print("artifacts:", {k: state[k] for k in ("clean", "coefs", "table")})
Expected output
artifacts: {'clean': True, 'coefs': [0.01, 0.04, 0.12], 'table': 'outputs/event_study.tex'}
How to read this code
- One skill's artifact is the next skill's input, forming a research pipeline.
- The gate (assert) prevents proceeding broken — e.g., no estimation before cleaning.
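Python removes `assert` statements when run with the `-O` flag, so an assert-based gate can silently disappear. A minimal sketch of an explicit gate follows; the `GateError` class and the `requires` decorator are hypothetical names introduced here, not part of any library.

```python
import functools

class GateError(RuntimeError):
    """Raised when a skill's precondition is not met (hypothetical name)."""

def requires(*keys):
    """Decorator: refuse to run a skill unless every key is present and truthy in state."""
    def wrap(skill):
        @functools.wraps(skill)
        def gated(state):
            missing = [k for k in keys if not state.get(k)]
            if missing:
                raise GateError(f"{skill.__name__} blocked, missing: {missing}")
            return skill(state)
        return gated
    return wrap

@requires("clean")
def event_study(state):
    state["coefs"] = [0.01, 0.04, 0.12]  # illustrative numbers, as in Case 2
    return state

# Unclean panel: the gate stops the skill before it runs.
blocked = None
try:
    event_study({"panel": "firm-year.csv"})
except GateError as err:
    blocked = str(err)
print("gate:", blocked)

# Clean panel: the skill runs normally.
print(event_study({"panel": "firm-year.csv", "clean": True})["coefs"])
```

Declaring preconditions next to the skill definition also documents the dependency order of the pipeline.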
Case 3: pipeline reliability = product of per-skill pass rates
Three skills with seemingly high pass rates, chained together, have a noticeably lower overall reliability.
import math

pass_rates = {"clean_panel": 0.99, "event_study": 0.95, "make_table": 0.98}
# Assumes the skills pass or fail independently of each other.
R = math.prod(pass_rates.values())
print("per-skill pass rates:", pass_rates)
print(f"pipeline reliability = {R:.3f}")
print(f"failure rate = {1 - R:.3f} -> test each skill, do not trust the chain")
Expected output
per-skill pass rates: {'clean_panel': 0.99, 'event_study': 0.95, 'make_table': 0.98}
pipeline reliability = 0.922
failure rate = 0.078 -> test each skill, do not trust the chain
How to read this code
- 0.99 × 0.95 × 0.98 ≈ 0.92, a failure rate near 8%.
- This is why you test each skill rather than trust the whole chain.
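The same product rule shows how quickly reliability decays with chain length and what per-step pass rate a target would require. This is a sketch under the independence assumption; the helper names `pipeline_reliability` and `required_per_step` and the 0.95 figures are illustrative choices, not values from a real pipeline.

```python
import math

def pipeline_reliability(pass_rates):
    """Product of per-step pass rates (assumes independent steps)."""
    return math.prod(pass_rates)

def required_per_step(target, n_steps):
    """Per-step pass rate needed so that n equal steps reach the target reliability."""
    return target ** (1.0 / n_steps)

# Reliability of a chain where every step passes 95% of the time.
for n in (1, 3, 5, 10):
    print(f"{n:>2} steps at 0.95 each -> {pipeline_reliability([0.95] * n):.3f}")

# Per-step rate needed for a 5-step chain to reach 0.95 overall.
print(f"needed per step: {required_per_step(0.95, 5):.4f}")
```

Five steps at 0.95 each already fall to roughly 0.77, so a longer pipeline needs each step to be tested to a stricter standard than the target for the whole chain.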
Case 4: progressive disclosure — load a skill body only when triggered
Describe all skills with a light index and load the full body only for the triggered skill.
skills = {
    "clean_panel": {"index": 40, "body": 1200},
    "event_study": {"index": 45, "body": 1500},
    "make_table": {"index": 38, "body": 900},
}
triggered = ["event_study"]  # only this matches the task

# Token counts: the index is always loaded; bodies are loaded only when triggered.
index_only = sum(s["index"] for s in skills.values())
progressive = index_only + sum(skills[n]["body"] for n in triggered)
load_all = index_only + sum(s["body"] for s in skills.values())
print("index-only tokens:", index_only)
print("progressive-disclosure:", progressive)
print("load-everything tokens:", load_all)
Expected output
index-only tokens: 123
progressive-disclosure: 1623
load-everything tokens: 3723
How to read this code
- Index is 123 tokens; load only the triggered skill -> 1623; load all -> 3723.
- Progressive disclosure lets a skill library grow large without flooding the context.
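The loading behavior itself can be sketched as a small library object that keeps the index in context and fetches a body only on first trigger. The `SkillLibrary` class and the `fake_loader` function are hypothetical, introduced for illustration; a real system would read the body from wherever skills are stored.

```python
class SkillLibrary:
    """Hypothetical library: index lines stay in context, bodies load on trigger."""

    def __init__(self, index, load_body):
        self.index = index            # name -> one-line description (always loaded)
        self._load_body = load_body   # callable: name -> full body text
        self._cache = {}              # bodies already loaded

    def context(self, triggered):
        """Return the text that would enter the context for the triggered skills."""
        parts = [f"{name}: {desc}" for name, desc in self.index.items()]
        for name in triggered:
            if name not in self._cache:
                self._cache[name] = self._load_body(name)
            parts.append(self._cache[name])
        return "\n".join(parts)

loads = []  # records which bodies were actually fetched

def fake_loader(name):
    loads.append(name)
    return f"<full body of {name}>"

lib = SkillLibrary(
    {
        "clean_panel": "balance a panel, drop duplicates",
        "event_study": "staggered treatment event study",
        "make_table": "render a publication table",
    },
    fake_loader,
)
ctx = lib.context(["event_study"])
lib.context(["event_study"])  # second call reuses the cached body
print(ctx)
print("bodies fetched:", loads)
```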
Case 5: skills need tests too
Like code, write cases (including edges and failures) for a skill and report the pass rate.
def event_study_skill(n_periods):
    if n_periods < 2:
        raise ValueError("need >= 2 periods")
    return {"ok": True, "periods": n_periods}

test_cases = [5, 3, 1, 8, 0]
results = []
for tc in test_cases:
    # A case passes if the skill runs without raising.
    try:
        event_study_skill(tc)
        results.append(True)
    except Exception:
        results.append(False)
print("results:", results)
print(f"pass rate = {sum(results)/len(results):.2f} (skills need tests, like code)")
Expected output
results: [True, True, False, True, False]
pass rate = 0.60 (skills need tests, like code)
How to read this code
- The pass rate exposes how a skill behaves on edge inputs: here the two failing cases are n_periods = 1 and 0, which the skill rejects with a ValueError.
- An untested skill fails silently as data and dependencies change.
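A stricter variant pairs each input with its expected outcome, so that correctly rejecting a bad input counts as a pass and only unexpected behavior counts as a failure. This is a sketch; the `CASES` table and the `run_case` helper are assumptions introduced for illustration.

```python
def event_study_skill(n_periods):
    if n_periods < 2:
        raise ValueError("need >= 2 periods")
    return {"ok": True, "periods": n_periods}

# (input, expected): expected is "ok" or the exception type the skill should raise.
CASES = [(5, "ok"), (3, "ok"), (1, ValueError), (8, "ok"), (0, ValueError)]

def run_case(arg, expected):
    """Return True when the skill behaves as the case expects."""
    try:
        out = event_study_skill(arg)
    except Exception as exc:
        # Pass only if an exception was expected and it has the expected type.
        return expected != "ok" and isinstance(exc, expected)
    # No exception: pass only if success was expected and the output is consistent.
    return expected == "ok" and out["periods"] == arg

results = [run_case(arg, exp) for arg, exp in CASES]
print("results:", results)
print(f"pass rate = {sum(results) / len(results):.2f}")
```

With expectations attached, a drop in the pass rate signals a regression rather than the presence of edge cases in the test set.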
04 / Case
Case: package an event study into a reusable skill
- You run event studies across many projects: build relative-time dummies, estimate, plot coefficients. Rewriting each time invites errors.
- Package it as an event_study skill: SKILL.md states when to use it and the steps, scripts/ holds rerunnable code, tests/ holds cases.
- Given the task 'estimate an event study for a staggered treatment,' the agent selects the skill by description and checks the panel is cleaned before running (a gate).
- Skills compose: clean_panel -> event_study -> make_table -> robustness_suite, each leaving a trace; overall reliability is the product of pass rates, so every skill needs tests.
05 / Causal
Bridge to causal inference: make identification and robustness into skills
Skills make the key steps of causal research reusable, testable, and auditable: identification checks, estimation, and robustness can each be independent skills composed into a reproducible causal pipeline.
01 / Identification checks as skills (assumption -> skill)
Write parallel-trends, overlap, and instrument relevance / exclusion as check skills, run before estimation.
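An identification check can sit in front of estimation as a gate. The sketch below flags pre-period event-study coefficients whose t-ratio exceeds a critical value. This is a heuristic only: passing it does not establish parallel trends. The function names, the 1.96 threshold, and all numbers are assumptions introduced for illustration.

```python
def pretrend_check(state, z_crit=1.96):
    """Hypothetical check skill: flag pre-period coefficients far from zero.

    state["pre_coefs"] maps relative period -> (coefficient, standard error).
    """
    flagged = [
        period
        for period, (coef, se) in sorted(state["pre_coefs"].items())
        if abs(coef / se) > z_crit
    ]
    state["pretrend_flagged"] = flagged
    state["pretrend_ok"] = not flagged
    return state

def estimate(state):
    # Gate: estimation refuses to run unless the check skill has passed.
    if not state.get("pretrend_ok"):
        raise RuntimeError("pre-trend check failed or not run; needs human review")
    state["estimated"] = True
    return state

# Illustrative numbers only, not real estimates.
state = {"pre_coefs": {-3: (0.010, 0.020), -2: (-0.005, 0.018), -1: (0.050, 0.020)}}
state = pretrend_check(state)
print("pretrend_ok:", state["pretrend_ok"], "flagged periods:", state["pretrend_flagged"])

blocked = None
try:
    estimate(state)
except RuntimeError as err:
    blocked = str(err)
print("estimate:", blocked)
```

The check records which periods were flagged, leaving the decision to proceed with a human rather than the pipeline.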
02 / Estimation as skills (design -> estimator skill)
Package an estimation skill per design (DiD / IV / RD / DML), callable by passing design parameters.
03 / Robustness as a skill suite (estimate -> robustness suite)
Bundle placebo, sensitivity, and honest DiD into one reusable robustness skill suite.
04 / Reliability is quantifiable (chain -> tested chain)
The reliability of the whole causal pipeline is the product of skill pass rates, forcing a test per step.
Three red lines: (1) skills reuse procedures, but identification assumptions still need a human to defend; (2) skills must have tests, or they fail silently when data changes; (3) high-risk skills need a gate and human confirmation — automation must not overreach.
06 / Risks
Common Pitfalls
Resources
References
- Anthropic (2024), Building Effective Agents. https://www.anthropic.com/research/building-effective-agents
- Wang et al. (2023), Voyager: An Open-Ended Embodied Agent with Large Language Models (skill library). https://arxiv.org/abs/2305.16291
- Model Context Protocol. https://modelcontextprotocol.io
- Anthropic, Claude Code documentation. https://docs.anthropic.com/en/docs/claude-code/overview