SkillNet: Create, Evaluate, and Connect AI Skills
Key Summary
- Before SkillNet, AI agents kept solving the same kinds of problems over and over without saving what they learned in a clean, reusable way.
- SkillNet turns helpful know-how into shareable 'skills' that come with clear instructions, tags, and safety checks, so any agent can use them later.
- These skills live in a big, organized map that shows which ones are similar, which ones depend on others, and which ones fit together like Lego pieces.
- New skills are created automatically from many places, like chat logs, GitHub projects, and documents, then filtered, tested, and scored.
- Each skill is graded on safety, completeness, executability, maintainability, and cost-awareness to keep the library trustworthy.
- Across three test worlds (home chores, online shopping, and science labs), agents with SkillNet earned much higher rewards while using fewer steps.
- SkillNet includes a website, an API, and a Python tool so people and agents can search, download, create, evaluate, and connect skills easily.
- This approach helps agents grow steadily from short-term tries to long-term mastery, reducing 'reinventing the wheel' and making complex tasks smoother.
- SkillNet already contains over one hundred fifty thousand curated, high-quality skills and keeps expanding with community help.
- In everyday terms, SkillNet is like a recipe book plus a safety inspector plus a map that shows how recipes combine, so cooks (agents) get better every week.
Why This Research Matters
When agents reuse safe, well-written skills, they finish tasks faster and make fewer mistakes, saving people time and money. Teams can share skills across projects, so a win in one place helps everyone else right away. Companies can keep private skill libraries that capture their best practices, making onboarding smoother and results more consistent. In science and engineering, linked skills let agents run repeatable workflows, improving reliability and reducing hidden errors. For everyday users, this means better assistants that don't just guess; they follow tested steps and improve over time.
Detailed Explanation
01 Background & Problem Definition
You know how when you learn to ride a bike, you don't start from scratch every day; you keep what worked, fix what didn't, and the next ride is easier? Early AI didn't do that very well. For a long time, AI was either very strict and rule-based (great for clarity, but brittle) or very flexible and data-driven (great for power, but hard to understand and reuse piece by piece). Then came AI agents that can plan steps and use tools. They can do multi-step tasks like browsing the web or analyzing data. But there was a catch: after finishing one task, they often didn't save their best tricks in a tidy, reusable form. So, next time, they might stumble again, like forgetting last week's bike tips.
The world before SkillNet looked like this: people glued together prompts, snippets of code, and workflow notes. Each project had its own private set of tricks, and even good ideas stayed trapped in chats, logs, or messy folders. Repositories of agent "skills" started to appear, but most were just lists with little quality control. The result? Agents kept "reinventing the wheel," wasting time, making risky moves, and failing to transfer know-how across tasks.
People tried to fix it. They used prompt engineering to make better instructions, but prompts are easy to break and hard to reuse across contexts. They built memory systems so agents could recall past steps, but memories can be vague, messy, or too long to search quickly. They crafted custom workflows, but those were rigid and hard to adapt. None of these really turned messy experience into clean, testable, and shareable building blocks.
What was missing was a strong, shared way to: (1) turn experience into clear skills with instructions and resources; (2) check those skills for safety and quality; and (3) connect them so agents can find the right skill at the right time and combine them safely. Without this, even smart agents act like students who never write down their best study notes.
Why does this matter in real life? Because agents help with things we care about: drafting reports, comparing products, refactoring code, cleaning data, and even running mini lab experiments in simulations. If they don't reuse what they learn, tasks take longer, cost more, and fail more often. If they reuse unsafe or low-quality steps, bad results become habits. A unified, quality-checked skill library means faster answers, safer actions, and steady improvement over time, just like a school that keeps its best lessons neatly organized so every class gets better year after year.
02 Core Idea
Top Bread (Hook): Imagine a giant recipe book where every recipe is tested for safety, clearly written, and linked to other recipes it works with, so any cook can make a full meal without guessing.
The Concept: SkillNet's key insight is that agents should store what they learn as modular, testable, and connected "skills," not just as loose memories or prompts.
- How it works (big picture):
- Create skills from many sources (logs, code, documents, prompts).
- Evaluate each skill across multiple dimensions (safety, completeness, executability, maintainability, cost-awareness).
- Organize skills in an ontology (categories, relations, and packages) so agents can find, combine, and reuse them.
- Why it matters: Without this, agents keep guessing, repeat work, and risk unsafe or broken steps. With it, they improve steadily and dependably.
Bottom Bread (Anchor): An agent shopping online can fetch a "compare products" skill, safely pair it with a "filter reviews" skill, and then add a "budget checker" skill, finishing in fewer steps with fewer mistakes.
Multiple analogies:
- Library analogy: SkillNet is like a library where every book (skill) is clearly labeled, safety-checked, and shelved with related books, making research fast and reliable.
- Lego analogy: Each skill is a Lego brick with clean connectors; SkillNet shows which bricks snap together and which ones need support pieces.
- Recipe analogy: Each recipe has ingredients (requirements), steps (instructions), tools (code/hooks), safety tips (warnings), and pairings (compose-with), so cooks can make a meal plan, not just a single dish.
Before vs. After:
- Before: Agents rely on long prompts and fuzzy memories; quality varies; reuse is rare; combining steps is risky.
- After: Agents pull well-scored skills off a shelf; execution is faster; reuse is common; compositions are safer and clearer.
Why it works (intuition): Putting knowledge into small, testable units makes it easier to trust and recombine. Scoring skills by safety and quality filters out bad habits. Mapping relationships points agents toward the right next step.
Building blocks in the learning order (with Sandwich explanations):
- Skill Taxonomy
- Hook: You know how a grocery store groups items into aisles like fruits, dairy, and snacks?
- Concept: A skill taxonomy sorts skills into categories and tags so they're easy to browse.
- How: 1) Big categories (like Development, Science). 2) Finer tags (like frontend, biology). 3) Attach tags to each skill.
- Why: Without sorting, you're searching a messy pile.
- Anchor: Looking for a plotting skill? Check the "Data Science" aisle, then the "visualization" shelf.
- Skill Ontology
- Hook: Imagine a school map showing classrooms (skills), hallways (relations), and departments (categories).
- Concept: The skill ontology is the full structure that defines skill types, categories, and relations.
- How: 1) Define categories/tags. 2) List skills as entities. 3) Add relations like depends_on and composes_with.
- Why: Without structure, agents don't know what connects to what.
- Anchor: A "Playwright testing" skill belongs to Testing, tagged with browser-automation, and depends on setup skills.
- Multi-dimensional Evaluation Framework
- Hook: When you grade a project, you don't just look at neatness; you check accuracy, safety, and effort too.
- Concept: Skills are scored along several dimensions, not just "works or not."
- How: 1) Define safety, completeness, executability, maintainability, cost-awareness. 2) Use an LLM judge + sandbox tests. 3) Keep only good ones.
- Why: One score can hide big problems.
- Anchor: A data-cleaning skill passes safety and executability but is marked average in maintainability, so it's flagged for refactor.
- Skill Quality Metrics
- Hook: Like a report card with different subjects.
- Concept: Each dimension (safety, completeness, executability, maintainability, cost-awareness) is a separate score.
- How: 1) Check for harmful actions. 2) Check steps and prerequisites. 3) Test run in a sandbox. 4) Inspect modularity. 5) Estimate time/API cost.
- Why: Low safety or missing steps can break whole workflows.
- Anchor: A scraping skill is safe and cheap but incomplete (missing login step), so it's fixed before use.
- Skill Relation Graph
- Hook: Think of a friendship map that shows who is friends, helpers, or teammates.
- Concept: A graph links skills by similar_to, depends_on, belongs_to, and composes_with.
- How: 1) Find candidate links via embeddings and traces. 2) Confirm with LLM reasoning. 3) Build a typed, directed graph.
- Why: Without a map, agents can't plan multi-step journeys.
- Anchor: "Fetch product specs" composes_with "compare specs," which depends_on "API key setup."
- SkillNet
- Hook: Picture a clean workshop with labeled tools, safety checks, and a guidebook on how tools combine.
- Concept: SkillNet is the end-to-end system to create, evaluate, and connect skills at scale.
- How: 1) Create from diverse sources. 2) Evaluate on multiple dimensions. 3) Organize in an ontology. 4) Serve via website, API, and Python tool.
- Why: This turns scattered tricks into durable, reusable capabilities.
- Anchor: An agent doing lab tasks pulls a validated "pathway analysis" skill directly from SkillNet.
- Automated Skill Creation
- Hook: Imagine a smart assistant that reads your notes and turns them into neat, ready-to-run checklists.
- Concept: LLMs convert logs, code, and docs into standardized SKILL.md packages.
- How: 1) Parse source. 2) Extract steps and requirements. 3) Generate instructions/resources. 4) Tag and test.
- Why: Manual writing doesn't scale to thousands of skills.
- Anchor: Paste a GitHub URL; the system drafts a "component-refactoring" skill with triggers, steps, and safety notes.
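The ontology and relation-graph ideas above can be sketched as a small typed, directed graph. The four relation names follow the types described in this section; the class, methods, and example skills below are illustrative assumptions, not SkillNet's actual schema.

```python
from collections import defaultdict

# Illustrative sketch of a typed, directed skill-relation graph.
# The data model is an assumption; only the relation names come
# from the text.
RELATIONS = {"similar_to", "depends_on", "belongs_to", "composes_with"}

class SkillGraph:
    def __init__(self):
        # relation name -> set of (source skill, target skill) edges
        self.edges = defaultdict(set)

    def add(self, src, relation, dst):
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.edges[relation].add((src, dst))

    def dependencies(self, skill):
        """Transitively resolve depends_on edges for a skill."""
        seen, stack = set(), [skill]
        while stack:
            cur = stack.pop()
            for src, dst in self.edges["depends_on"]:
                if src == cur and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

g = SkillGraph()
g.add("playwright-testing", "belongs_to", "Testing")
g.add("playwright-testing", "depends_on", "browser-setup")
g.add("browser-setup", "depends_on", "node-install")
g.add("fetch-product-specs", "composes_with", "compare-specs")

print(sorted(g.dependencies("playwright-testing")))
# → ['browser-setup', 'node-install']
```

Typed edges let an agent answer planning questions directly, such as "what must be installed before this skill runs," instead of rediscovering prerequisites by trial and error.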
03 Methodology
At a high level: Inputs → Skill Creation → Skill Evaluation → Skill Organization & Analysis → Outputs (usable skills + graphs + packages).
Step A: Skill Creation
- What happens: The system ingests execution trajectories, GitHub repos, office docs (PDF/Word/PPT), and direct prompts. An LLM extracts clear steps, prerequisites, and resources; then writes a SKILL.md with metadata (name, description, usage conditions), instructions, and optional scripts/templates.
- Why it exists: Experience is scattered. Turning it into a standard package makes it reusable and testable.
- Example: From a chat log about refactoring a long React component, SkillNet writes a skill: when to trigger (file length > 300 lines), steps (split into subcomponents, extract hooks), and cautions (don't refactor third-party wrappers).
Secret sub-steps in creation:
- Source parsing: Identify goals, tools, and constraints in the input.
- Instruction synthesis: Draft clear, step-by-step procedures.
- Resource bundling: Include code snippets, templates, or configs.
- Metadata/tagging: Add category and tags for discovery.
- Draft validation: Quick LLM sanity checks.
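The creation step can be pictured as assembling a SKILL.md draft from parsed material. The field names and layout below are assumptions based on the metadata this section mentions (name, description, usage conditions, instructions), not SkillNet's exact format.

```python
# Sketch of turning parsed source material into a SKILL.md draft.
# Field names and layout are illustrative assumptions, not
# SkillNet's actual schema.
def draft_skill_md(name, category, tags, trigger, steps, cautions):
    lines = [
        "---",
        f"name: {name}",
        f"category: {category}",
        f"tags: [{', '.join(tags)}]",
        f"trigger: {trigger}",
        "---",
        "",
        "## Instructions",
    ]
    lines += [f"{i}. {s}" for i, s in enumerate(steps, 1)]
    lines += ["", "## Cautions"]
    lines += [f"- {c}" for c in cautions]
    return "\n".join(lines)

doc = draft_skill_md(
    name="component-refactoring",
    category="Development",
    tags=["frontend", "react"],
    trigger="component file exceeds ~300 lines",
    steps=["Split into subcomponents", "Extract shared stateful logic into hooks"],
    cautions=["Do not refactor third-party wrappers"],
)
print(doc.splitlines()[1])  # → name: component-refactoring
```

In the real pipeline an LLM would fill these fields from logs, repos, or documents; the point of the standardized package is that every skill, however it was sourced, ends up in the same testable shape.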
Step B: Data-Driven Filtering and Consolidation
- What happens: Run deduplication (compare folder structures and file hashes), filter out low-quality drafts, assign categories/tags, then send candidates to evaluation.
- Why it exists: Prevents clutter and keeps the library clean.
- Example: Two nearly identical plotting skills are merged; the clearer one stays.
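The deduplication idea (compare folder structures and file hashes) can be sketched as a content fingerprint per package. Hashing sorted (path, content) pairs is one plausible way to do the comparison; the source does not specify the exact method.

```python
import hashlib

# Sketch of duplicate detection between skill packages: fingerprint
# each package by hashing its sorted (relative path, content) pairs.
# This is an assumed comparison scheme, not SkillNet's documented one.
def fingerprint(package: dict) -> str:
    """package maps relative file paths to file contents (str)."""
    h = hashlib.sha256()
    for path in sorted(package):
        h.update(path.encode())
        h.update(package[path].encode())
    return h.hexdigest()

a = {"SKILL.md": "plot with matplotlib", "plot.py": "import matplotlib"}
b = {"SKILL.md": "plot with matplotlib", "plot.py": "import matplotlib"}
c = {"SKILL.md": "plot with seaborn", "plot.py": "import seaborn"}

print(fingerprint(a) == fingerprint(b))  # True: exact duplicates get merged
print(fingerprint(a) == fingerprint(c))  # False: distinct skills are kept
```

Exact-hash matching only catches byte-identical duplicates; near-duplicates (like the two plotting skills in the example) would additionally need fuzzy comparison of structure or embeddings before merging.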
Step C: Multi-Dimensional Evaluation
- What happens: A rubric-guided LLM judge (plus sandbox execution for code/tools) scores each skill on five dimensions; skills are labeled Good/Average/Poor and only solid ones advance.
- Why it exists: A single pass/fail hides issues like unsafe steps or missing prerequisites.
- Example: A scraping skill passes safety (no deletion), but executability fails in the sandbox because of a missing login cookie; the fix is added before acceptance.
Sandwich mini-briefs for the five dimensions:
- Safety
- Hook: Like wearing a helmet before biking.
- Concept: Safety checks block dangerous or harmful actions.
- How: Scan for risky commands, prompt-injection risks, and permission rules; test in a sandbox.
- Why: One unsafe step can cause big damage.
- Anchor: A file-cleanup skill refuses to run recursive deletes outside a temp folder.
- Completeness
- Hook: A recipe that forgets to list "preheat the oven" will fail.
- Concept: Completeness means all critical steps and prerequisites are present.
- How: Check for inputs, env setup, API keys, and edge cases.
- Why: Missing one step can stall the whole workflow.
- Anchor: A science skill adds "install bioservices" before calling its functions.
- Executability
- Hook: Instructions are only useful if they actually work.
- Concept: Executability verifies the skill runs end-to-end.
- How: Dry-run or sandbox run; catch hallucinated tools or wrong commands.
- Why: Saves time and prevents silent failures.
- Anchor: A CLI test confirms "playwright install" succeeds before tests run.
- Maintainability
- Hook: It's easier to fix a bike with parts you can detach.
- Concept: Maintainability checks if a skill is modular and easy to update.
- How: Look for clean sections, version notes, and backward-compatible steps.
- Why: Fragile skills break when anything changes.
- Anchor: A plotting skill isolates theme settings so updates don't affect data steps.
- Cost-awareness
- Hook: Turning off lights when you leave saves money.
- Concept: Cost-awareness tracks time, compute, and API costs.
- How: Estimate calls, latency, and resource use; suggest cheaper paths when possible.
- Why: Keeps projects fast and affordable.
- Anchor: A review-summarizer switches to batch API calls to cut costs.
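The five-dimension rubric above can be reduced to per-dimension scores plus a Good/Average/Poor label. In the real pipeline the scores would come from an LLM judge and sandbox runs; the numeric thresholds and the "weakest dimension dominates" rule below are illustrative assumptions.

```python
# Sketch of rubric aggregation over the five evaluation dimensions.
# Thresholds and the weakest-link rule are assumptions for illustration.
DIMENSIONS = ("safety", "completeness", "executability",
              "maintainability", "cost_awareness")

def label(scores: dict) -> str:
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    worst = min(scores[d] for d in DIMENSIONS)  # one weak dimension dominates
    if worst >= 0.8:
        return "Good"
    if worst >= 0.5:
        return "Average"
    return "Poor"

# Scraping-skill example from the text: safe and executable elsewhere,
# but incomplete (missing login step) until fixed.
scraping_skill = {"safety": 0.9, "completeness": 0.4,
                  "executability": 0.9, "maintainability": 0.8,
                  "cost_awareness": 0.9}
print(label(scraping_skill))  # → Poor
```

A weakest-link rule captures the section's point that a single averaged score can hide big problems: one unsafe or incomplete dimension should block acceptance even when everything else looks strong.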
Step D: Skill Organization (Ontology + Packages)
- What happens: Accepted skills are placed into categories/tags (taxonomy), linked with relations (relation graph), and bundled into task-oriented packages.
- Why it exists: Organization turns a list into a library you can navigate and compose.
- Example: A "data-science-visualization" package includes matplotlib, seaborn, and plotly skills with clear composes_with links.
Step E: Skill Analysis (Relation Graph)
- What happens: The system discovers relations using embeddings, dependency extraction, trace alignment, and LLM reasoning, then builds a typed, directed graph.
- Why it exists: The graph helps agents choose the next right skill, resolve dependencies, and assemble workflows.
- Example: "KEGG database" depends_on "bioservices setup"; "gene-pathway mapping" composes_with "target validation."
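The first pass of relation discovery, finding candidate links via embeddings, can be sketched as cosine similarity over skill-description vectors. The toy 3-d vectors and the 0.9 threshold are made up for illustration; real embeddings would come from a language model, and candidates would then go to an LLM to confirm and type the edge.

```python
import math

# Sketch of embedding-based candidate link discovery between skills.
# Vectors and threshold are illustrative; the real pipeline would use
# learned embeddings and LLM confirmation.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

embeddings = {
    "fetch-product-specs": [0.9, 0.1, 0.0],
    "compare-specs":       [0.8, 0.2, 0.1],
    "pathway-analysis":    [0.0, 0.1, 0.9],
}

def candidate_links(embeddings, threshold=0.9):
    names = sorted(embeddings)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if cosine(embeddings[a], embeddings[b]) >= threshold]

print(candidate_links(embeddings))
# → [('compare-specs', 'fetch-product-specs')]
```

Similarity alone cannot say whether two close skills are similar_to, composes_with, or one depends_on the other, which is why the section describes a confirmation step with LLM reasoning before the typed edge is added.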
Secret Sauce (whatâs clever):
- Treating know-how as small, testable, and connected units makes growth cumulative, not one-off. Automated creation scales up the library; multi-dimensional evaluation keeps it trustworthy; the relation graph makes smart composition possible.
Interfaces and Tools:
- Website: Browse and download skills; see docs and examples.
- API: Keyword and vector search for integration.
- Python toolkit (skillnet-ai): Search, download, create-from-sources, evaluate, and analyze relations, usable in scripts or CLIs.
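The search interfaces can be pictured as keyword scoring over skill metadata. The actual skillnet-ai API is not documented here, so the in-memory index and `search` function below are stdlib-only stand-ins, not the toolkit's real calls.

```python
# Stand-in for keyword search over a skill index. The real skillnet-ai
# toolkit is not shown in the source, so this index and scoring scheme
# are illustrative assumptions.
index = [
    {"name": "matplotlib-plotting", "tags": ["data-science", "visualization"]},
    {"name": "playwright-testing",  "tags": ["testing", "browser-automation"]},
    {"name": "review-summarizer",   "tags": ["nlp", "e-commerce"]},
]

def search(query: str, index):
    terms = query.lower().split()

    def score(skill):
        text = skill["name"] + " " + " ".join(skill["tags"])
        return sum(term in text for term in terms)

    ranked = sorted(index, key=score, reverse=True)
    return [s["name"] for s in ranked if score(s) > 0]

print(search("visualization plotting", index))  # → ['matplotlib-plotting']
```

The hosted API also offers vector search, which would rank by embedding similarity instead of keyword overlap; the keyword path is simply the easier one to sketch without a model.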
Output:
- A curated, growing repository of high-quality skills; relation graphs for planning; and ready-to-use packages for domains like science workflows, web automation, and coding support.
04 Experiments & Results
The Test: Researchers wanted to see if agents using SkillNet could do complex tasks better and faster. They tested in three simulated worlds: a home environment (ALFWorld), an online shopping world (WebShop), and a science lab world (ScienceWorld). These settings require step-by-step decisions with only partial information, which is perfect for testing whether reusable skills help.
What they measured and why:
- Average reward: Like a final score; higher means the agent did more of the right things.
- Average steps: Fewer steps mean the agent navigated the task more efficiently, with less wandering.
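The two metrics above are simple means over evaluation episodes; the episode numbers below are invented for illustration, not results from the paper.

```python
# Sketch of the two reported metrics: mean reward (higher is better)
# and mean step count (lower is better) over evaluation episodes.
# Episode values are made up for illustration.
episodes = [
    {"reward": 1.0, "steps": 12},
    {"reward": 0.5, "steps": 30},
    {"reward": 1.0, "steps": 15},
]

def avg_reward(eps):
    return sum(e["reward"] for e in eps) / len(eps)

def avg_steps(eps):
    return sum(e["steps"] for e in eps) / len(eps)

print(round(avg_reward(episodes), 2), avg_steps(episodes))  # → 0.83 19.0
```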
The Competition (baselines):
- ReAct: A common method where the agent reasons, acts, then observes, and repeats.
- Expel: The agent learns from previous tries by summarizing lessons and reusing good trajectories.
- Few-Shot: The agent sees one example solution before trying (a simple in-context hint).
How SkillNet was used: For each benchmark, the team synthesized a focused skill collection using expert trajectories. During testing, agents could search, activate, and execute the most relevant skills on the fly. They tried this with different backbone models (from smaller to larger) to see if SkillNet helps across sizes.
The Scoreboard with context:
- With SkillNet, agents earned much higher rewards and used far fewer steps than the baselines. Think of it like going from a B- average to a strong A while also finishing the test faster.
- The improvements were steady across different models: small models benefited (less stumbling), and big models benefited too (less wasted motion and fewer dead ends). This means the gains aren't just because of model power; they're because the skill library truly helps.
What's surprising:
- Executability is often the trickiest part in real life. Even so, the evaluation pipeline managed to keep the curated set strong enough that agents reliably improved across all three worlds.
- The boosts held up on unseen tasks (not just similar repeats), suggesting that well-written, well-linked skills transfer across new goals better than long prompts or vague memory alone.
Make the numbers meaningful:
- Picture a maze: a baseline agent might wander for a long time before finding the exit, earning fewer points. The SkillNet-augmented agent has a pocket map of safe, tested turns. It reaches the goal faster and scores higher. Thatâs the essence of the result: fewer wrong turns, more solid progress.
Takeaway:
- Turning experience into clean skills, scoring those skills carefully, and linking them into a graph gives agents a dependable edge: better outcomes in less time, consistently across tasks and models.
05 Discussion & Limitations
Limitations (honest view):
- Coverage isn't complete: Some niche or private-domain abilities aren't in the library yet, and very subtle "tacit" skills are hard to write down clearly.
- Quality varies in self-constructed skills: Even with filters and tests, a big library can contain skills that need more polishing.
- Security risks remain: Malicious contributions or tricky prompt injections can slip through; safety checks help but aren't perfect.
- Not yet fully end-to-end: There's no one-click pipeline that turns plain-language tasks directly into complete, fully instantiated agents in every domain.
Required resources:
- A capable LLM for creation/evaluation, sandbox environments for execution tests, and storage/search tools for hosting the repository and relation graphs.
When not to use:
- Ultra-novel tasks where no stable procedure exists yet (no skill to extract), or settings with strict compliance rules that forbid third-party instructions without deep review.
- Environments where execution canât be safely sandboxed (e.g., direct production databases without guardrails).
Open questions:
- Automatic skill evolution: How can agents routinely upgrade, split, or merge skills as they learn?
- Neuro-symbolic synergy: How should models, memories, workflows, and skills guide each other dynamically?
- Multi-agent sharing: What's the best way for groups of agents to co-own, trade, and improve skills while keeping safety and provenance clear?
- Robust safety at scale: How to spot and neutralize poisoned or adversarial skills in giant, fast-growing libraries?
Bottom line: SkillNet is a big step toward dependable, reusable agent know-how, but continued work on coverage, safety, and automatic evolution will make it even stronger.
06 Conclusion & Future Work
Three-sentence summary: SkillNet turns scattered experience into clean, reusable, and connected skills that are safety-checked and easy for agents to find and combine. This structure helps agents earn higher rewards in fewer steps across very different tasks by replacing guesswork with tested, composable procedures. In short, it upgrades agents from short-term improvisers to long-term builders of mastery.
Main achievement: A full-lifecycle infrastructureâcreation, evaluation, and connectionâplus a large, curated repository and tools that make skill reuse practical and reliable.
Future directions: Grow private/industry-specific libraries; make automatic skill evolution routine; tighten large-scale safety; deepen the synergy between models, memory, workflows, and skills; and improve one-click pipelines from natural-language goals to assembled agent stacks.
Why remember this: SkillNet shows a simple but powerful idea: package know-how as small, testable, connectable units. It proves that this makes agents stronger and more efficient, turning "one-off cleverness" into "ongoing mastery," the same way good lesson plans help each new class start ahead rather than from zero.
Practical Applications
- Automate data-cleaning and visualization workflows with reusable, tested steps.
- Speed up e-commerce tasks like product search, comparison, and budget checking.
- Support scientific analysis pipelines (e.g., pathway mapping and target validation).
- Improve software engineering with skills for refactoring, testing, and documentation.
- Standardize customer-support playbooks that are safe, clear, and easy to update.
- Create private, domain-specific skill libraries for regulated industries.
- Reduce API costs by using cost-aware skills that batch and cache requests.
- Build multi-step web automation flows (login, scrape, parse, summarize) safely.
- Enable teaching agents to learn once and share via skills across teams.
- Audit and improve existing prompts/code by turning them into evaluated skills.