SkillsBench Research Shows Real Impact of Skills on LLM Agents
SkillsBench, a new benchmark and research project, measured the impact of Skills on large language model (LLM) agents across 84 tasks spanning 11 domains, using 7 model configurations including Claude, Gemini, and Codex. The study found that ready-made Skills raised the average pass rate by 16.2 percentage points, with the largest gains in medicine (+51.9 points) and manufacturing (+41.9 points), while software development saw only a small improvement (+4.5 points).
Interestingly, Skills that models generated for themselves did not improve performance and in some cases degraded it, suggesting that models struggle to distill reliable knowledge on their own. The research also found that 2–3 Skill modules is the optimal number: too many modules, or overly detailed documentation, hurt results.
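To make the "module" finding concrete, here is a minimal sketch of how an agent harness might inject a small number of Skill files into a system prompt. The directory layout, function name, and prompt format are illustrative assumptions, not the actual SkillsBench harness.

```python
# Hypothetical sketch: prepending a capped number of Skill modules to an
# agent's system prompt. Names and layout are assumptions for illustration.
from pathlib import Path

MAX_MODULES = 3  # SkillsBench found 2-3 concise modules worked best


def build_system_prompt(base_prompt: str, skills_dir: str) -> str:
    """Prepend up to MAX_MODULES Skill files (sorted by name) to the prompt."""
    modules = sorted(Path(skills_dir).glob("*.md"))[:MAX_MODULES]
    skill_text = "\n\n".join(p.read_text() for p in modules)
    return f"{skill_text}\n\n{base_prompt}" if skill_text else base_prompt
```

Under this reading, the study's result is a caution against the opposite design: loading every available module, or very long documentation, into context tended to lower pass rates.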
Gemini 3 Flash achieved the best absolute performance with Skills (a 48.7% pass rate), and did so at a lower per-task cost than the larger configurations tested.
For more details, see the SkillsBench project page and the full paper on arXiv.