AI & Cybersecurity
This page provides a starting point for exploring the relationship between “AI” (particularly LLMs) and cybersecurity. Below is an initial collection of benchmarks that measure the offensive and defensive cybersecurity capabilities of current-generation LLMs. The page will be updated continuously as further research is added.
Last update: December 13, 2025.
| Benchmark | Creator/Author | Release Date | Source Code/Demo | Paper/Documentation |
|---|---|---|---|---|
| AthenaBench | Athena Security Group | 2025 | Github | arXiv |
| CAIBench | Alias Robotics, University of Naples Federico II | 2025 | Github | arXiv |
| CyberSecEval 4 | Meta | 2025 | Github | Github Pages |
| CyberGym | UC Berkeley | 2025 | Github | arXiv |
| CyberMetric | TII, University of Oslo, Khalifa University | 2025 | Github | arXiv |
| Cybersecurity AI (CAI) | Alias Robotics | 2025 | Github | Github Pages |
| CySecBench | IEEE, NSS Group, KTH Royal Institute of Technology | 2025 | Github | arXiv |
| CVE-Bench | US AI Safety Institute | 2025 | Github | arXiv |
| ExCyTIn-Bench | Microsoft | 2025 | Github (SecRL) | arXiv |
| Frontier AI Safety Framework | Google DeepMind | 2025 | - | arXiv |
| HonestCyberEval | Alan Turing Institute | 2025 | - | arXiv |
| HTB AI | Hack The Box | 2025 | Hack The Box | - |
| Cybench | Stanford University | 2024 | Github | arXiv |
| SecLLMHolmes | IBM, UNSW Sydney, Boston University | 2024 | Github | arXiv |
| SECURE | Rochester Institute of Technology | 2024 | Github | arXiv |