OpenAI and Paradigm have launched EVMbench, a benchmark that measures how well AI agents can detect, patch, and exploit vulnerabilities in smart contracts.
The benchmark is built from 120 curated vulnerabilities drawn from 40 audits, predominantly from public audit competitions. It also includes scenarios from the security auditing process of the Tempo blockchain, a Layer 1 network designed for high-throughput, low-cost stablecoin payments. This inclusion broadens EVMbench's coverage into payment-oriented smart contract code, mirroring anticipated growth in agentic stablecoin payment activity.
Three Evaluation Modes and Key Insights
EVMbench evaluates AI agents across three distinct capability modes. In detect mode, agents audit smart contract repositories and are scored on how thoroughly they identify known vulnerabilities. In patch mode, agents must modify vulnerable contracts to remove exploitability while preserving full functionality, as verified by automated testing. In exploit mode, agents attempt fund-draining attacks against contracts deployed in a sandboxed blockchain environment, with success validated programmatically.
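To make the exploit-mode validation concrete, here is a minimal sketch of how a programmatic check of a fund-draining attack might work. All names here (`SandboxChain`, `grade_exploit`, `naive_exploit`) are illustrative assumptions, not EVMbench's actual harness, and the toy balance ledger stands in for a real sandboxed chain:

```python
class SandboxChain:
    """Toy stand-in for an isolated blockchain sandbox with a simple balance ledger."""

    def __init__(self, contract_funds: int, attacker_funds: int = 0):
        self.balances = {"contract": contract_funds, "attacker": attacker_funds}

    def transfer(self, src: str, dst: str, amount: int) -> None:
        if amount < 0 or self.balances[src] < amount:
            raise ValueError("invalid transfer")
        self.balances[src] -= amount
        self.balances[dst] += amount


def grade_exploit(chain: SandboxChain, exploit) -> bool:
    """Pass only if the agent-submitted exploit drains the contract into the attacker account."""
    before = chain.balances["attacker"]
    try:
        exploit(chain)            # run the agent's exploit against the sandbox
    except Exception:
        return False              # a reverting exploit scores zero
    return chain.balances["contract"] == 0 and chain.balances["attacker"] > before


# Example: a trivially vulnerable "contract" lets anyone withdraw everything.
def naive_exploit(chain: SandboxChain) -> None:
    chain.transfer("contract", "attacker", chain.balances["contract"])


chain = SandboxChain(contract_funds=1_000)
print(grade_exploit(chain, naive_exploit))  # True: funds were drained
```

The point of such a grader is that the success criterion is binary and checked on-chain state alone, which is one reason iterative agents find exploit mode more tractable than open-ended auditing.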
Across the models tested, performance was notably higher in exploit mode, where success is unambiguous and agents can iterate toward the clear goal of draining funds. GPT-5.3-Codex, running via Codex CLI, scored 72.2% in exploit mode, compared to 31.9% for its predecessor, GPT-5, released roughly six months earlier. Detect and patch modes fell short of full coverage: in detect mode, agents often stopped after finding a single vulnerability rather than completing the full audit, while in patch mode they struggled to remove exploits without breaking contract functionality.
Ecosystem Impact and Defensive Investment
With smart contracts currently securing more than USD 100 billion in open-source crypto assets, OpenAI positions EVMbench as both a measurement tool and an incentive for the security community to integrate AI-assisted auditing into standard workflows. Alongside the benchmark's release, OpenAI announced USD 10 million in API credits to support cybersecurity efforts, with priority given to open-source software and critical infrastructure. The tasks, tooling, and evaluation framework are publicly available.