DARPA, the research arm of the U.S. military, will host a two-year competition called the AI Cyber Challenge (AIxCC), which aims to drive innovation in the next generation of cybersecurity tools. The move comes as the U.S. military explores how artificial intelligence fits into its operations.
“AIxCC represents a first-of-its-kind collaboration between top AI companies, led by DARPA, to create AI-driven systems to help address one of society’s greatest challenges: cybersecurity,” says Perri Adams, DARPA’s AIxCC program director. “In the past decade, we’ve seen the development of promising AI-enabled capabilities. When used responsibly, we see significant potential for this technology to be applied to key cybersecurity issues.” Prizes will be awarded at DEF CON in August 2025: $4 million for first place, $3 million for second and $1.5 million for third.
The DARPA competition comes as the Department of Defense (DOD) grapples with AI’s role in the military. Giving DOD’s generative AI planners pause is a new Stanford University study that examined how ChatGPT performed over a three-month period earlier this year. Alarmingly, the study found that the model’s accuracy in solving difficult math problems varied widely, calling into question whether generative AI gets better with each new version.
“GPT4 accuracy dropped from 97.6 percent in March to 2.4 percent in June and there was a large improvement in GPT 3.5’s accuracy from 7.4 percent to 86.8 percent,” Stanford reported.
From the military’s perspective, the continual improvement of AI is mission critical. Such variability, along with data security concerns and AI’s tendency to “hallucinate” sometimes bizarrely inaccurate results, has made the military reluctant to embrace AI beyond areas such as cybersecurity. ChatGPT’s tendency toward hallucinations has been underlined by a defamation lawsuit against OpenAI over ChatGPT’s claim that a Georgia radio host was an embezzler.
The Stanford study indicates that large language models (LLMs) are unstable, with performance varying widely from month to month, so reliable engineering can’t be built on top of AI platforms. OpenAI has acknowledged the problem and says it is working on the issue. Trust is the bottom line: If ChatGPT is unreliable over time and users don’t understand how it comes to its conclusions, the military is unlikely to use it in the short term. Over the longer term, DOD is likely to develop its own LLMs, but that will be a challenging and time-consuming task.
Concerns over AI-assisted mutually assured destruction (MAD) have extended to Congress, where a bipartisan bill seeks to put into law the requirement that any decision to launch a nuclear weapon not be made by an AI. Whether potential adversaries will follow suit is an open question: None of the world’s nine nuclear powers has signed the Treaty on the Prohibition of Nuclear Weapons, which might provide a framework to address the issue. Similarly, the declarations many countries have made this year regarding the use of AI and autonomy are not legally binding.
Regardless, some AI mission creep into military operations appears inevitable. In recent weeks, the Israel Defense Forces (IDF) revealed that it is using an AI system called Fire Factory to rapidly target air strikes, with the caveat of human supervision. Critics worry that in an all-out conflict with Iran, for example, in which time frames are reduced from hours to minutes, an AI’s decision making may not be fully transparent. Israel’s admission may also encourage other nations to develop AI tools for battlefield use.
And while we are a long way from Hollywood’s AI “killer robots,” Patriot missiles reportedly have an automatic mode that allows them to fire without human intervention if overwhelmed by enemy rockets arriving faster than a human can react, a real concern in an age of hypersonic missiles. The worry is that speed may deepen the fog of war, whether it comes from autonomous AI systems or from supervised AIs that make human miscalculations more likely.