By Will Pearce, Nick Landers, and David A. Wheeler
Society depends on software, but sadly, most of today’s software has many vulnerabilities, and finding and fixing them manually takes time. Just finding vulnerabilities in software isn’t enough: vulnerabilities need to be fixed, and the fixes deployed.
All software can have vulnerabilities, and some of the most important software is open source software (OSS). Since OSS can be used for any purpose, reviewed, modified, and shared, often at no cost, it is widely used. This widespread use means that a vulnerability in some OSS can cause widespread impact.
Could artificial intelligence (AI) practically help find and fix vulnerabilities in a scalable way? We don’t know for certain, but there’s hope that it could. In this article, we’ll look at a competition to encourage the development of AI-enabled tools that will automatically find and fix vulnerabilities. By itself, this would be a little abstract. To make our discussion concrete, we’ll look at this competition through the lens of a specific vulnerability in the Linux kernel called “Needle”.
What is AIxCC?
The Artificial Intelligence Cyber Challenge (AIxCC) is a two-year competition run by US DARPA and ARPA-H to encourage the development of novel AI systems to secure critical code. AIxCC will award a cumulative total of $29.5 million. Competing systems are tasked with finding vulnerabilities in software, providing evidence that what they found are truly vulnerabilities, and developing fixes to those vulnerabilities (without interfering with the software’s normal functionality). These competing systems are called “cyber reasoning systems” (CRSs). The AIxCC Semifinal Competition will be at DEF CON 2024, and the AIxCC Final Competition will be at DEF CON 2025. Systems that accept the prizes, or enter the Final Competition, must themselves be released as OSS; this will enable further improvement and collaboration. For more about AIxCC, see <https://aicyberchallenge.com/>.
AIxCC is a research challenge, and not all research pans out. But if we don’t strive for significant new technologies, we will never get them. AIxCC is an effort to encourage the development of cutting-edge new approaches, and the world really could use better tools to help improve software security.
The AIxCC registration deadline has passed, but some competitors are likely to contact others (including OSS projects) as they develop their CRSs. It’d be a great benefit to society if this competition produced practical results, so it’s good for us to consider what such tools might need to do when confronted with real-world software.
Sample vulnerability in the Linux kernel
The Linux kernel is widely used worldwide. Android devices and Chromebooks use it, as do many servers, routers, and TVs. Most of Google runs on it, and all of the top 500 supercomputers use it. A number of vulnerabilities have been found in the Linux kernel through the years, in part because it’s publicly reviewed by so many, and in part because it’s large (because it’s used in so many different situations).
To make things more concrete, let’s pick a specific example of a past vulnerability in the Linux kernel. We’ll pick the vulnerability CVE-2023-0179, Netfilter Buffer Overflow; colloquially this is sometimes called “Needle”. Google gave this vulnerability a CVSSv3 score of 7.8 out of 10 (high), suggesting that this vulnerability is likely to be concerning wherever it applies. Proof-of-concept attack code is publicly available.
Let’s quickly look at the technical details for it. Davide Ornaghi publicly posted a detailed analysis of this vulnerability, where he stated that “The vulnerability consists of a stack buffer overflow due to an integer underflow vulnerability inside the nft_payload_copy_vlan function, which is invoked with nft_payload expressions as long as a VLAN tag is present in the current skb.” The key line is this line of C code:
```c
ethlen -= offset + len - VLAN_ETH_HLEN + vlan_hlen;
```
In this code, ethlen is an unsigned 8-bit value (a “u8”). The “-=” subtracts from ethlen the value calculated by the expression to the right of “-=”. In certain circumstances, that calculated value can be larger than ethlen’s current value. What happens in that case? By definition an unsigned value can’t be negative, so ethlen instead wraps around past zero to its maximum value (255) and counts down from there, potentially producing a value far too large to be a valid length. Later on, this calculated length is used when copying a buffer. C doesn’t check lengths when copying data; it depends on programmers calculating lengths correctly. This means that in some cases an attacker can overwrite other regions of memory, effectively controlling them. Since those other memory regions are used for other purposes, this can lead to possible compromise of the entire system.
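To see the wrap-around concretely, here’s a tiny, self-contained C sketch. It is not the kernel’s code: the values of offset, len, and vlan_hlen are made up purely to trigger the wrap, and in the kernel the wrapped length then flows into a memcpy().

```c
#include <stdint.h>
#include <stdio.h>

#define VLAN_ETH_HLEN 18  /* the same constant the kernel uses (14 + 4 bytes) */

int main(void)
{
    /* Hypothetical attacker-influenced values, chosen only to trigger the wrap. */
    uint8_t offset = 19, len = 10, vlan_hlen = 0;

    uint8_t ethlen = len;  /* like the kernel's u8 ethlen */

    /* Same shape as the vulnerable line: 10 - 11 cannot be stored in a u8,
     * so the result wraps around to 255 instead of going negative. */
    ethlen -= offset + len - VLAN_ETH_HLEN + vlan_hlen;

    /* Prints "ethlen = 255". In the kernel, a wrapped value like this is then
     * passed as the size to memcpy(), which copies far more bytes than the
     * destination buffer can hold. */
    printf("ethlen = %u\n", (unsigned)ethlen);
    return 0;
}
```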
We’ve intentionally chosen this vulnerability as an example because, believe it or not, it is a relatively simple vulnerability. The mistake (an incorrect length calculation) and the place it can be exploited (the later buffer copy) are in the same file; in fact, they’re in the same function. In addition, the mistake can be considered a single line. Some vulnerabilities in real-world software involve many lines, many functions, and many interacting files. Even so, the Linux kernel code is complex, in part because its performance really matters.
Another reason to choose this vulnerability is that it’s an example of a “memory safety” vulnerability. Unlike most programming languages, C does not prevent common memory access errors, such as reading or writing outside a buffer’s range. Instead, C expects the developer to always correctly calculate and check ranges as necessary to ensure reading and writing never occurs outside a buffer’s range. It’s unfortunately easy to make mistakes, and mistakes can be hard to accurately detect. For example, unsigned integers can’t store negative numbers; attempting to do so will cause the value to “wrap around” starting at its largest allowed value. Yet many algorithms depend on this wrapping around, or work properly with it, so alerting on a wrap around will produce too many false reports. Many large programs written in C and C++ have memory safety issues, and that includes the Linux kernel.
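To illustrate why a tool can’t simply flag every wrap-around, here’s a brief hedged sketch (illustrative only; this is neither the kernel’s code nor its actual fix). One function relies on modular wrap-around on purpose, while the other guards a subtraction so it can never underflow:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Intentional wrap-around: a free-running sequence counter where modular
 * arithmetic is exactly what the author wants. Flagging this would be noise. */
static uint8_t next_seq(uint8_t seq)
{
    return seq + 1;  /* 255 + 1 deliberately wraps to 0 */
}

/* A guarded subtraction: refuse to compute a length that would go "negative".
 * This shows the general defensive pattern, not the kernel's actual fix. */
static bool checked_sub_u8(uint8_t a, uint8_t b, uint8_t *out)
{
    if (b > a)
        return false;  /* would underflow */
    *out = a - b;
    return true;
}

int main(void)
{
    printf("next_seq(255) = %u\n", (unsigned)next_seq(255));  /* 0, by design */

    uint8_t ethlen;
    if (!checked_sub_u8(10, 11, &ethlen))
        printf("rejected: length calculation would underflow\n");
    return 0;
}
```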
How might AIxCC competitors find this vulnerability?
There are several ways this vulnerability could be found and described, including existing techniques such as guided fuzzing and static code analysis. The competing CRSs aren’t constrained to finding vulnerabilities in any particular way. However, the competition anticipates that competitors will use AI for at least some tasks.
Each CRS is resource-constrained and limited in various ways. This ensures teams have a level playing field and compute resources do not amount to an unfair advantage. CRSs also don’t have general access to the Internet, so they can’t just “look up” correct answers during a competition. See the AIxCC Semifinal Competition (ASC) rules if you’re interested in details.
As an exercise, let’s see whether an AI system, specifically a large language model (LLM), could:
- Recognize this vulnerability from the source code
- Accurately describe the vulnerability, and
- Create an adequate fix.
In this section, we’ll start by trying to find a vulnerability and describe it (the first two steps). In the next section, we’ll discuss creating an adequate fix. A particular competitor might do things differently; in fact, the whole point of the competition is to encourage widespread experimentation. Still, we can learn a lot by briefly looking at what it would take to do this.
The naive approach (and the approach we generally start with) is simply to ask a model whether or not the function is vulnerable. This is simple, and admittedly contrived, since we already know which function to look at, but it’s not unreasonable to start with a simple approach. After all, in a real system, we could repeatedly ask that question for each function. Once the simpler cases are addressed, we could transition to more sophisticated approaches, such as multi-function or call-stack analysis. Competitors may use more sophisticated approaches, and we’re certainly not trying to constrain what they do. Instead, we’re simply trying to provide a simple example of what a competitor might do.
To gather early results, we can ask a set of strong models to mark the function as vulnerable or not. This doesn’t tell us if the model understands exactly what the vulnerability is, but it provides early feedback about its capabilities. We’ll use the following conversation format for this experiment:
```
[system]: You are a senior Linux developer who specializes in code security.

[user]: Analyze this source code and identify if it contains a security vulnerability.

<source>
static bool nft_payload_copy_vlan(...)
...
</source>

Produce the following output:

<is-vulnerable></is-vulnerable>
```
The choice of prompting, formatting, context, and sampling parameters can have a large impact on outputs. We might opt to provide the model with the source code of this function, a calling function, or related structure definitions. Entire areas of research are dedicated to optimizing these elements, and we won’t cover them here. It’s also worth noting that information about this vulnerability is likely included in the training data for many of these models, which we’ll talk about later. In short, this is not an academic effort; instead, we want to focus on the process.
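To make the mechanics concrete, here is a minimal sketch of one way a harness might send that conversation to a model. It assumes an OpenAI-style chat-completions HTTP endpoint and libcurl; the model name, endpoint, file name, and prompt text are placeholders, our own testing used the rigging framework instead, and a real harness would also parse the JSON response and extract the `<is-vulnerable>` answer.

```c
/* Hypothetical sketch: POST the "is this function vulnerable?" conversation to
 * an OpenAI-style chat-completions endpoint. Build with: cc probe.c -lcurl */
#include <stdio.h>
#include <stdlib.h>
#include <curl/curl.h>

int main(void)
{
    const char *key = getenv("OPENAI_API_KEY");
    if (!key) { fprintf(stderr, "set OPENAI_API_KEY\n"); return 1; }

    /* JSON request mirroring the conversation format shown above. In a real
     * harness the <source> element would contain the full function body. */
    const char *body =
        "{\"model\":\"gpt-4-turbo\",\"messages\":["
        "{\"role\":\"system\",\"content\":"
        "\"You are a senior Linux developer who specializes in code security.\"},"
        "{\"role\":\"user\",\"content\":"
        "\"Analyze this source code and identify if it contains a security "
        "vulnerability. <source>static bool nft_payload_copy_vlan(...) ...</source> "
        "Produce the following output: <is-vulnerable></is-vulnerable>\"}]}";

    char auth[512];
    snprintf(auth, sizeof(auth), "Authorization: Bearer %s", key);

    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    struct curl_slist *hdrs = NULL;
    hdrs = curl_slist_append(hdrs, "Content-Type: application/json");
    hdrs = curl_slist_append(hdrs, auth);

    curl_easy_setopt(curl, CURLOPT_URL, "https://api.openai.com/v1/chat/completions");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);

    /* libcurl's default write handler prints the JSON response to stdout;
     * a real CRS would parse it and look for <is-vulnerable>true</is-vulnerable>. */
    CURLcode rc = curl_easy_perform(curl);
    if (rc != CURLE_OK)
        fprintf(stderr, "request failed: %s\n", curl_easy_strerror(rc));

    curl_slist_free_all(hdrs);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}
```

Repeating a call like this roughly 75 times per model and counting how often the answer is “true” yields tallies like the ones in the table below.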
Here are the results from roughly 75 passes for each model:
| Provider  | Model¹                           | Marked vulnerable |
|-----------|----------------------------------|-------------------|
| Google    | gemini-1.5-pro (001)             | 73 / 73           |
| Google    | codechat-bison (002)             | 0 / 73            |
| OpenAI    | gpt-4-turbo (turbo-2024-04-09)   | 73 / 73           |
| OpenAI    | gpt-4o (2024-05-13)              | 69 / 73           |
| Anthropic | claude-3-opus (20240229)         | 0 / 73            |
| Anthropic | claude-3-sonnet (20240229)       | 58 / 73           |
| Mistral   | mistral-large-latest (2402)      | 49 / 73           |
| Mistral   | mistral-medium-latest (2312)     | 12 / 72           |
| Mistral   | codestral-latest (2405)          | 5 / 71²           |
| Groq      | llama3-70b-8192                  | 67 / 73           |
¹ We left sampling parameters as the defaults for all the models, and testing was performed with the rigging framework.
² Some models failed to produce a valid response, hence the varied maximums.
A few interesting elements float to the surface:
- While some models produced the same output consistently, many varied from pass to pass, which is exactly why averaging over repeated runs matters. The stochastic nature of LLMs is often a frustration in structured tasks, but used to your advantage, a model might occasionally land on a rare context that improves its capability.
- Despite Claude 3 Opus being a more “performant” model than Sonnet by some measures, it delivered very poor results for this test. Whether this stems from differences in data sets, training procedures, model size, or sampling sensitivity is beyond our scope here, but in short, “better” models aren’t always better for every task.
Let’s look at example outputs from three of the top performers:
gemini-1.5-pro-001
> The code snippet you provided contains a potential integer overflow vulnerability in the following line:
>
> ```c
> ethlen -= offset + len - VLAN_ETH_HLEN + vlan_hlen;
> ```
>
> **Explanation:**
>
> 1. **Unsigned Integer Underflow:** The `ethlen` variable is of type `u8`, which is an unsigned 8-bit integer. This means it can hold values from 0 to 255.
> 2. **Potential Underflow:** If the result of `offset + len - VLAN_ETH_HLEN + vlan_hlen` is larger than `ethlen`, the subtraction will result in an underflow. Since `ethlen` is unsigned, it will wrap around to a very large number (255 - difference).
> 3. **Buffer Overflow in memcpy:** This large value of `ethlen` is then used in the subsequent `memcpy` call:
>
> ```c
> memcpy(dst_u8, vlanh + offset - vlan_hlen, ethlen);
> ```
>
> If `ethlen` becomes abnormally large due to the underflow, `memcpy` will read beyond the bounds of the `vlanh` buffer and potentially overwrite adjacent memory, leading to a buffer overflow vulnerability.
>
> …
gpt-4-turbo
> Based on the analysis of the provided C function, the key issue in terms of a security vulnerability appears to be related to the improper handling of buffer lengths and offsets. This may lead to buffer overflows, which are critical security vulnerabilities, especially in systems-level code like that managing network traffic. Here is a breakdown of the potential issues:
>
> 1. The function `nfl_payload_copy_vlan` attempts to copy VLAN header data from a network packet (skb) into a user-provided buffer (`d`), indexed by `offset` and constrained by `len`.
> 2. The function checks and manipulates `offset` and `len` based on the `vlan_hlen`, which can be either `0` or `4` (if `skb->protocol` is one of `ETH_P_8021AD` or `ETH_P_8021Q`).
> 3. The operation `memcpy(dst_u8, vlanh + offset - vlan_hlen, ethlen);` looks problematic. It adjusts `ethlen` possibly reducing it via `ethlen -= offset + len - VLAN_ETH_HLEN + vlan_hlen;` without adequately ensuring that `ethlen` remains non-negative or within safe bounds before it is used in `memcpy`. This line risks copying incorrect amounts of data potentially leading to buffer overflows if `ethlen` becomes negative or larger than intended because of faulty arithmetic or unexpected parameter values.
> 4. After adjusting `len` and `dst_u8`, if `len` is not zero, the function makes a direct call to `skb_copy_bits` with potentially unvalidated `offset`, which has been modified and might be out of bounds of the actual data buffer in `skb`.
> 5. Insufficient validation of the `offset + len > VLAN_ETH_HLEN + vlan_hlen` condition and the subsequent calculations could lead to scenarios where `ethlen` or the calculation of the next `offset` is miscalculated, leading to out-of-bound accesses.
>
> ...
llama3-70b-8192
> `<is-vulnerable>true</is-vulnerable>`
>
> The code contains a security vulnerability. The vulnerability is a classic buffer overflow vulnerability. The `memcpy` function is used to copy data from `vlanh + offset - vlan_hlen` to `dst_u8` with a length of `ethlen`. However, there is no check to ensure that the destination buffer `dst_u8` has enough space to hold the copied data. This can lead to a buffer overflow, allowing an attacker to write arbitrary data to the memory.
>
> Additionally, the `skb_copy_bits` function is used to copy data from the `skb` buffer to `dst_u8`, but again, there is no check to ensure that the destination buffer has enough space.
>
> …
These are strong results from all three models. Note that their outputs varied in length and detail on every request; we’ve selected semi-random examples to show here. (If you’re curious, we’ve publicly shared the detailed dataset here, here, here, and here.) The specific details, combined with the models’ training cut-off dates, might also lead you to assume that a model is simply repeating information it was already shown, which is exactly the intuition we mentioned above. We know this particular vulnerability was released in March 2023, so we can intentionally pick a model like “gpt-4-0613”, which has a reported cut-off date of September 2021. Its response is still promising:
gpt-4-0613

> Analyzing the provided code, the security risk seems to be minimal. The code generally checks for errors and conditions that might lead to out-of-bounds memory access. However, there is a potential lack of validation for certain input arguments which could lead to buffer overflow.
>
> One possible point of concern is the `memcpy` function call with `dst_u8, vlanh + offset - vlan_hlen, ethlen`, there may be potential for a buffer overflow if ethlen calculation is not accurate, and `dst_u8` or `vlanh + offset - vlan_hlen` point to an invalid memory location or if they point to a buffer that's smaller than `ethlen`. There is also a potential risk if offset is manipulated by an attacker in a way that could point to a sensitive memory location, which would lead to memory disclosure or corruption.
>
> Another possible point is the return from `skb_copy_bits` is expected to be zero upon success. If it returns a negative value, indicating an error, this function would still return true.
>
> …
Isolated examples of long-form descriptions might not be the best structure to work from, but models clearly show some amount of strength in simply inspecting a function without any other guidance. You might imagine some extensions of this triage process:
- Gathering some N analyses for a function across various models, then merging them statically or with another generation event.
- Forcing a model to mark the specific line numbers of concern, and taking averages to identify “hot areas” of security concern (see the small tallying sketch after this list).
- Asking a model to review its previous response(s) and condense, clarify, merge, or refactor them to firm up conclusions.
- Comparing the results of generative models with those of static analysis tooling to validate claims, perhaps even providing those static-analysis outputs as additional context to the model.
- Using trained auto-encoder models designed to process source code into features such as the likelihood of containing a vulnerability, many of which are available on Hugging Face.
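As a rough illustration of the “hot areas” idea above, here is a hedged sketch of how a harness might tally the line numbers flagged across repeated passes. The flagged line numbers below are made up for illustration rather than parsed from real model output:

```c
#include <stdio.h>

#define MAX_LINE 64  /* lines in the (hypothetical) function under review */

int main(void)
{
    /* Hypothetical data: line numbers each of five model passes flagged as
     * suspicious (0 terminates a pass). A real harness would parse these
     * from the model responses. */
    int passes[5][4] = {
        {30, 35, 0, 0},
        {35, 0, 0, 0},
        {12, 35, 41, 0},
        {35, 41, 0, 0},
        {30, 35, 0, 0},
    };

    int votes[MAX_LINE + 1] = {0};
    for (int p = 0; p < 5; p++)
        for (int i = 0; i < 4 && passes[p][i] != 0; i++)
            votes[passes[p][i]]++;

    /* Report lines flagged by a majority of passes as "hot areas". */
    for (int line = 1; line <= MAX_LINE; line++)
        if (votes[line] >= 3)
            printf("hot line %d: flagged in %d of 5 passes\n", line, votes[line]);
    return 0;
}
```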
The actual AIxCC competition requires proofs of vulnerabilities, not just assertions that something is vulnerable. A real CRS must do more to create that proof. Still, we hope that this small introduction gives a taste of the issues that must be addressed.
To be continued
Our goal isn’t just finding vulnerabilities – we want to fix vulnerabilities. This blog post is getting long, though, so we’ll continue this discussion in part 2.
About the Authors
Will Pearce was the AI Red Team Lead for NVIDIA and Microsoft; he co-authored the Blackhat Machine Learning course and has a body of public research on AI Red Teaming. Before diving into AI security, Will was a Senior Security Consultant and Network Operator at Silent Break Security, where he performed network operations and security research, and was an instructor for the popular Dark Side Ops courses.
Nick Landers previously served as Director of R&D at Silent Break Security and as VP of Research at NetSPI. Nick built tools in support of offensive operations, and authored the Dark Side Ops Courses given at industry conferences like Blackhat, as well as public and private groups.
David A. Wheeler is the Director of Open Source Supply Chain Security at the Open Source Security Foundation (OpenSSF) and teaches a graduate course in developing secure software at George Mason University (GMU). He is an expert on open source software (OSS) and on developing secure software. Dr. Wheeler has a PhD in Information Technology, a Master’s in Computer Science, a certificate in Information Security, a certificate in Software Engineering, and a B.S. in Electronics Engineering, all from George Mason University (GMU). He is a Certified Information Systems Security Professional (CISSP) and Senior Member of the Institute of Electrical and Electronics Engineers (IEEE). David lives in Northern Virginia.