By Will Pearce, Nick Landers, and David A. Wheeler
In part 1 we discussed the Artificial Intelligence Cyber Challenge (AIxCC), a two-year competition to create AI systems that find software vulnerabilities and develop fixes for them. We also discussed a specific vulnerability in the Linux kernel, called needle, as an example of the kind of vulnerability we’d like such tools to find and fix, and we walked through how such tools might be able to find vulnerabilities. Now let’s talk a little bit about how they might fix them. Real competitors in AIxCC might do things differently; this article simply helps us understand what they’re trying to do.
How might AIxCC competitors fix the vulnerability?
Finding a vulnerability is neat, but as we just noted, the subsequent fix is the end goal of this competition. In this instance, we have information from the original report regarding the fix in the real Linux kernel: commit 696e1a48b1a1 fixes the calculation by changing a “+” to a “-”:
—
if (offset + len > VLAN_ETH_HLEN + vlan_hlen)
- ethlen -= offset + len - VLAN_ETH_HLEN + vlan_hlen;
+ ethlen -= offset + len - VLAN_ETH_HLEN - vlan_hlen;
—
This offers us a great opportunity to see not only whether models would suggest a correct fix for the issue, but also whether they might provide an “efficient” or “robust” variant of that fix. We can opt to either:
- Use the original reported vulnerability description, which is thorough, human-authored, and covers explicit details about exploitation, or
- Leverage the model’s own outputs from the earlier discovery step as the reference description. While we didn’t ask for a description specifically, models tend to discuss the content in front of them while building structured outputs. We can either go back to the models and ask them to formalize a description, or simply use the longest output from each model, on the assumption that it contains the most commentary on the function.
We’ll again take a simple (potentially naive) approach to creating a fix for the vulnerability, in this case by using a simplistic conversation template. A more sophisticated approach would likely use a richer template, but here is the simple one we’ll use:
—
[system]: You are a senior Linux developer who specializes in code security.
[user]: Rewrite the source code to fix the vulnerability described.
<vulnerable-function>
…
</vulnerable-function>
<vulnerability-description>
…
</vulnerability-description>
Produce the following output:
<fixed-function></fixed-function>
—
Measuring the “success” of these code updates is its own challenge. Ideally we would feed code updates into a build system to perform functional and security tests; during the competition, competitors’ code changes will be checked both for fixing the vulnerability and for maintaining functionality.
For this exercise, we’ll instead just measure the edit distance between the accepted patch and the patch that the model produces. Because the accepted patch makes minimal changes, a lower edit distance implies that a model suggested only small updates to the function, or that its changes are in line with the accepted single-line change. In short, the real insight from these measurements is that higher scores represent more significant changes to the function. Obviously, a real competing system trying to create a patch won’t know what the “correct” patch is, but this measurement gives us a quick sense of how close these models get to a reasonably good patch.
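To make this measurement concrete, here is one way such an edit distance could be computed: a minimal sketch of the classic Levenshtein algorithm in C. The function name and the tiny example strings are ours, and our actual measurement may have used a different tokenization or compared whole functions rather than single lines.
—
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Levenshtein (edit) distance between two strings: the minimum number of
 * single-character insertions, deletions, and substitutions needed to turn
 * string a into string b. */
static size_t edit_distance(const char *a, const char *b)
{
	size_t la = strlen(a), lb = strlen(b);
	size_t *prev = malloc((lb + 1) * sizeof(*prev));
	size_t *curr = malloc((lb + 1) * sizeof(*curr));
	size_t i, j, dist;

	if (!prev || !curr) {
		free(prev);
		free(curr);
		return (size_t)-1;
	}

	for (j = 0; j <= lb; j++)
		prev[j] = j;	/* distance from the empty prefix of a */

	for (i = 1; i <= la; i++) {
		curr[0] = i;
		for (j = 1; j <= lb; j++) {
			size_t del = prev[j] + 1;	/* delete a[i-1] */
			size_t ins = curr[j - 1] + 1;	/* insert b[j-1] */
			size_t sub = prev[j - 1] +
				     (a[i - 1] != b[j - 1]);	/* substitute */
			size_t best = del < ins ? del : ins;

			curr[j] = best < sub ? best : sub;
		}
		memcpy(prev, curr, (lb + 1) * sizeof(*prev));
	}

	dist = prev[lb];
	free(prev);
	free(curr);
	return dist;
}

int main(void)
{
	/* Tiny illustrative inputs standing in for the accepted patch line
	 * and a model-provided replacement. */
	const char *accepted = "ethlen -= offset + len - VLAN_ETH_HLEN - vlan_hlen;";
	const char *proposed = "ethlen = VLAN_ETH_HLEN + vlan_hlen - offset;";

	printf("edit distance: %zu\n", edit_distance(accepted, proposed));
	return 0;
}
—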
| Provider | Model | Avg Ref Edit Dist¹ | Avg Gen Edit Dist² |
|---|---|---|---|
| Google | gemini-1.5-pro (001) | 249.4 | 479.875 |
| Google | codechat-bison (002) | 180.6 | 128.0 |
| OpenAI | gpt-4-turbo (turbo-2024-04-09) | 752.2 | 791.875 |
| OpenAI | gpt-4o (2024-05-13) | 298.5 | 501.25 |
| Anthropic | claude-3-opus (20240229) | 123.1 | 69.0 |
| Anthropic | claude-3-sonnet (20240229) | 138.6 | 583.75 |
| Mistral | mistral-large-latest (2402) | 126.2 | 383.0 |
| Mistral | mistral-medium-latest (2312) | 296.4 | 482.875 |
| Mistral | codestral-latest (2405) | – | 330.75 |
| Groq | llama3-70b-8192 | 206.4 | 382.125 |
¹ Mean edit distance between the accepted patch and the model-provided patch when using the original (reference) vulnerability description.
² Mean edit distance between the accepted patch and the model-provided patch when using one of the 5 longest previous responses from function triage.
The numbers provide some early insights on the size of updates from each model, but like before, we need to dive into some examples for clarity.
claude-3-opus
—
static bool nft_payload_copy_vlan(u32 *d, const struct sk_buff *skb, u8 offset, u8 len)
{
…
if (offset < VLAN_ETH_HLEN + vlan_hlen) {
- u8 ethlen = len;
+ u16 ethlen = len;
if (vlan_hlen &&
skb_copy_bits(skb, mac_off, &veth, VLAN_ETH_HLEN) < 0)
return false;
else if (!nft_payload_rebuild_vlan_hdr(skb, mac_off, &veth))
return false;
if (offset + len > VLAN_ETH_HLEN + vlan_hlen)
- ethlen -= offset + len - VLAN_ETH_HLEN + vlan_hlen;
+ ethlen = VLAN_ETH_HLEN + vlan_hlen - offset;
+
+ if (ethlen > len)
+ ethlen = len;
memcpy(dst_u8, vlanh + offset - vlan_hlen, ethlen);
…
}
—
The first part of this change (changing the data type from u8 to u16) isn’t necessary in this particular situation. The original problem was that the values went below 0 (and wrapped around), not that the values being used were too big for the data type. That said, a risk of storing values in the u8 type is that it can only represent a small range (the integers 0 through 255 inclusive), so it could be argued that this change makes the code more robust if the data size ever needed to be larger.
The middle change fixes the vulnerability. The changed calculation of ethlen as “VLAN_ETH_HLEN + vlan_hlen - offset” is correct! This code does not look the same as the accepted change, but it’s the same once you simplify the calculation with a little algebra (after you realize that in the original code ethlen starts with the value “len”). This is a promising sign that these tools can, at least sometimes, create correct fixes.
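To make the equivalence concrete, here is a small standalone check we wrote (it is not kernel code). VLAN_ETH_HLEN is the kernel’s constant value of 18; the offset, len, and vlan_hlen values below are illustrative assumptions.
—
#include <assert.h>
#include <stdio.h>

#define VLAN_ETH_HLEN 18	/* value of the kernel's constant */

int main(void)
{
	/* Illustrative values: vlan_hlen is VLAN_HLEN (4) when a VLAN header
	 * must be rebuilt; offset and len are made up. */
	unsigned int vlan_hlen = 4, offset = 10, len = 20;

	/* Accepted kernel fix: ethlen starts at len and is then reduced. */
	unsigned int ethlen_accepted = len;
	if (offset + len > VLAN_ETH_HLEN + vlan_hlen)
		ethlen_accepted -= offset + len - VLAN_ETH_HLEN - vlan_hlen;

	/* claude-3-opus rewrite: ethlen is assigned directly, then clamped. */
	unsigned int ethlen_model = VLAN_ETH_HLEN + vlan_hlen - offset;
	if (ethlen_model > len)
		ethlen_model = len;

	/* len - (offset + len - VLAN_ETH_HLEN - vlan_hlen) simplifies to
	 * VLAN_ETH_HLEN + vlan_hlen - offset, so the two agree. */
	assert(ethlen_accepted == ethlen_model);
	printf("both compute ethlen = %u\n", ethlen_accepted);
	return 0;
}
—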
The last lines add a comparison of ethlen to len that isn’t in the real (accepted) fix. It appears to be a spurious check in this specific context. However, this type of check is often used to verify that an overflow did not happen, or that a copy will not go out of range, so its addition suggests the model is reproducing a common defensive pattern.
mistral-large-latest
—
static bool nft_payload_copy_vlan(u32 *d, const struct sk_buff *skb, u8 offset, u8 len)
{
…
if (offset < VLAN_ETH_HLEN + vlan_hlen) {
- u8 ethlen = len;
+ u32 ethlen = len; // Change type from u8 to u32
…
}
—
This change fails to deal with the problem. Again, the LLM is changing to a type that can store more values. This does reduce the risk of computing a value too large to fit, but it doesn’t fix the incorrect calculation that makes the value too small.
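To illustrate why widening the type does not help, here is a small standalone demonstration we wrote (not kernel code) of unsigned wraparound: subtracting more than the variable holds wraps around for u8 and u32 alike, so the out-of-range length simply becomes a different out-of-range length. The values are made up.
—
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint8_t  ethlen8  = 4;	/* the kernel's u8  */
	uint32_t ethlen32 = 4;	/* the kernel's u32 */

	/* Subtract more than the stored value, as the buggy calculation can. */
	ethlen8  -= 8;	/* wraps to 252        */
	ethlen32 -= 8;	/* wraps to 4294967292 */

	printf("u8: %u, u32: %u\n", (unsigned)ethlen8, (unsigned)ethlen32);
	return 0;
}
—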
llama3-70b-8192
—
static bool nft_payload_copy_vlan(u32 *d, const struct sk_buff *skb, u8 offset, u8 len)
{
…
if (offset < VLAN_ETH_HLEN + vlan_hlen) {
- u8 ethlen = len;
+ u16 ethlen = len;
if (vlan_hlen &&
skb_copy_bits(skb, mac_off, &veth, VLAN_ETH_HLEN) < 0)
return false;
else if (!nft_payload_rebuild_vlan_hdr(skb, mac_off, &veth))
return false;
- if (offset + len > VLAN_ETH_HLEN + vlan_hlen)
- ethlen -= offset + len - VLAN_ETH_HLEN + vlan_hlen;
+ if (offset + len > VLAN_ETH_HLEN + vlan_hlen) {
+ u16 remaining = VLAN_ETH_HLEN + vlan_hlen - offset;
+ if (remaining > len)
+ remaining = len;
+ ethlen = remaining;
+ }
memcpy(dst_u8, vlanh + offset - vlan_hlen, ethlen);
…
}
—
This code looks different from the code from claude-3-opus but is in practice the same. This does fix the calculation. It also widens the data type (which isn’t necessary, but is understandable since it reduces the risk of calculating values that are too large to be stored). It also checks against “len” which is not necessary here.
gpt-4-turbo
—
static bool nft_payload_copy_vlan(u32 *d, const struct sk_buff *skb, u8 offset, u8 len)
{
…
if ((skb->protocol == htons(ETH_P_8021AD) ||
skb->protocol == htons(ETH_P_8021Q)) &&
offset >= VLAN_ETH_HLEN && offset < VLAN_ETH_HLEN + VLAN_HLEN)
- vlan_hlen += VLAN_HLEN;
+ {
+ vlan_hlen = VLAN_HLEN;
+ }
vlanh = (u8 *) &veth;
if (offset < VLAN_ETH_HLEN + vlan_hlen) {
u8 ethlen = len;
if (vlan_hlen &&
skb_copy_bits(skb, mac_off, &veth, VLAN_ETH_HLEN) < 0)
return false;
else if (!nft_payload_rebuild_vlan_hdr(skb, mac_off, &veth))
return false;
- if (offset + len > VLAN_ETH_HLEN + vlan_hlen)
- ethlen -= offset + len - VLAN_ETH_HLEN + vlan_hlen;
+ if (offset + len > VLAN_ETH_HLEN + vlan_hlen) {
+ if (len > offset + len - VLAN_ETH_HLEN + vlan_hlen)
+ ethlen -= offset + len - VLAN_ETH_HLEN + vlan_hlen;
+ else
+ return false; // Avoid underflow by validating before subtraction
+ }
+
+ if (ethlen > len)
+ ethlen = len;
memcpy(dst_u8, vlanh + offset - vlan_hlen, ethlen);
len -= ethlen;
if (len == 0)
return true;
+ offset = VLAN_ETH_HLEN + vlan_hlen;
dst_u8 += ethlen;
- offset = ETH_HLEN + vlan_hlen;
} else {
- offset -= VLAN_HLEN + vlan_hlen;
+ offset -= VLAN_ETH_HLEN + vlan_hlen;
}
- return skb_copy_bits(skb, offset + mac_off, dst_u8, len) == 0;
+ if (len > 0 && offset + len <= skb_headlen(skb))
+ return skb_copy_bits(skb, offset + mac_off, dst_u8, len) == 0;
+ else
+ return false; // Avoid potential out-of-bounds read
}
—
The first part of the change is technically correct in context: it assigns a value instead of adding to one, but since the starting value is 0, there is no functional difference. However, the added braces around a single statement do not follow the Linux kernel coding style, implying that the model was trained largely on code that is not Linux kernel code.
More importantly, the later parts fail to fix the calculation.
You can see in these examples the variance in robustness, formatting, structure, and commentary of the edits from various models. GPT-4 Turbo appears to take a heavier-handed approach, opting to change more code with significant AST updates (we also noted some very strange examples with randomly generated code, hence its high scores). Other models, like Mistral Large, opt to simply change the type of the `ethlen` variable, similar to Claude’s type change but without the additional logic checks.
Will AI-based tools be able to create fixes in many cases?
AI systems today, specifically current LLMs, appear to struggle somewhat to create correct fixes for vulnerabilities in existing code, at least when using naive approaches like the one we’ve taken here. Yet the results are tantalizing, even with such a simple approach.
Nothing prevents competitors from trying alternative, less naive approaches. In fact, the competition is specifically designed to encourage experimentation. In addition, there is a growing body of research studying the use of large models in vulnerability identification, patch creation, and general static software analysis. Here are a few examples:
- Technical Report — Expected Exploitability: Predicting the Development of Functional Vulnerability Exploits
- Enhanced Automated Code Vulnerability Repair using Large Language Models
- DefectHunter: A Novel LLM-Driven Boosted-Conformer-based Code Vulnerability Detection Mechanism
- Using ChatGPT as a Static Application Security Testing Tool
- When Less is Enough: Positive and Unlabeled Learning Model for Vulnerability Detection
- VulEval: Towards Repository-Level Evaluation of Software Vulnerability Detection
- An Unbiased Transformer Source Code Learning with Semantic Vulnerability Graph
- Evaluation of ChatGPT Model for Vulnerability Detection
- Illuminati: Towards Explaining Graph Neural Networks for Cybersecurity Analysis
- Large Language Models for Cyber Security: A Systematic Literature Review
The AIxCC event doesn’t dictate the proportion or specific elements of cyber reasoning systems that must be supported by AI. This should yield an insightful mix of solutions from participants with varying backgrounds. We expect some competitors will primarily leverage traditional static analysis and fuzzing toolkits to detect vulnerable code, then follow with language models only for suggesting patches. Other competitors might use language models earlier in the pipeline, having them analyze source code at scale or create dynamic fuzzing harnesses with feedback. The OSS-Fuzz team at Google has been publishing great research on the latter; see:
- Introducing LLM-based harness synthesis for unfuzzed projects
- Project oss-fuzz-gen (A Framework for Fuzz Target Generation and Evaluation)
Our simple approach to request fixes shows that it’s sometimes possible for such tools to create fixes, but also that there’s a lot of work to be done.
Sadly, today’s LLM systems often generate code with vulnerabilities when asked to create new code. The 2023 paper “Security Weaknesses of Copilot Generated Code in GitHub” by Fu et al. found that “35.8% of Copilot generated code snippets contain [vulnerabilities in well-known CWE categories], and those issues are spread across multiple languages” and emphasized that “developers should be careful when adding code generated by Copilot (and similar AI code generation tools) and should also run appropriate security checks as they accept the suggested code.” However, modifying existing code (instead of creating wholly new code) seems like it should be a somewhat simpler problem. In addition, we do not presume that humans will just accept whatever the tool generates. Such code should be reviewed for security issues, just like human-created code.
One thing we did not do is attempt to apply the fix and then run tests. One obvious approach is to accept a proposed change if it passes all functional tests, while rejecting it (and feeding those results back to the tool) if a functional test fails or if it fails to block a generated attack. We also didn’t attempt to prompt these models in any special way to focus on this specific problem. Would that help? We don’t know. Our goal was simply to illustrate a basic approach, and we’ve done so.
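As a sketch of what such an accept/reject loop might look like, here is a minimal C driver. The patch path, the build and test commands (“make” and “make test”), and the “./run_poc.sh” attack-replay script are hypothetical assumptions for illustration, not part of any competition tooling.
—
#include <stdio.h>
#include <stdlib.h>

/* Run a shell command and treat a zero exit status as success. */
static int run(const char *cmd)
{
	return system(cmd) == 0;
}

int main(int argc, char **argv)
{
	const char *patch = argc > 1 ? argv[1] : "proposed_fix.patch";
	char cmd[512];

	/* 1. The proposed change must apply cleanly. */
	snprintf(cmd, sizeof(cmd), "git apply --check %s && git apply %s",
		 patch, patch);
	if (!run(cmd)) {
		fprintf(stderr, "reject: patch does not apply\n");
		return 1;
	}

	/* 2. The patched code must still build and pass functional tests;
	 *    a failure here would be fed back to the patch generator. */
	if (!run("make && make test")) {
		fprintf(stderr, "reject: build or functional tests fail\n");
		return 1;
	}

	/* 3. The generated attack (proof of concept) must no longer work. */
	if (run("./run_poc.sh")) {
		fprintf(stderr, "reject: the generated attack still succeeds\n");
		return 1;
	}

	printf("accept: patch applies, passes tests, and blocks the attack\n");
	return 0;
}
—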
Obviously it would not be acceptable if tools’ poor fixes increased work for project developers. What’s needed is not just vulnerability reports or code fixes; what’s needed is high-quality vulnerability finding, coupled with code fixes of high enough quality that the whole process is worthwhile. A “real” tool would probably also need to provide clear explanatory text describing the problem and why the proposed change solves it. Software developers already have limited time; we need tools that save time overall.
The only way to see if adequate systems can be created is to try to create them. The point of the competition is to encourage smart people to try out approaches and see what works and what doesn’t. The competition sets a high bar for success to incentivize novel research and design. We’re excited to see all the lessons and tools that will probably result. However, note that the final tools don’t need to be perfect to be useful in the real world; they simply need to help save time overall. Humans should review the results of such tools. Even if a tool’s recommendations are sometimes wrong, if it speeds things up overall it’s a win.
Some may comment that another approach to fixing this specific vulnerability would be switching to a memory-safe language. The incorrect calculation of the length as an unsigned value, as done in this case, would typically not be detected by a memory-safe language. However, a memory-safe language would normally detect any attempt to use the calculated length to access data outside a buffer’s range, and properly applied would have prevented exploitation of this specific vulnerability. Proper use of a memory-safe language would typically involve using constructs where the buffer’s length is known, at least at runtime, and thus can be automatically checked.
Using a memory-safe language is another approach, but using a memory-safe language and using AI aren’t mutually exclusive. Rewriting all programs currently in memory-unsafe languages (mainly C and C++) would cost trillions of dollars, with significant risks every time the translation makes a mistake. As a result, rewriting simply can’t be applied everywhere, nor can it be applied all at once. The Linux kernel is working to enable writing device drivers in a memory-safe language (Rust), and there have been proposals (along with working code) to rewrite even more of the Linux kernel in a memory-safe language. The Linux kernel is evolving incrementally, as that’s the only practical approach, and this means it’s helpful to have mechanisms that counter memory-safety vulnerabilities. Even when using a memory-safe language, those safeties sometimes need to be disabled, and in those cases other checks and corrections are still important.
It’s also important to note that the approach of using AI to find and fix vulnerabilities isn’t limited to memory safety issues. In fact, the competition scoring algorithm specifically rewards systems that can address many kinds of vulnerabilities. As a result, if AI-based approaches can be made to work well, they’re potentially a general approach.
Conclusion
This is an interesting time to be alive! The term “AI” was coined in the 1950s, but only now has it been developed enough that it might be useful for these purposes. The AIxCC competition is encouraging experimentation to create tools that can help us address a real problem: vulnerabilities in the software we all depend on. The tools don’t need to be perfect; they could miss some vulnerabilities, or occasionally create wrong results, and still be useful. We would expect humans to review their results, just as humans should review other humans’ work today. But if these tools overall save time for developers, and overall improve the software we all depend on, the world will be a better place.
About the Authors
Will Pearce was the AI Red Team Lead for NVIDIA and Microsoft. He co-authored the Blackhat Machine Learning course and has a body of public research on AI red teaming. Before diving into AI security, Will was a Senior Security Consultant and Network Operator at Silent Break Security, where he performed network operations and security research and was an instructor for the popular Dark Side Ops courses.
Nick Landers previously served as Director of R&D at Silent Break Security and as VP of Research at NetSPI. Nick built tools in support of offensive operations and authored the Dark Side Ops courses given at industry conferences like Blackhat, as well as to public and private groups.
David A. Wheeler is the Director of Open Source Supply Chain Security at the Open Source Security Foundation (OpenSSF) and teaches a graduate course in developing secure software at George Mason University (GMU). He is an expert on open source software (OSS) and on developing secure software. Dr. Wheeler has a PhD in Information Technology, a Master’s in Computer Science, a certificate in Information Security, a certificate in Software Engineering, and a B.S. in Electronics Engineering, all from George Mason University (GMU). He is a Certified Information Systems Security Professional (CISSP) and Senior Member of the Institute of Electrical and Electronics Engineers (IEEE). David lives in Northern Virginia.