What’s in the SOSS? Podcast #52 – S3E4 AIxCC Part 2 – From Skeptics to Believers: How Team Atlanta Won AIxCC by Combining Traditional Security with LLMs

By Podcast

Summary

In this 2nd episode in our series on DARPA’s AI Cyber Challenge (AIxCC), CRob sits down with Professor Taesoo Kim from Georgia Tech to discuss Team Atlanta’s journey to victory. Kim shares how his team – comprised of academics, world-class hackers, and Samsung engineers – initially skeptical of AI tools, underwent a complete mindset shift during the competition. He shares how they successfully augmented traditional security techniques like fuzzing and symbolic execution with LLM capabilities to find vulnerabilities in large-scale open source projects. Kim also reveals exciting post-competition developments, including commercialization efforts in smart contract auditing and plans to make their winning CRS accessible to the broader security community through integration with OSS-Fuzz.

This episode is part 2 of a four-part series on AIxCC.

Conversation Highlights

00:00 – Introduction
00:37 – Team Atlanta’s Background and Competition Strategy
03:43 – The Key to Victory: Combining Traditional and Modern Techniques
05:22 – Proof of Vulnerability vs. Finding Bugs
06:55 – The Mindset Shift: From AI Skeptics to Believers
09:46 – Overcoming Scalability Challenges with LLMs
10:53 – Post-Competition Plans and Commercialization
12:25 – Smart Contract Auditing Applications
14:20 – Making the CRS Accessible to the Community
16:32 – Student Experience and Research Impact
20:18 – Getting Started: Contributing to the Open Source CRS
22:25 – Real-World Adoption and Industry Impact
24:54 – The Future of AI-Powered Security Competitions

Transcript

Intro music & intro clip (00:00)

CRob (00:10.032)
All right, I’m very excited to talk to our next guest. I have Taesoo Kim, who is a professor down at Georgia Tech, also works with Samsung. And he got the great opportunity to help shepherd Team Atlanta to victory in the AIxCC competition. Thank you for joining us. It’s a real pleasure to meet you.

Taesoo Kim (00:35.064)
Thank you for having me.

CRob (00:37.766)
So we were doing a bunch of conversations around the competition. I really want to showcase like the amazing early cutting edge work that you and the team have put together. So maybe, can you tell us what was your team’s approach? What was your strategy as you were kind of approaching the competition?

Taesoo Kim (00:59.858)
That’s a great question. Let me start with a little bit of background.

CRob (00:)
Please.

Taesoo Kim (00:59)
Our team, Team Atlanta, is a group of people from various backgrounds, including me as an academic and researcher in the security area. We also have world-class hackers on our team and some engineers from Samsung as well. So we have a bit of background in various areas, so that we bring our expertise

Taesoo Kim (01:29.176)
to compete in this competition. It’s a two-year journey. We put in a lot of effort, not just on the engineering side; we also tinkered with a lot of research approaches that we’ve been working on in this area for a while. That said, I think the most important strategy that our team took is that, although it’s an AI competition…

CRob (01:51.59)
Mm-hmm.

Taesoo Kim (01:58.966)
…meaning that they promote the adoption of LLM-like techniques, we didn’t simply give up on the traditional analysis techniques that we are familiar with. It means we put a lot of effort into improving them: fuzzing, which is one of the great dynamic testing techniques for finding vulnerabilities, and also traditional techniques like symbolic execution and concolic execution, even directed fuzzing. We suffer from a lot of scalability issues in those tools, though, because one of the themes of AIxCC is to find bugs in real-world, large-scale open source projects.

It means most of the traditional techniques do not scale to that level. We can analyze one function or a small amount of code in a source code repository, but when it comes to, for example, Linux or Nginx, this is a crazy amount of source code. Building a whole graph for such a gigantic repository is itself extremely hard. So we started augmenting our pipeline with LLMs.

One of the great examples is in fuzzing. When we are mutating input, although we leverage a lot of mutation techniques on the fuzzing side, we also leverage the LLM’s understanding, in a way that the LLM also navigates the possible places to mutate in the source code, so that it can generate dictionaries, providing vocabulary for the fuzzer, and recognize the input format that it has to mutate as well. So a lot of LLM augmentation happens all over the place in the traditional software analysis techniques that we are doing.
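As an illustration of the dictionary idea, the following Python sketch turns a list of tokens into the quoted-string dictionary format that fuzzers such as AFL and libFuzzer accept. The token list stands in for what an LLM might suggest after reading a parser’s source; it is hard-coded here, and the function names are hypothetical, not part of Team Atlanta’s CRS.

```python
def escape_token(token: bytes) -> str:
    """Escape one token for an AFL/libFuzzer-style dictionary entry."""
    out = []
    for b in token:
        ch = chr(b)
        if ch in ('"', "\\"):
            out.append("\\" + ch)          # quote and backslash need a backslash
        elif 0x20 <= b < 0x7F:
            out.append(ch)                 # printable ASCII is kept as-is
        else:
            out.append(f"\\x{b:02X}")      # everything else becomes \xNN
    return "".join(out)


def build_dictionary(tokens) -> str:
    """Return dictionary file contents: one quoted token per line, no duplicates."""
    seen = set()
    lines = []
    for tok in tokens:
        if not tok or tok in seen:
            continue
        seen.add(tok)
        lines.append(f'"{escape_token(tok)}"')
    return "\n".join(lines) + "\n"


# Assumption: tokens an LLM might propose for an HTTP-like parser.
llm_suggested = [b"GET ", b"HTTP/1.1\r\n", b"Host: ", b"GET "]
dictionary_text = build_dictionary(llm_suggested)
print(dictionary_text, end="")
```

The resulting text could be saved to a file and passed to a fuzzer through its dictionary option, so that mutations splice in format-aware keywords instead of relying on random bytes alone.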

CRob (03:43.332)
And do you feel that combination of using some of the newer techniques with fuzzing and some of the older, more traditional techniques was what was kind of unique and helped push you over the victory line in the cyber reasoning challenge?

Taesoo Kim (04:01.26)
It’s extremely hard to say which one contributed the most during the competition. But I want to emphasize that finding the location of bugs in the source code versus formulating an input that triggers those vulnerabilities, what in our competition we call a proof of vulnerability: these two tasks are completely different. You can identify many bugs.

But unfortunately, in order to say this is truly a bug, you have to prove it yourself by showing or constructing the input that triggers the vulnerability. I would say people do not comprehend the difference in difficulty between formulating an input and finding a vulnerability in the source code. You can pinpoint various places in the source code without much difficulty.

But in fact, that’s the easier job. In practice, the more difficult challenge is identifying the input that actually reaches the place that you like and triggers the vulnerability as a result. So we spent much more time on how to construct the input correctly, to show that we really prove the existence of the vulnerability in the source.
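The gap between pointing at a suspicious line and proving it can be shown with a toy example. The Python sketch below uses a made-up parser with a planted out-of-bounds read; an `IndexError` stands in for the sanitizer crash a real target would produce. All names are hypothetical and the code is not taken from the competition.

```python
def parse_record(data: bytes) -> int:
    """Toy target with a planted out-of-bounds read (hypothetical)."""
    if len(data) < 4 or data[:3] != b"REC":
        return -1                      # rejected before the buggy code is reached
    declared_len = data[3]
    payload = data[4:]
    if declared_len == 0:
        return 0
    # Bug: declared_len is trusted without checking it against len(payload).
    return payload[declared_len - 1]


def triggers_crash(target, data: bytes) -> bool:
    """Run the target on one input and report whether it crashed."""
    try:
        target(data)
    except IndexError:
        return True
    return False


# Knowing where the bug is does not help unless the input gets there.
print(triggers_crash(parse_record, b"hello"))        # wrong magic, never reaches the bug
print(triggers_crash(parse_record, b"REC\x02ab"))    # reaches the read, length is honest
print(triggers_crash(parse_record, b"REC\x05ab"))    # reaches the read and overruns it
```

Only the third input counts as a proof of vulnerability in this sketch: it passes the format checks and also makes the faulty read go out of bounds.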

CRob (05:09.692)
Mm-hmm.

CRob (05:22.94)
And I think that’s really a key to the competition as it happened versus just someone running LLM and scanners kind of randomly on the internet is the fact that you all were incented to and required to develop that fix and actually prove that these things are really vulnerable and accessible.

Taesoo Kim (05:33.718)
Exactly.

Taesoo Kim (05:42.356)
Exactly. That also highlights what practitioners care about. You end up having so many false positives in security tools. No one cares. There are a lot of complaints about why we are not using security tools in the first place. So this is one of the important criteria of the competition. One of the strengths of traditional tools like fuzzers and concolic executors is that everything centers around reducing the false positives. The reason people

CRob (05:46.192)
Yes.

Taesoo Kim (06:12.258)
take fuzzers into their workflow: whenever a fuzzer says there is a vulnerability, there is indeed a vulnerability. There’s a huge difference. So we started with those existing tools and recognized the places that we have to improve, so that we can really scale up those traditional tools to find vulnerabilities in this large-scale software.

CRob (06:36.568)
Awesome. As you know, the competition was a marathon, not a sprint. So you were doing this for quite some time. But as the competition was progressing, was there anything that surprised you in the team and kind of changed your thinking about the capabilities of these tools?

Taesoo Kim (06:51.502)
Ha

Taesoo Kim (06:55.704)
So as I mentioned before, we are hackers. We won DEF CON CTF many times and we also won F1 competition in the past. So by nature, we were extremely skeptical about AI tools at the beginning of the competition. Two years ago, we evaluated every single existing LLM service with the benchmark that we designed. We realized they were all not usable at all,

CRob (07:09.85)
Mm-hmm.

Taesoo Kim (07:24.33)
not appropriate for the competition. Instead of spending time on improving those tools, which we felt were inferior at the beginning, our team’s motto at that time was: don’t touch those areas. We’re going to show you how powerful these traditional techniques are. That’s how we progressed to the semifinal. We did pretty well. We found many of the bugs by using all the traditional tools that we’ve been working on. But like…

Immediately after the semifinal, everything changed. We reevaluated the possibility of adopting LLMs. Before that, if you just removed or obfuscated some of the tokens in the repository, the LLM couldn’t reason about anything. But suddenly, near or around the semifinal, something happened. We realized that even after we inject or

If you think of it this way, there is a token, and you replace this token with meaningless words. Previously, the LLMs were all confused by these synthetic structures in the source code, but now, around the semifinal, they really understand. Although we tried to fool them many times, they really caught the idea, on source code that they never saw before and that was never used in training, because we intentionally created this source code for the evaluation.
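A minimal sketch of that kind of token replacement, assuming the benchmark simply renamed identifiers in C-like code (the transcript does not say exactly how it was done). The keyword list is deliberately small and the helper name is hypothetical; a real harness would use a proper parser so that string literals and comments are left alone.

```python
import re

# Small, incomplete keyword set: just enough for the example below.
C_KEYWORDS = {
    "int", "char", "void", "unsigned", "const", "struct",
    "if", "else", "for", "while", "return", "sizeof",
}


def obfuscate_identifiers(source: str, keep=C_KEYWORDS):
    """Replace every non-keyword identifier with a meaningless name (v0, v1, ...)."""
    mapping = {}

    def rename(match):
        name = match.group(0)
        if name in keep:
            return name
        if name not in mapping:
            mapping[name] = f"v{len(mapping)}"
        return mapping[name]

    # Identifiers start with a letter or underscore, so numeric literals are untouched.
    obfuscated = re.sub(r"\b[A-Za-z_]\w*\b", rename, source)
    return obfuscated, mapping


original = "int copy_name(char *dst, char *src) { return strlen(src); }"
obfuscated, mapping = obfuscate_identifiers(original)
print(obfuscated)
```

The renaming is consistent, so the program structure is preserved while every naming hint is gone. A model that still explains the obfuscated function correctly is reasoning about the code rather than matching familiar names.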

We started realizing that they actually understand. It shocked everybody. So we started realizing that, if that’s the case, there are so many places that we can improve. Right? So that’s the moment that we changed our mindset. So now everything is about LLMs, everything is about the new agent architectures, so that we ended up putting a humongous amount of effort into creating the various agent architectures and designs that we have.

Also, we replaced some of the software analysis techniques with LLMs as well, surprisingly. Symbolic execution is a good example. It’s extremely hard to scale. You execute one instruction at a time, and you have to create constraints around each of them. But one of the big challenges in real-world software is that there are so many, I would say, hard-to-analyze functions. Meaning that, for example, there is a

Taesoo Kim (09:46.026)
Take NGINX as an example: we thought that they probably compare strings directly, one string at a time. But the way they perform string comparison in NGINX, they map the string, or do the hashing, so that they can compare the hash values. Fuzzers and symbolic executors are extremely bad at those. If you hit one hashing function, you’re screwed. There are so many constraints that there is no way we can invert them, by definition.

There’s no way. But if you think about how to overcome these situations by using an LLM, the LLM can recognize that this is a hashing function. We don’t actually have to create constraints around it. Hey, what about replacing it with an identity function? That’s something we can easily invert by using symbolic execution. So then we started recognizing the possible role of LLMs in symbolic execution. Now we see that

symbolic execution can scale to large software right now. So I think this is a pretty amazing outcome of the competition.
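The identity-function trick can be illustrated without a real symbolic executor. In the Python sketch below, a branch is guarded by a hash comparison; modelling the hash as the identity turns the path constraint `hash(x) == hash(c)` into `x == c`, which is trivial to solve, and the candidate is then re-checked against the real hash. The target, the use of SHA-256, and all function names are assumptions made for illustration, not NGINX’s or Team Atlanta’s code.

```python
import hashlib


def toy_hash(data: bytes) -> bytes:
    """Stand-in for a hard-to-invert hashing routine in the target."""
    return hashlib.sha256(data).digest()


def guarded_branch(data: bytes) -> bool:
    """Target logic: the interesting code runs only if the hashes match."""
    return toy_hash(data) == toy_hash(b"keep-alive")


def solve_with_identity_model(compared_constant: bytes) -> bytes:
    """Solve hash(x) == hash(c) under the model hash(x) := x.

    With the hash replaced by the identity, the constraint simplifies to
    x == c, so the solution is the constant itself.
    """
    return compared_constant


# Assumption: the analysis (or an LLM reading the code) has identified
# toy_hash as a hash function and b"keep-alive" as the compared constant.
candidate = solve_with_identity_model(b"keep-alive")

# The model is only a simplification, so the candidate is validated by
# running the real code with the real hash.
print(guarded_branch(candidate))
```

The simplification is safe for this pattern because equal inputs always produce equal hashes; the concrete re-run catches any case where the model and the real function disagree.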

CRob (10:53.11)
Awesome. So again, the competition completed in August. So what plans do you have? What plans does the team have for your CRS now that the competition’s over?

Taesoo Kim (10:58.446)
Thank

Taesoo Kim (11:02.318)
I think that’s a great question. Many tech companies have approached our team. Some of our members recently joined other big companies. And many of our students want to quit the PhD program and start a company. For good reasons, right?

CRob (11:14.848)
I bet.

Taesoo Kim (11:32.766)
One team, four of my PhD students, recently formed and is looking for commercialization opportunities. Not in the traditional cyber infrastructure we were looking at through DARPA, but they spotted a possibility in smart contracts: smart contracts and modernized financial industries like stablecoins and whatnot,

where they can apply AIxCC-like techniques to finding vulnerabilities in those areas. So instead of analyzing everything with human auditors, you can analyze everything by using LLMs or agents and similar techniques that we developed for AIxCC, so that you can reduce the auditing time significantly. To get an audit of a smart contract, traditionally you have to wait for two weeks.

In the worst case, even months, with a ridiculous amount of cost. Typically, one audit of a smart contract costs $20,000 or $50,000 per case. But in fact, you can reduce the auditing time down to, I’ll say, a few hours or a day. The potential benefit of achieving this speed is that you really open up

CRob (12:40.454)
Mm-hmm.

CRob (12:47.836)
Wow.

Taesoo Kim (12:58.186)
amazing opportunity in this area. So you can automate the auditing, you can increase the frequency of auditing in the smart contract area. Not only that, we thought there is a possibility for even more, like compliance checking of smart contracts. There are so many opportunities that we can pursue immediately by using AIxCC systems. That’s the one area that we’re looking at. Another one is a more traditional area,

CRob (13:00.347)
Mm-hmm.

Taesoo Kim (13:25.07)
what we call cyber infrastructure, like hospitals and some government sectors. They really want to analyze, but unfortunately, or fortunately though, there are other opportunities: in AIxCC, we analyze everything from source code, but they don’t have access to it. So we are creating a pipeline that, given a binary or execution-only environment, figures out how to convert them

CRob (13:28.828)
Mm-hmm.

CRob (13:38.236)
Mm-hmm.

Taesoo Kim (13:52.416)
in a way that we can still leverage the existing infrastructure that we have for AIxCC. More interestingly, they don’t have access to the internet when they’re doing pen testing or analyzing those, so we started incorporating some open source models as part of our systems. These are two commercialization efforts that we’re thinking about, and many of my students are currently

CRob (13:57.67)
That’s very clever.

CRob (14:05.5)
Yeah.

CRob (14:13.564)
It’s awesome.

CRob (14:20.366)
And I imagine that this is probably amazing source material for dissertations and the PhD work, right?

Taesoo Kim (14:29.242)
Yes, yes. For the last two years, we were purely focused on AIxCC. Our motto was that we don’t have time for publication. Just win the competition. Everything comes after. This is the moment that we actually, I think we’re going to release our tech report. It’s over 150 pages. Next week, around next week. So we have a draft right now, but we are still polishing it

CRob (14:39.256)
Yeah.

CRob (14:51.94)
Wow.

Taesoo Kim (14:58.51)
for publication, so that other people get not just the source code. OK, that’s great, but you need some explanation of why you did this. Much of the source is for the competition, right? So the core pieces might be a little bit different for the daily usage of normal developers and operators. So we are creating condensed technical material for them to understand.

Not only that, we have a plan to make it more accessible, meaning that currently our CRS implementation is tightly bound to the competition environment. Meaning that we had a crazy amount of resources on the Azure side, where everything is deployed and battle-tested. But unfortunately, most people, including ourselves, don’t have those resources. The competition had about 80,000 in cloud credits that we had to use. No one has that kind of resource, not if you’re not a company. But we want you to be able to apply this to your project at a smaller scale. That’s what we are currently working on: discarding all these competition-dependent parameters from the source code and making it more self-contained, so that you can even launch our CRS in your local environment.

This is one of the big, big development effort that we are doing right now in our lab.

CRob (16:32.155)
That’s awesome. Take a second and think about this from the perspective of the students that participated. What kind of an experience was it getting to work with professors such as yourself and then actual professional researchers and hackers? What do you see the students taking away from this experience?

Taesoo Kim (16:53.846)
I think it is the exposure to the latest models. Because we were tightly collaborating with OpenAI and Gemini, we were really exposed to those latest models. If you’re just working on security, not working closely with LLMs, you probably don’t appreciate that much. But through the competition, everyone’s mindset changed. And then we spent time

and took a deep look at what’s possible and what’s not, so we now have a great sense of what type of problems we have to solve, even on the research side. And now, suddenly, after this competition, every single security project, all the security research that we are doing at Georgia Tech, is based on LLMs. Even more surprising to hear: we have some decompilation projects that we are doing, the most traditional security research you can think of:

CRob (17:42.448)
Ha ha.

Taesoo Kim (17:52.162)
binary analysis, malware analysis, decompilation, crash analysis, whatnot. Now everything is LLM. Now we realize LLMs are much better at decompiling than traditional tools like IDA and Ghidra. So I think these are the types of research that we previously thought impossible. We were probably not even thinking about applying LLMs. Because we spent our lifetime working on decompiling.

CRob (17:53.68)
Mm.

CRob (17:59.068)
Yeah.

Taesoo Kim (18:22.318)
But at a certain point, we realized that the LLM is just doing better than what we’ve been working on. Just one day. It’s a complete mind change. From the traditional program analysis perspective, many things are NP-complete. There’s no way you can solve them in an easy way. So we’re not spending time on them. That’s our typical mindset. But now, it works in practice, amazingly.

CRob (18:29.574)
Yeah.

Taesoo Kim (18:51.807)
how to improve what we thought previously impossible by using another one. It’s the key.

CRob (18:57.404)
That’s awesome. It’s interesting, especially since you stated initially when you went into the competition, you were very skeptical about the utility of LLMs. So that’s great that you had this complete reversal.

Taesoo Kim (19:04.238)
Thank

Yeah, but I think I like to emphasize one of the problems of LLM though, it’s expensive, it’s slow in traditional sense, you have to wait a few seconds or a few minutes in certain cases like reasoning model or whatnot. So tightly binding your performance with this performance lagging component in the entire systems is often challenging.

CRob (19:17.648)
Yes.

CRob (19:21.82)
Mm-hmm.

Taesoo Kim (19:39.598)
and then just talking. But another benefit of everything is text. There’s no proper API, just text. There’s no sophisticated way to leverage it, just text. I don’t know, you’re probably familiar with all these security issues, potentially with unstructured input. It’s similar to cross-site scripting in the web space. There’s so many problems you can imagine.

CRob (19:51.984)
Okay, yeah.

CRob (20:01.979)
Mm-hmm.

Taesoo Kim (20:08.11)
But as far as you can use in a well-contained manner in the right way, we believe there are so many opportunities we can get from it.

CRob (20:18.876)
Great. So now that your CRS has been released as open source, if someone from our community was interested in joining and maybe contributing to that, what’s the best way somebody could get started and get access?

Taesoo Kim (20:28.494)
Mm-hmm.

So we’re going to release a non-competition version very soon, along with several documents, in what we call the standardization effort that we and other teams are doing right now. So we are defining a non-competition CRS interface, so that, as long as you implement that interface, our goal is to bring it into mainstream OSS-Fuzz together with the Google team,

CRob (20:36.369)
Mm-hmm.

CRob (20:58.524)
Mm-hmm.

Taesoo Kim (20:59.086)
so that you can put your CRS in as part of mainstream OSS-Fuzz, so that we can make it much easier, so that everyone can evaluate one at a time in their local environment as part of the OSS-Fuzz project. So we’re gonna release the RFC document pretty soon through our website, so that everyone can participate and share their opinion on what features they think we are missing. We’d love to hear about that.

CRob (21:03.74)
Thanks.

CRob (21:18.001)
Mm-hmm.

Taesoo Kim (21:26.502)
And then after that, a month period, we’re going to release our local version so that everyone can start using. And with a very permissive license, everyone can take advantage of the public research, including companies.

CRob (21:34.78)
Awesome.

CRob (21:42.692)
It’s, I’m just amazed. When I came into this, partnering with our friends at DARPA, I was initially skeptical as well. And as I was sitting there watching the finals announced, it was just amazing, the innovation and creativity that all the different teams displayed. Again, congratulations to your team, all the students and the researchers and everyone that participated.

Taesoo Kim (21:59.79)
Mm-hmm.

CRob (22:12.6)
Well done. Do you have any parting thoughts? You know, as we move on, do you have any kind of words of wisdom you want to share with the community or any takeaways for people curious to get in this space?

Taesoo Kim (22:25.486)
Oh, regarding commercialization, one thing I’d also like to mention is that at Samsung, we already took the open source version of the CRS and started applying it to internal projects and open source Samsung projects immediately after. So we started seeing the benefit of applying the CRS in the real world immediately after the competition. A lot of people think that a competition is just for competition or show.

CRob (22:38.108)
Mm-hmm.

Taesoo Kim (22:55.032)
But in fact, it’s not. Everyone in industry, including Anthropic, Meta, and OpenAI, they all want to adopt those technologies behind the scenes. And Amazon: we are also working together with the Amazon AWS team, as they want to support the deployment of our systems in the AWS environment as well, so that everyone can launch the systems with just one click. And they mentioned there are several.

CRob (22:55.036)
Mm-hmm.

Taesoo Kim (23:24.023)
government-backed They explicitly request to launch our CRS in their environment.

CRob (23:31.1)
I imagine so. Well, again, kudos to the team. Congratulations. It’s amazing. I love to see when researchers have these amazing creative ideas and actually are able to add actual value. And it’s great to hear that Samsung was immediately able to start to get value out of this work. And hopefully other folks will do the same.

Taesoo Kim (23:55.18)
Yeah, exactly. I think, regarding words of wisdom or general advice, it is this competition-based innovation, particularly in academia or in environments like startups. Because of this venue, people including ourselves, startup people, and other team members put their life

on this competition. It’s an objective metric, head-to-head competition. We don’t care about your background. Just win, right? There’s your objective score. Your job is to find and fix. I think this competition really drove a lot of effort behind the scenes in our team. We were motivated because this entire competition is presented to a broader audience. I think this is really a way to drive innovation,

CRob (24:26.46)
Mm-hmm.

CRob (24:32.57)
Yes.

CRob (24:36.709)
Mm-hmm.

Taesoo Kim (24:54.904)
to get some public attention beyond Alphi as well. So I think we really want to see other types of competition in this space. And in the longer future, you will probably see, based on the current trend, CTF competitions like that, maybe not just CTF, but agent-based CTF, with no humans involved, where the agents are attacking each other and solving CTF challenges.

CRob (24:58.524)
Excellent.

CRob (25:19.59)
Mm-hmm.

Taesoo Kim (25:24.846)
This is not a five-year no-vote. It’s going to happen in two years or shortly. Even in this year’s live CTF, one of the teams actually leveraged agent systems, and the agents actually solved the competition quicker than humans. So I think we’re going to see those types of events and breakthroughs more often than

CRob (25:55.292)
I used to be a judge at the collegiate cyber competition for one of our local schools. And I think I see a lot of interesting applicability kind of using this as to help them to teach the students that you have an aggressive attacker is doing these different techniques and it’s able to kind of apply some of these learnings that you all have. It’s really exciting stuff.

Taesoo Kim (26:00.142)
Mm-hmm.

Taesoo Kim (26:15.47)
I think there is an interesting quote, I don’t know who actually said it, but in the AI space, someone mentioned that a one-person, one-billion market cap company will appear because of LLMs, or because of AI in general. But if you look at CTF, currently most teams have a minimum of 50 or 100 people competing against each other. We’re going to see very soon

one person or maybe five people competing with the help of those AI tools. Or humans just assisting the AI, in a way that, hey, could you bring up the Raspberry Pi for me or set it up, so that the human is just helping the LLM, or helping AI in general, so that the AI can compete. So I think we’re going to see some interesting things happening pretty soon in our company for sure.

CRob (26:59.088)
Mm-hmm. Yeah.

CRob (27:11.804)
I agree. Well, again, Taesoo, thank you for your time. Congratulations to the team. And that is a wrap. Thank you very much.

Taesoo Kim (27:22.147)
Thank you so much.

What’s in the SOSS? Podcast #51 – S3E3 AIxCC Part 1 – From Skepticism to Success: The AI Cyber Challenge (AIxCC) with Andrew Carney

By Podcast

Summary

This episode of What’s in the SOSS features Andrew Carney from DARPA and ARPA-H, discussing the groundbreaking AI Cyber Challenge (AIxCC). The competition was designed to create autonomous systems capable of finding and patching vulnerabilities in open source software, a crucial effort given the pervasive nature of open source in the tech ecosystem. Carney shares insights into the two-year journey, highlighting the initial skepticism from experts that ultimately turned into belief, and reveals the surprising efficiency of the competing teams, who collectively found over 80% of inserted vulnerabilities and patched nearly 70%, with remarkably low compute costs. The discussion concludes with a look at the next steps: integrating these cyber reasoning systems into the open source community to support maintainers and supercharge automated patching in development workflows.

This episode is part 1 of a four-part series on AIxCC.

Conversation Highlights

00:00 – Introduction and Guest Welcome
00:59 – Guest Background: Andrew Carney’s Role at DARPA/ARPA-H
02:20 – Overview of the AI Cyber Challenge (AIxCC)
03:48 – Competition History and Structure
04:44 – The Value of Skepticism and Surprising Learnings
07:11 – Surprising Efficiency and Low Compute Costs
08:15 – Major Competition Highlights and Results
13:09 – What’s Next: Integrating Cyber Reasoning Systems into Open Source
16:55 – A Favorite Tale of “Robots Gone Bad”
18:37 – Call to Action and Closing Thoughts

Transcript

Intro music & intro clip (00:00)

CRob (00:23)
Welcome, welcome, welcome to What’s in the SOSS, the OpenSSF podcast where I talk to people that are in and around the amazing world of open source software, open source software security and AI security. I have a really amazing guest today, Andrew.

He was one of the leaders that helped oversee this amazing AI competition we’re going to talk to. So let me start off, Andrew, welcome to the show. Thanks for being here.

Andrew Carney (00:57)
Thank you so much for having me, CRob. Really appreciate it.

CRob (00:59)
Yeah, so maybe for our audience that might not be as familiar with you as I am, could you maybe tell us a little bit about yourself, kind of where you work and what types of problems are you trying to solve?

Andrew Carney (01:12)
Yeah, I’m a vulnerability researcher. That’s been the core of my career for the last 20 years. And part of that has had me at DARPA. And now I’m at DARPA and ARPA-H, where I sort of work on cybersecurity research problems focused on national defense and/or health care. So it’s sort of the space that I’ve been living in for the past few years.

CRob (01:28)
That’s an interesting collaboration between those two worlds.

Andrew Carney (01:43)
Yeah, it’s, you know, it’s, I think the vulnerability research and reverse engineering community is, pretty tight, you know, pretty, pretty small. And, a lot of folks across lots of different industries and sectors have similar problems that, you know, we’re able to help with. So, yeah, it’s, it’s exciting to kind of see, see how, how, you know, folks in finance or automotive industry or the energy sector kind of all deal with similar-ish problems, but different scales with different kind of flavors of concerns.

CRob (02:20)
That’s awesome. And so as I mentioned, we were introduced through the AIxCC competition. Maybe for our audience that might not be as familiar, could you maybe give us an overview of AIxCC, the competition, and kind of why you felt this effort was so important and we’ve spent so much time working through this, years.

Andrew Carney (02:42)
Absolutely. I mean, AIxCC, uh, is a competition to create autonomous systems that can find and patch vulnerabilities in source code. Uh, a big part of this competition was focusing on open source software, um, because of how critical it is kind of across our tech ecosystem. It really is sort of like the font of all software.

And so DARPA and ARPA-H and other partners across the federal government, we saw this kind of need to support the open source community and also leverage kind of new technologies on the scene like LLMs. So how do we take these new technologies and apply them in a very principled way to help solve this massive problem? And working with the Linux Foundation and OpenSSF has been a huge piece of that as well. So I really appreciate everything you guys have done throughout the competition.

CRob (03:41)
Thank you.

CRob (03:48)
And maybe could you give us just a little history of when did the competition start and kind of how it was structured?

Andrew Carney (03:54)
Yeah. So the competition was announced at Black Hat in August of 2023. The competition was structured into two main sections. We had a qualifying event at DEF CON in 2024. And then we had our final event this past DEF CON, August 2025. And throughout that two-year period, we designed a competition that kept pushing the competitors sort of ahead of wherever the current models, the current kind of agentic technologies were, whatever that bar they were setting, we continued to push the competitors past that. So it’s been a really dynamic sort of competition because that technology has continued to evolve.

CRob (04:44)
I have to say when I initially heard about the competition, I’ve been doing cybersecurity a very long time. I was very skeptical about what the results will be, not to bury, to bury the lead, so to speak. But I was very surprised with the results that you all shared with the world this summer in Las Vegas. We’ll get to that in a minute. But again, this competition went over many years and as it progressed, could you maybe share what you learned that maybe surprised you, you didn’t expect from when this all kicked off.

Andrew Carney (05:21)
Yeah, think so. I think there have been a lot of surprises along the way. And I’ll also say that, you know, skepticism, especially from, you know, informed experts is a really good sign for a DARPA challenge. So for a lot of projects at DARPA generally, you know, if you’re kind of waffling between this is insanely hard and there’s no way we’ll be successful and this is kind of a much easy, like, you know, there’s an easy solution to this. If you’re constantly in that space of uncertainty, like, no, I really think this is really, really hard. And I’m getting skepticism from people that know a lot about this space. For us, that’s fuel. That’s okay. There is, you know, there’s a question to answer here. And so that really was part of driving us, even competitors, competitors that ended up making it to finals themselves were skeptical even as they were competing.

So I love that. I love that. Like, you know, we want to try to do really hard things and, you know, criticism helps us improve. Like that’s super beneficial.

CRob (06:33)
Yeah, it was, and I’ve had the opportunity to talk with many of the teams and now we’re in the phase post-competition where we’re actually starting to figure out how to share the results with the upstream projects and how to build communities around these tools. You assembled a really amazing group of folks in these competitive teams, some super top-notch minds. Again, you made me a believer now, where I really do believe that AI does have a place and can legitimately offer some real value to the world in this space.

Andrew Carney (07:11)
Yeah, I think one of the biggest surprises for me was the efficiency. I think a lot of times, especially with DARPA programs, we expect that technical miracles will come with a pretty hefty price tag. And then you’ll have to find a way to scale down, to economize, to make that technology more useful, more widely kind of distributable.

With AIxCC, we found the teams pushing so hard on the core kind of research questions, but at the same time, sort of woven into that was using their resources efficiently. And so even the competition results themselves were pleasantly surprising in terms of the compute costs for these systems to run. We’re talking tens to hundreds of dollars per vulnerability discovered or patch emitted, which is really quite amazing.

CRob (08:15)
Yeah, so maybe could you just give me some highlights of kind of what the competition discovered, what the competitors achieved?

Andrew Carney (08:24)
Yeah. So I think when we’re trying to tackle these really challenging research questions and we’re examining it from all angles and being extremely critical of even our own approach, as well as the competitors’ approaches, that initially back in August of 2024, we had this amazing proof of life moment where the teams demonstrated with only a few hundred dollars in total compute budget that they were able to analyze large open source projects and find real issues. One of the teams found a real issue in SQLite that we had disclosed at the time to the maintainers. And they found that, once again, with this very limited compute budget across multiple millions of lines of code in these projects. So that was sort of the OK, there’s a there there, like there’s something here and we can keep pushing. So that was a really exciting moment for everyone. And then over the following year, up to August 2025, we had a series of these non-scoring events where the teams would be given challenges that looked very similar to what we’d give them for finals with an increasing level of scale and difficulty.

So you can think of these as like extreme integration events where we’re still giving the teams hundreds of thousands or millions of lines of code. We’re giving them, you know, eight to 12 hours per kind of task. And we’re seeing what they can do. This was important to ensure that the final competition went off without a hitch. And also because the models they were leveraging continue to evolve and change.

So it was really exciting. In that process, the teams found and disclosed hundreds of vulnerabilities and produced hundreds of potential patches that they would offer up to maintainers of the projects that they were doing their own internal kind of development on. So that was really exciting just to see that the SQLite bug wasn’t a fluke and that the teams could consistently kind of perform and keep pushing as we push them to move further and faster and deal with more complex code, they were able to adapt and find a way forward.

CRob (11:02)
That’s awesome. And I know you had, it was a long journey that you and the team and all the support folks went through, but is there any particular moment that kind of you smile on when you reflect on over the course of the competition?

Andrew Carney (11:20)
Oh, man, so many. I think there’s an equal number of like those smiling moments and also, you know, premature gray hairs that the team and myself have created. But I think one of the big moments, there were a number of just outstanding kind of experts in the field on social media, in talks that would, the way that they talked about kind of AI powered program analysis was very skeptical. Near the end, leading up to semi-finals, we had this lovely moment where the Google Project Zero team and the Google DeepMind teams penned a blog post that said that they were inspired by one of the teams, by the SQLite bug, by one of the team’s discoveries. And that was huge, I think both for that team and just the competition as a whole. And then after that, seeing people’s opinions change and seeing people that had held, that were, like I said, top tier experts in the field, change their perspective pretty drastically, which that was, you know, that was helpful signal for us to demonstrate that we were being successful. Like converting a critic, I think, is one of the best kind of victories that you can have. Because now they can be a collaborator, right? Like now we can still kind of spar over different perspectives or ideas, but now we’re working together. That’s very exciting.

CRob (13:09)
That’s awesome. So what’s next? The hard work of the competition is over and now we’re in kind of the after action phase where we’re trying to integrate all this great work and kind of get these projects out to the world to use. So from your perspective or from DARPA or the competition, what’s next for you?

Andrew Carney (13:29)
Yeah, so one of the biggest challenges with DARPA programs is when you’re successful, sometimes you have that technological miracle, you have that accomplishment, and maybe the world’s not entirely ready for it yet. Or maybe there’s additional development that needs to happen to get it kind of into the real world. With AIxCC, we made the competition as realistic as possible. The automated systems, these cyber reasoning systems, were being given bug reports, they’re being given patch diffs, they’re being given artifacts that we would consume and review as human developers. So we modeled all the tasks very closely to the real things that we would want these systems to do. And they demonstrated incredible kind of performance. Collectively, the teams were able to find over 80% of the vulnerabilities that we’d synthetically kind of inserted. And they patched nearly 70% of those vulnerabilities. And that patching piece is so critical. What we didn’t want to do was create systems that made open source maintainers’ lives more problematic.

CRob (14:54)
Thank you.

Andrew Carney (14:56)
We wanted to demonstrate that this is a reachable bug and here’s a candidate patch. And in the months after the competition, we’ve incentivized the teams further than just the original prize money to go out into the open source community and support open source maintainers with their tools. And we’ve had folks come back and literally in their kind of reports, document that the patch they suggested to a maintainer was nearly identical to what the maintainer actually committed. Yeah. And those reports are coming in daily. So we’re getting, we have this constant feed of engagement and the tools are still obviously being improved and developed. But it’s really exciting to see it. So when I think about what’s next is like we’re already in the what’s next like getting the technology out there, using government funding to support open source maintainers wherever we can, especially if their code is part of widely used applications or code used in critical infrastructure. So that’s where we find ourselves now. And then we’re thinking a lot about how we supercharge that effort to the…

there have been, you know, the federal government supports a lot of actively used open source projects, right? And we’ve been working with all these partner agencies across the federal government and just making sure that we’re supporting the existing programs when we find them. And then where we see a gap, kind of figuring out what it would take to fill that gap for a community that could use more support.

CRob (16:55)
So on a slightly different note, we’re both technologists and we love the field, but as I was going through this journey, kind of on the sidelines with you all, I was reflecting, do you have a favorite tale of robots gone bad? Like Terminator’s Skynet or HAL 9000 or the Butlerian Jihad?

Andrew Carney (17:22)
That’s a, you know, I think I, I’ll, I don’t know that this is my favorite, but it is one of the most recent ones that I’ve read. There’s a series called Dungeon Crawler Carl. Yeah. And it’s been really like entertaining reading. And I just think the tension between the primal AIs and the corporations that rely on said independent entities, but also are constantly trying to rein them in is, I don’t know, it’s been really interesting to see that narrative evolve.

CRob (18:08)
I’ve always enjoyed science fiction and fantasy’s ability to kind of hold a mirror up to society and kind of put these questions in a safe space where you can kind of think about 1984 and Big Brother or these other things, but it’s just in paper or on your iPad or whatever. So it’s a nice experiment over there. And we don’t want that to be happening here.

Andrew Carney (18:29)
Yes, yes. Yeah, the fiction as thought experimentation, right?

CRob (18:37)
Right, exactly. So as we wind down, do you have a particular call to action or anything you want to highlight to the audience that they should maybe investigate a little further or participate in?

Andrew Carney (18:50)
Yeah, I think so a big one is, you know, we would love for open source maintainers to reach out to us directly. AIXCC at DARPA.mil. That’s the email address that our team uses. And we’ve been looking for more maintainers to connect with so that we can make sure that if we can provide resources to them, one, that they’re right sized for the challenges that those maintainers are having, or maintainer, right? Sometimes it’s just one person. And then two, that we’re engaging with them in the way that they would prefer to be engaged with. We want to be helpful help, not unhelpful help. So that’s a big one. And then I think more generally, I would love to see more patching added into the kind of vulnerability research lifecycle. I think there’s so many opportunities for commercial and open source tools that have that discovery capability and that’s really their big selling point. And now with AIxCC and with the technology that the competitors open sourced themselves, since all of their systems were open sourced after the competition, there’s this real potential, I think that we haven’t seen it realized the way that it really could be. And so that’s, I would love to see more of that kind of automated patching added to tools and kind of development workflows.

CRob (20:29)
I’ll say my personal favorite experience out of all this is now that the competition, the minute the competition was over, then there was an ethical wall up between, you know, your administrators and us and the different competition teams. But now we’ve observed the competitors, like looking at each other’s work and asking questions to each other and collaborating. That is, I’m so super excited to see what comes next, now that all these smart people have proven themselves, and they found kind of connected spirits and they’re gonna start working together for even more amazing things.

Andrew Carney (21:07)
Absolutely. I think we’re expecting a state of knowledge paper with all the teams as authors. That’s something they’ve organized independently, to your point. And yeah, I cannot wait to see what they come out with collaboratively.

CRob (21:23)
Yeah. And for anyone that’s interested to learn more or potentially directly interact with some of these competition experts, whether they’re in academia or industry, the OpenSSF is sponsoring this as part of our AI/ML working group. We’ve created a cyber reasoning special interest group specifically for the competition, all the competitors, and just to have public discussions and collaboration around these things. And we would invite everybody to show up and listen and participate as they feel comfortable and learn.

Well, Andrew and the whole DARPA and ARPA-H team, everyone that was involved in the competition, thank you. Thank you to our competitors. And we actually are going to have a series of podcasts talking to the individual competitors, kind of learning a little bit of the unique flavors and challenges they had. But thank you for sponsoring this and kind of really delivering something I think is going to have a ton of utility and value to the ecosystem.

Andrew Carney (21:47)
Thank you for working with us on this journey and we definitely look forward to more collaboration in the future.

CRob (21:54)
Well, and with that, we’ll wrap it up. I just want to tell everybody happy open sourcing. We’ll talk to you soon.

OpenSSF at DEF CON 33: AI Cyber Challenge (AIxCC), MLSecOps, and Securing Critical Infrastructure

By Blog

By Jeff Diecks

The OpenSSF team will be attending DEF CON 33, where the winners of the AI Cyber Challenge (AIxCC) will be announced. We will also host a panel discussion at the AIxCC village to introduce the concept of MLSecOps.

AIxCC, led by DARPA and ARPA-H, is a two-year competition focused on developing AI-enabled software to automatically identify and patch vulnerabilities in source code, particularly in open source software underpinning critical infrastructure.

OpenSSF is supporting AIxCC as a challenge advisor, guiding the competition to ensure its solutions benefit the open source community. We are actively working with DARPA and ARPA-H to open source the winning systems, infrastructure, and data from the competition, and are designing a program to facilitate their successful adoption and use by open source projects. At least four of the competitors’ Cyber Reasoning Systems (CRSs) will be open sourced on Friday, August 8 at DEF CON. The remaining CRSs will also be open sourced soon after the event.

Join Our Panel: Applying DevSecOps Lessons to MLSecOps

We will be hosting a panel talk at the AIxCC Village, “Applying DevSecOps Lessons to MLSecOps.” This presentation will delve into the evolving landscape of security with the advent of AI/ML applications.

The panelists for this discussion will be:

  • Christopher “CRob” Robinson – Chief Security Architect, OpenSSF
  • Sarah Evans – Security Applied Research Program Lead, Dell Technologies
  • Eoin Wickens – Director of Threat Intelligence, HiddenLayer

Just as DevSecOps integrated security practices into the Software Development Life Cycle (SDLC) to address critical software security gaps, Machine Learning Operations (MLOps) now needs to transition into MLSecOps. MLSecOps emphasizes integrating security practices throughout the ML development lifecycle, establishing security as a shared responsibility among ML developers, security practitioners, and operations teams. When thinking about securing MLOps using lessons learned from DevSecOps, the conversation includes open source tools from OpenSSF and other initiatives, such as Supply-Chain Levels for Software Artifacts (SLSA) and Sigstore, that can be extended to MLSecOps. This talk will explore some of those tools, as well as talk about potential tooling gaps the community can partner to close. Embracing MLSecOps methodology enables early identification and mitigation of security risks, facilitating the development of secure and trustworthy ML models.

We invite you to join us on Saturday, August 9, from 10:30-11:15 a.m. at the AIxCC Village Stage to learn more about how the lessons from DevSecOps can be applied to the unique challenges of securing AI/ML systems and to understand the importance of adopting an MLSecOps approach for a more secure future in open source software.

About the Author

Jeff Diecks is the Technical Program Manager for the AI Cyber Challenge (AIxCC) at the Open Source Security Foundation (OpenSSF). A participant in open source since 1999, he’s delivered digital products and applications for dozens of universities, six professional sports leagues, state governments, global media companies, non-profits, and corporate clients.