What’s in the SOSS? Podcast #52 – S3E4 AIxCC Part 2 – From Skeptics to Believers: How Team Atlanta Won AIxCC by Combining Traditional Security with LLMs

February 9, 2026 | Podcast

Summary

In this second episode in our series on DARPA’s AI Cyber Challenge (AIxCC), CRob sits down with Professor Taesoo Kim from Georgia Tech to discuss Team Atlanta’s journey to victory. Kim shares how his team – composed of academics, world-class hackers, and Samsung engineers – went from initial skepticism of AI tools to a complete mindset shift during the competition. He explains how they successfully augmented traditional security techniques like fuzzing and symbolic execution with LLM capabilities to find vulnerabilities in large-scale open source projects. Kim also reveals exciting post-competition developments, including commercialization efforts in smart contract auditing and plans to make their winning CRS accessible to the broader security community through integration with OSS-Fuzz.

This episode is part 2 of a four-part series on AIxCC.

Conversation Highlights

00:00 – Introduction
00:37 – Team Atlanta’s Background and Competition Strategy
03:43 – The Key to Victory: Combining Traditional and Modern Techniques
05:22 – Proof of Vulnerability vs. Finding Bugs
06:55 – The Mindset Shift: From AI Skeptics to Believers
09:46 – Overcoming Scalability Challenges with LLMs
10:53 – Post-Competition Plans and Commercialization
12:25 – Smart Contract Auditing Applications
14:20 – Making the CRS Accessible to the Community
16:32 – Student Experience and Research Impact
20:18 – Getting Started: Contributing to the Open Source CRS
22:25 – Real-World Adoption and Industry Impact
24:54 – The Future of AI-Powered Security Competitions

Transcript

Intro music & intro clip (00:00)

CRob (00:10.032)
All right, I’m very excited to talk to our next guest. I have Taesoo Kim, who is a professor down at Georgia Tech and also works with Samsung. And he got the great opportunity to help shepherd Team Atlanta to victory in the AIxCC competition. Thank you for joining us. It’s a real pleasure to meet you.

Taesoo Kim (00:35.064)
Thank you for having me.

CRob (00:37.766)
So we were doing a bunch of conversations around the competition. I really want to showcase like the amazing early cutting edge work that you and the team have put together. So maybe, can you tell us what was your team’s approach? What was your strategy as you were kind of approaching the competition?

Taesoo Kim (00:59.858)
Oh, that’s a great question. Let me start with a little bit of background.

CRob
Please.

Taesoo Kim (00:59)
Our team, Team Atlanta, is a group of people from various backgrounds, including academics and researchers in the security area like myself. We also have world-class hackers on our team and some engineers from Samsung as well. So we have backgrounds in a variety of areas, and we bring that expertise

Taesoo Kim (01:29.176)
to compete in this competition. It’s been a two-year journey. We put in a lot of effort, not just on the engineering side; we also tinkered with a lot of research approaches that we’ve been working on in this area for a while. That said, I think the most important strategy our team took is that, although it’s an AI competition…

CRob (01:51.59)
Mm-hmm.

Taesoo Kim (01:58.966)
…meaning that they promote the adoption of LLM-style techniques, we didn’t simply give up on the traditional analysis techniques we are familiar with. We put a lot of effort into improving them: fuzzing, which is one of the great dynamic testing approaches for finding vulnerabilities, and also traditional techniques like symbolic execution and concolic execution, even directed fuzzing. We do suffer from a lot of scalability issues in those tools, because one of the themes of AIxCC is to find bugs in the real world,

in large-scale open source projects. It means most of the traditional techniques do not scale to that level. We can analyze one function or a small amount of code in a source repository, but when it comes to, for example, Linux or Nginx, that is a crazy amount of source code. Even building a whole graph over a gigantic repository like that is extremely hard. So we started augmenting LLMs into our pipeline.

One of the great examples is fuzzing: when we are mutating input, besides all the mutation techniques on the fuzzing side, we also leverage the LLM’s understanding of the code. The LLM can point at promising places in the source code to mutate, generate dictionaries that provide vocabulary for the fuzzer, and recognize the input format that has to be mutated as well. So these LLM augmentations happen all over the place in the traditional software analysis techniques we are using.
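
(Editor’s note: a minimal sketch of the LLM-assisted dictionary generation Kim describes above. The ask_llm helper, the prompt, and the file names are illustrative assumptions rather than Team Atlanta’s actual pipeline.)

```python
# Hypothetical helper: stands in for whatever LLM client you actually use.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model provider of choice")

def build_fuzz_dictionary(parser_source: str, out_path: str = "llm.dict") -> None:
    """Ask the model for likely keywords/magic values and emit an AFL++/libFuzzer
    dictionary file, one name="token" entry per line."""
    prompt = (
        "You are assisting a coverage-guided fuzzer. From the parser code below, "
        "list magic bytes, protocol keywords, and field names, one double-quoted "
        "token per line:\n\n" + parser_source
    )
    tokens = [ln.strip() for ln in ask_llm(prompt).splitlines()
              if ln.strip().startswith('"') and ln.strip().endswith('"')]
    with open(out_path, "w") as f:
        for i, tok in enumerate(tokens):
            f.write(f"kw_{i}={tok}\n")   # e.g. kw_0="Content-Length:"

# The resulting file is then handed to the fuzzer, for example:
#   ./http_parser_fuzzer -dict=llm.dict corpus/
```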

CRob (03:43.332)
And do you feel that combination of using some of the newer techniques and fuzzing and some of the older, more traditional techniques, do you think that was what was kind of unique and helped push you over the victory line in the cyber reasoning challenge?

Taesoo Kim (04:01.26)
It’s extremely hard to say which one contributed the most during the competition. But I want to emphasize that finding bugs at a location in the source code versus formulating an input that triggers that vulnerability, what in our competition we call a proof of vulnerability, these are two completely different tasks. You can identify many bugs.

But unfortunately, in order to say this is truly a bug, you have to prove it yourself by showing or constructing the input that triggers the vulnerability. As for the difficulty of the two tasks, I would say people do not comprehend the challenge of formulating the input versus finding a vulnerability in the source code. You can pinpoint the various suspicious places in the source code without much difficulty.

But in fact, that’s the easier job. In practice, the more difficult challenge is identifying the input that actually reaches the place you care about and triggers the vulnerability as a result. So we spent much more time on how to construct the input correctly, to show that we really can prove the existence of the vulnerability in the source.
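
(Editor’s note: a toy illustration of the gap Kim describes between spotting a bug and producing a proof of vulnerability. The parser, its input format, and the Python stand-in for an out-of-bounds write are invented for this example.)

```python
# Toy parser: the missing bound check below is easy to *spot* by reading the
# code, but a proof of vulnerability (PoV) is a concrete input that actually
# reaches it past the header checks and triggers the failure.
def parse_packet(data: bytes) -> bytes:
    if len(data) < 8 or data[:4] != b"PKT1":      # magic + minimum size check
        raise ValueError("bad header")
    length = int.from_bytes(data[4:8], "big")     # attacker-controlled length
    buf = bytearray(64)
    for i, b in enumerate(data[8:8 + length]):
        buf[i] = b                                # IndexError once i >= 64: a
    return bytes(buf)                             # stand-in for an OOB write

# The PoV: satisfy the magic and length fields, then overflow the 64-byte buffer.
pov = b"PKT1" + (100).to_bytes(4, "big") + b"A" * 100
try:
    parse_packet(pov)
except IndexError as e:
    print("crash reproduced:", e)
```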

CRob (05:09.692)
Mm-hmm.

CRob (05:22.94)
And I think that’s really a key to the competition as it happened, versus just someone running LLMs and scanners kind of randomly on the internet: the fact that you all were incentivized to and required to develop that fix and actually prove that these things are really vulnerable and accessible.

Taesoo Kim (05:33.718)
Exactly.

Taesoo Kim (05:42.356)
Exactly. That also highlights what practitioners care about. If you end up having so many false positives in a security tool, no one cares. There are a lot of complaints about why we are not using security tools in the first place. So this is one of the important criteria of the competition. And one of the strengths of traditional tools like a fuzzer or a concolic executor is that everything centers around reducing false positives. That’s the reason people

CRob (05:46.192)
Yes.

Taesoo Kim (06:12.258)
take a fuzzer into their workflow. So whenever the fuzzer says there is a vulnerability, indeed there is a vulnerability. That’s a huge difference. So we started with those existing tools and recognized the places we had to improve, so that we could really scale up those traditional tools to find vulnerabilities in this large-scale software.

CRob (06:36.568)
Awesome. As you know, the competition was a marathon, not a sprint, so you were doing this for quite some time. But as the competition was progressing, was there anything that surprised you and the team and kind of changed your thinking about the capabilities of these tools?

Taesoo Kim (06:51.502)
Ha

Taesoo Kim (06:55.704)
So as I mentioned before, we are hackers. We won DEF CON CTF many times and we also won other competitions in the past. So by nature, we were extremely skeptical about AI tools at the beginning of the competition. Two years ago, we evaluated every single existing LLM service with a benchmark that we designed. We realized they were all not usable at all,

CRob (07:09.85)
Mm-hmm.

Taesoo Kim (07:24.33)
and not appropriate for the competition. So instead of spending time on improving those tools, which felt inferior at the beginning, our team’s motto at that time was: don’t touch those areas. We’re going to show you how powerful these traditional techniques are. That’s how we approached the semifinal, and we did pretty well. We found many of the bugs using all the traditional tools that we’ve been working on. But…

Immediately after the semifinal, everything changed. We reevaluated the possibility of adopting LLMs. Earlier, just removing or obfuscating some of the tokens in a repository meant the LLM couldn’t reason about anything. But suddenly, near or around the semifinal, something happened. We realized that even after we inject or…

If you think of it this way: there is a token, and you replace this token with meaningless words. The LLM previously got completely confused by these syntactic structures of the source code, but now, around the semifinal, it really understood. Although we tried to fool it many times, it really caught the idea, and this was source code it had never seen before, never used in its training, because we intentionally created this source code for the evaluation.

We started realizing that it actually understands. That shocked everybody. And we started realizing that, if that’s the case, there are so many places we can improve. Right? So that’s the moment we changed our mindset. Now everything was about LLMs, everything about the new agent architectures, and we ended up putting a humongous amount of effort into creating the various agent architecture designs that we have.

Also, we replaced some software analysis techniques with LLMs as well, surprisingly. Symbolic execution is a good example. It’s extremely hard to scale: whenever you execute one instruction at a time, you have to create constraints around it. And one of the big challenges in real-world software is that there are so many, I would say, hard-to-analyze functions. Meaning that, for example, there is a…

Taesoo Kim (09:46.026)
Take NGINX as an example. We thought they probably compared strings character by character. But the way they perform string comparison in NGINX, they map the string, they do hashing, so that they can compare the hash value. A fuzzer or a symbolic executor is extremely bad at that. If you hit one hashing function, you’re screwed. There are so many constraints that there is no way you can invert it, by definition.

There’s no way. But if you think about how to overcome these situations by using an LLM, the LLM can recognize that this is a hashing function. We don’t actually have to create constraints around it; instead, what about replacing it with an identity function? That’s something we can easily handle with symbolic execution. So then we started recognizing the possible role of the LLM in symbolic execution. Now we see that

symbolic execution can scale to large software right now. So I think this is a pretty amazing outcome of the competition.
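
(Editor’s note: a minimal sketch of the function-summary idea Kim describes, written against the z3 solver rather than Team Atlanta’s CRS. The toy hash, the target keyword, and all names are assumptions for illustration only.)

```python
# Requires the z3-solver package. toy_hash is a stand-in for nginx-style key
# hashing; the point is the shape of the constraint, not the exact algorithm.
from z3 import BitVec, BitVecVal, Solver, ZeroExt, sat

def toy_hash(chars):
    """h = h*31 + c over 8-bit values, accumulated in a 32-bit register."""
    h = BitVecVal(0, 32)
    for c in chars:
        h = h * 31 + ZeroExt(24, c)
    return h

inp = [BitVec(f"b{i}", 8) for i in range(4)]        # symbolic 4-byte input
target = [BitVecVal(b, 8) for b in b"POST"]         # the keyword we want to hit

# Path constraint as the code actually computes it: hash(input) == hash("POST").
# A solver copes with this toy hash, but a real mixing hash makes it intractable.
s1 = Solver()
s1.add(toy_hash(inp) == toy_hash(target))
print("hash modeled directly:", s1.check())

# LLM-suggested summary: the model recognizes toy_hash as a hash, so we swap it
# for an identity comparison and the constraint collapses to plain byte equality.
s2 = Solver()
for b, t in zip(inp, target):
    s2.add(b == t)
print("identity summary:", s2.check(), s2.model() if s2.check() == sat else "")
```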

CRob (10:53.11)
Awesome. So again, the competition completed in August. So what plans do you have? What plans does the team have for your CRS now that the competition’s over?

Taesoo Kim (10:58.446)
Thank you.

Taesoo Kim (11:02.318)
I think that’s a great question. Many tech companies approached our team. Some of our members recently joined other big companies. And many of our students want to quit the PhD program and start a company. For good reasons, right?

CRob (11:14.848)
I bet.

Taesoo Kim (11:32.766)
One team, made up of four of my PhD students, recently formed and is looking for commercialization opportunities. Not in the traditional cyber infrastructure we are looking at through DARPA, but they spotted a possibility in smart contracts, and in modernized financial industries like stablecoins and whatnot,

where they can apply the AIxCC-like techniques to finding vulnerabilities in those areas. So instead of having everything analyzed by a human auditor, you can analyze everything using LLMs or agents and similar techniques that we developed for AIxCC, and you can reduce the auditing time significantly. To get an audit of a smart contract, traditionally you have to wait two weeks.

In the worst case even months, with a ridiculous cost. Typically, one audit of a smart contract runs $20,000 or $50,000 per case. But in fact, you can reduce the auditing time down to, I’d say, a few hours or a day. The potential benefit of achieving this speed is that you really open up

CRob (12:40.454)
Mm-hmm.

CRob (12:47.836)
Wow.

Taesoo Kim (12:58.186)
amazing opportunities in this area. You can automate the auditing, you can increase the frequency of auditing in the smart contract area. Not only that, we thought there is a possibility for even more, like compliance checking of smart contracts. There are so many opportunities we can pursue immediately by using AIxCC systems. That’s one area we’re looking at. Another one is a more traditional area,

CRob (13:00.347)
Mm-hmm.

Taesoo Kim (13:25.07)
what we call cyber infrastructure, like hospitals and some government sectors. They really want to analyze their systems, but unfortunately, or fortunately, there is another opportunity there: in AIxCC we analyze everything from source code, but they don’t have access to the source. So we are creating a pipeline that, given a binary or an execution-only environment, converts it

CRob (13:28.828)
Mm-hmm.

CRob (13:38.236)
Mm-hmm.

Taesoo Kim (13:52.416)
in a way that we can still leverage the existing infrastructure that we have for AIxCC. More interestingly, they don’t have access to the internet when they’re doing pen testing or analysis, so we started incorporating some of our own open source models as part of our systems. These are the two commercialization efforts we’re thinking about, and many of my students are currently…

CRob (13:57.67)
That’s very clever.

CRob (14:05.5)
Yeah.

CRob (14:13.564)
It’s awesome.

CRob (14:20.366)
And I imagine that this is probably amazing source material for dissertations and the PhD work, right?

Taesoo Kim (14:29.242)
Yes, yes. For the last two years, we were purely focused on AIxCC. Our motto was that we don’t have time for publication; just win the competition. Everything comes after. And this is the moment: I think we’re going to release our tech report, over 150 pages, around next week. We have a draft right now, but we are still preparing it

CRob (14:39.256)
Yeah.

CRob (14:51.94)
Wow.

Taesoo Kim (14:58.51)
for publication, so that other people don’t just get source code. Okay, that’s great, but you need some explanation of why we did things this way. Much of the source is there for the competition, right? So the core pieces might be a little bit different from the daily usage of normal developers and operators. So we are creating a condensed technical write-up for them to understand.

Not only that, we have a plan to make it more accessible. Currently our CRS implementation is tightly bound to the competition environment, meaning we had a crazy amount of resources on the Azure side where everything was deployed and thoroughly tested. But unfortunately, most people, including ourselves, don’t have those resources. The competition gave about

80,000 in cloud credits that we had to use. No one has that kind of resource, not unless you’re a company. But we want people to be able to apply this to their own projects at a smaller scale. That’s what we are currently working on: discarding all these competition-dependent parameters from the source code and making it more self-contained, so that you can even launch our CRS in your local environment.

This is one of the big, big development efforts that we are doing right now in our lab.

CRob (16:32.155)
That’s awesome. Take me a second here and let’s think about this from the perspective of the students who participated. What kind of an experience was it, getting to work with professors such as yourself and actual professional researchers and hackers? What do you see the students taking away from this experience?

Taesoo Kim (16:53.846)
I think exposure to the latest models. Because we were tightly collaborating with OpenAI and Gemini, we were really exposed to those latest models. If you’re just working on security, not working closely with LLMs, you probably don’t appreciate that as much. But through the competition, everyone’s mindset changed. And then we spent time

deeply looking at what’s possible and what’s not. We now have a great sense of what types of problems we have to solve, even on the research side. And now, suddenly, after this competition, every single security project, every piece of security research we are doing at Georgia Tech is based on LLMs. Even more surprising, we have a decompilation project we are doing, the most traditional security research you can imagine:

CRob (17:42.448)
Ha ha.

Taesoo Kim (17:52.162)
binary analysis, malware analysis, decompilation, crash analysis, whatnot. Now everything is LLM. Now we realize the LLM is much better at decompiling than traditional tools like IDA and Ghidra. So I think these are the types of research we previously thought impossible. We probably weren’t even thinking about applying LLMs, because we spent our lifetimes working on decompiling.

CRob (17:53.68)
Mm.

CRob (17:59.068)
Yeah.

Taesoo Kim (18:22.318)
But at a certain point, we realized that the LLM is just doing better than what we’ve been working on. Just like that, one day. It’s a complete mind change. From a traditional program analysis perspective, many of these problems are NP-complete. There’s no way you can solve them in an easier way, so you don’t spend time on them. That’s our typical mindset. But now, it works in practice, amazingly.

CRob (18:29.574)
Yeah.

Taesoo Kim (18:51.807)
Figuring out how to improve what we previously thought impossible by using an LLM, that’s the key.

CRob (18:57.404)
That’s awesome. It’s interesting, especially since you stated that initially, when you went into the competition, you were very skeptical about the utility of LLMs. So it’s great that you had this complete reversal.

Taesoo Kim (19:04.238)
Thank you.

Yeah, but I would like to emphasize one of the problems with LLMs, though: it’s expensive, and it’s slow in the traditional sense. You have to wait a few seconds, or a few minutes in certain cases like reasoning models or whatnot. So tightly binding your performance to this performance-lagging component in the entire system is often challenging,

CRob (19:17.648)
Yes.

CRob (19:21.82)
Mm-hmm.

Taesoo Kim (19:39.598)
and then, it’s just talking. Another thing is that everything is text. There’s no proper API, just text. There’s no sophisticated way to leverage it, just text. You’re probably familiar with all the security issues that can come with unstructured input; it’s similar to cross-site scripting in the web space. There are so many problems you can imagine.

CRob (19:51.984)
Okay, yeah.

CRob (20:01.979)
Mm-hmm.

Taesoo Kim (20:08.11)
But as long as you can use it in a well-contained manner, in the right way, we believe there are so many opportunities we can get from it.

CRob (20:18.876)
Great. So now that your CRS has been released as open source, if someone from our community was interested in joining and maybe contributing to that, what’s the best way somebody could get started and get access?

Taesoo Kim (20:28.494)
Mm-hmm.

So we’re going to release a non-competition version very soon, along with several documents. We call it a standardization effort that we and other teams are doing right now. We define a non-competition CRS interface, and as long as you implement that interface, our goal is to mainstream it into OSS-Fuzz together with the Google team,

CRob (20:36.369)
Mm-hmm.

CRob (20:58.524)
Mm-hmm.

Taesoo Kim (20:59.086)
so that you can put your CRS in as part of the OSS-Fuzz mainstream, and we can make it much easier for everyone to evaluate one at a time in their local environment as part of the OSS-Fuzz project. So we’re gonna release the RFC document pretty soon through our website, so that everyone can participate and share their opinion on what features they think we are missing. We’d love to hear about that.

CRob (21:03.74)
Thanks.

CRob (21:18.001)
Mm-hmm.

Taesoo Kim (21:26.502)
And then after that, after about a month, we’re going to release our local version so that everyone can start using it. And with a very permissive license, everyone can take advantage of this public research, including companies.

CRob (21:34.78)
Awesome.

CRob (21:42.692)
I’m just amazed. When I came into this, partnering with our friends at DARPA, I was initially skeptical as well. And as I was sitting there watching the finals announced, it was just amazing, the innovation and creativity that all the different teams displayed. So again, congratulations to your team, all the students and the researchers and everyone that participated.

Taesoo Kim (21:59.79)
Mm-hmm.

CRob (22:12.6)
Well done. Do you have any parting thoughts? You know, as we move on, do you have any words of wisdom you want to share with the community, or any takeaways for people curious to get into this space?

Taesoo Kim (22:25.486)
Oh, regarding commercialization, one thing I’d also like to mention is that at Samsung, we already took the open source version of the CRS and started applying it to internal projects and open source Samsung projects immediately after. So we started seeing the benefit of applying the CRS in the real world immediately after the competition. A lot of people think that a competition is just for competition, or for show,

CRob (22:38.108)
Mm-hmm.

Taesoo Kim (22:55.032)
but in fact, it’s not. Everyone in industry, including Anthropic, Meta, and OpenAI, they all want to adopt those technologies behind the scenes. And we are also working together with the Amazon AWS team; they want to support the deployment of our systems in the AWS environment as well, so that with just one click, everyone can launch the systems. And they mentioned there are several

CRob (22:55.036)
Mm-hmm.

Taesoo Kim (23:24.023)
government-backed organizations. They explicitly requested to launch our CRS in their environment.

CRob (23:31.1)
I imagine so. Well, again, kudos to the team. Congratulations. It’s amazing. I love to see when researchers have these amazing creative ideas and are actually able to add real value. And it’s great to hear that Samsung was immediately able to start getting value out of this work, and hopefully other folks will do the same.

Taesoo Kim (23:55.18)
Yeah, exactly. I think, as far as words of wisdom or general advice go, it’s about this competition-based innovation, particularly with academics or startups involved. Because of this venue, everyone, including ourselves, the startup people, and the other team members, put their lives

into this competition. It’s an objective metric, a head-to-head competition. We don’t care about your background. Just win, right? There’s your objective score; your job is to find and fix it. I think this competition really drove a lot of effort behind the scenes on our team. We were motivated because this entire competition was presented to a broader audience. I think this is really a way to drive innovation

CRob (24:26.46)
Mm-hmm.

CRob (24:32.57)
Yes.

CRob (24:36.709)
Mm-hmm.

Taesoo Kim (24:54.904)
and to get some public attention more broadly as well. So I think we really want to see other types of competitions in this space. And in the near future, based on the current trend, you’ll probably see CTF competitions like that, maybe not just CTF but agent-based CTF, with no humans involved, where the agents are attacking each other and solving CTF challenges.

CRob (24:58.524)
Excellent.

CRob (25:19.59)
Mm-hmm.

Taesoo Kim (25:24.846)
This is not five years out. It’s going to happen in two years or sooner. Even in this year’s live CTF, one of the teams actually leveraged agent systems, and the agents actually solved the challenges quicker than the humans. So I think we’re going to see those types of events and breakthroughs more and more often.

CRob (25:55.292)
I used to be a judge at the collegiate cyber competition for one of our local schools. And I see a lot of interesting applicability in using this to help teach the students, where you have an aggressive attacker doing these different techniques, and being able to apply some of these learnings that you all have. It’s really exciting stuff.

Taesoo Kim (26:00.142)
Mm-hmm.

Taesoo Kim (26:15.47)
There’s an interesting quote, I don’t know who actually said it, but in the AI space someone mentioned that there will be a one-person, one-billion-dollar-market-cap company because of LLMs, or because of AI in general. But if you look at CTFs, currently most teams have a minimum of 50 or 100 people competing against each other. We’re going to see, very soon,

one person, or maybe five people, with the help of those AI tools, and they’re going to compete. Or humans are just assisting the AI, in the sense of, hey, could you bring up the Raspberry Pi for me, or set things up, so the human is just helping the LLM, or helping the AI in general, so that the AI can compete. So I think we’re going to see some interesting things happening pretty soon in our community for sure.

CRob (26:59.088)
Mm-hmm. Yeah.

CRob (27:11.804)
I agree. Well, again, Taesoo, thank you for your time. Congratulations to the team. And that is a wrap. Thank you very much.

Taesoo Kim (27:22.147)
Thank you so much.