Summary
In the third episode of our AI Cyber Challenge (AIxCC) series, CRob sits down with Michael Brown, Principal Security Engineer at Trail of Bits, to discuss their runner-up cybersecurity reasoning system, Buttercup. Michael shares how their team took a hybrid approach – combining large language models with conventional software analysis tools like fuzzers – to create a system that exceeded even their own expectations. Learn how Trail of Bits made Buttercup fully open source and accessible to run on a laptop, their commitment to ongoing maintenance with prize winnings, and why they believe AI works best when applied to small, focused problems rather than trying to solve everything at once.
This episode is part 3 of a four-part series on AIxCC:
Conversation Highlights
00:04 – Introduction & Welcome
00:12 – About Trail of Bits & Open Source Commitment
03:16 – Buttercup: Second Place in AIxCC
04:20 – The Hybrid Approach Strategy
06:45 – From Skeptic to Believer
09:28 – Surprises & Vindication During Competition
11:36 – Multi-Agent Patching Success
14:46 – Post-Competition Plans
15:26 – Making Buttercup Run on a Laptop
18:22 – The Giant Check & DEF CON
18:59 – How to Access Buttercup on GitHub
21:37 – Enterprise Deployment & Community Support
22:23 – Closing Remarks
Transcript
CRob (00:04.328)
And next up, we’re talking to Michael Brown from Trail of Bits. Michael, welcome to What’s in the SOSS.
Michael Brown (ToB) (00:10.688)
Hey, thanks for having me. I appreciate being here.
CRob (00:12.7)
We love having you. So maybe could you describe a little bit about your organization you’re coming from, Trail of Bits, and maybe share a little insight into what your open source origin story is.
Michael Brown (ToB) (00:23.756)
Yeah, sure. So Trail of Bits is a small business. We’re a security R &D firm. We’ve been in existence since about 2012. I’ve personally been with the company about four years plus. I work there within our research and engineering department. I’m a principal security engineer, and I also lead up our AIML security research team. So Trail of Bits, we do quite a bit of government research. We also work for commercial clients.
And one of the common threads in all of the work that we do, not just government, not just commercial, is that we try to make it as public as much as we possibly can allow. So for example, sometimes, you know, we work on sensitive research programs for the government and they don’t let us make it public. Sometimes our commercial clients don’t want to publicize the results of every security audit, but to the maximum extent that our clients allow us to, we make our tools, we make our findings, we make them open source. And we’re really big believers in
that the work that we do should be a rising tide that raises all ships when it comes to the security posture for the critical infrastructure that we all depend upon, whether we’re working on hobbies at home and whether we’re building things for large organizations, all that stuff.
CRob (01:37.32)
love it. And how did you get into open source?
Michael Brown (ToB) (01:42.146)
Honestly, I’ve just kind of always have been there. So realistically, you know, the open source community is where a lot of the research tools that I started out my research career. That’s where you found them. So I started off a bit in academia. I got my undergrad in computer science and then went and did something completely different for eight years. And then when I kind of
Uh, you know, for context, I joined the military. flew helicopters for like eight years and basically nothing in computing. But as I was starting to get out, um, the army, I, know, I was getting married, about to have kids. I kind of decided I wanted to be, you know, around the house a little bit more often. um, you know, I started getting a master’s degree at Georgia Tech. They’re offering it online. And then after I did that, I went to go, um, do a PhD there and also work for their, um, uh, their applied research arm, Georgia Tech Research Institute.
So lot of the work that I was doing was, you know, cutting edge work on software analysis, compilers and AI ML. And a lot of the stuff that I, you know, built the tools that I did my research on, they came from the open source community. They were tools that were open sourced as part of the publication process for academic work. They were made publicly and available open source by companies like Trail of Bits before I came to work with them as the result of government research projects.
So, honestly, I guess I don’t really have much of an origin story for when I got there. I kind of just landed there when I started my career in security research and just stayed.
CRob (03:16.814)
Everybody has a different journey that gets us here. And interestingly enough, you mentioned our friends at Georgia Tech, which was a peer competitor of yours in the AICC competition, which you all on Trail of Bits team, I believe your project was called Buttercup. And you came in second place. You had some amazing results with your work. So maybe could you tell us a little bit about the…
Michael Brown (ToB) (03:33.741)
Yeah, that’s correct.
CRob (03:43.15)
What you did is part of the AI CC competition and kind of how your team approached this.
Michael Brown (ToB) (03:51.022)
Yeah. So, um, you know, at the risk of sounding a bit like a hipster, um, I’ve been working at the intersection of software security, compiler, software analysis, AI ML for, you know, basically, um, almost my entire, uh, career as a research scientist. So, you know, dating back to the earliest program I worked on, uh, for, DARPA was back in 2019. And, um, so this was, this was before the large language model was a predominant form of the technology or kind of became synonymous with AI. So.
CRob (04:04.719)
Mm.
Michael Brown (ToB) (04:20.792)
For a long time, I’ve been working and trying to understand how we can apply techniques from AI ML modeling to security problems and doing the problem formulation, make sure that we’re applying that in an intelligent way where we’re going to get good solid results that actually generalize and scale. So as the large language model came out and we started recognizing that certain problems within the security domain are good for large language models, but a lot of them aren’t.
When the AI cyber challenge came around, we always approached this and this, I was the lead designer, my co-designer, Ian Smith. And I, you know, when we sat down and make the, the original concept for what became Buttercup, we always took an approach where we were going to use the best problem solving technique for the sub problem at hand. So when we approached this giant elephant of a problem, we did what you do and you have an elephant and you’ve got to eat it, eat it one bite at a time.
So each bite we took a look at it and said, okay, you we have like these five or six things that we have to do really, really well to win this competition. What’s the best way to solve each of these five or six things? And then the rest of it became an engineering challenge to chain them together. Our approach is very much a hybrid approach. This was a similar approach taken by the first place winners at Georgia Tech, which by the way, if you’ve got to be beat by anybody being beat by your alma mater, it takes a little bit of this thing out of it. So, you know, we came in first and second place. It’s funny, I actually have another Georgia Tech PhD alumni.
CRob (05:33.832)
You
Michael Brown (ToB) (05:42.926)
on my team who worked on Buttercup. So Georgia Tech is very well represented in the AI cyber challenge. So yeah, we’ve always had a hybrid approach. The winning team had a hybrid approach. So we used AI where it was useful. We used conventional software analysis techniques where they were useful. And we put together something that ultimately performed really, really well and exceeded even my expectations.
CRob (05:45.458)
That’s awesome.
CRob (06:07.56)
I can say I mentioned in previous talks, I was initially skeptical about the value that could have been derived from this type of work. But the results that you and the other competitors delivered were absolutely stunning. You have converted me into a believer now that I think AI absolutely has a very positive role can play both in the research, but also kind of the vulnerability and operations management space. What
Looking at your buttercup. What is unique about your approach with the cyber reasoning system with the buttercup?
Michael Brown (ToB) (06:45.39)
Yeah, so it’s funny you say that we converted you. I kind of had to convert myself along the way. There was a time in this competition where I thought, you know, this whole thing was going to kind of be reliant on AI too much and was going to fall on its face. And then, you know, at that point, I’d be able to say like, see, I told you, you can’t use LLMs for everything. But then it turns out, you know, as we got through there, we use LLMs for two critical areas and they worked much better than I thought they would. I thought they would work pretty well, but they ended up working to a much better degree than I thought they actually would. you know, what makes Buttercup unique?
CRob (06:49.852)
Yeah.
CRob (07:00.678)
You
Michael Brown (ToB) (07:15.69)
is that, like I said, we take a hybrid approach. We use AIML for very good problems that are well-suited for AIML. And what I mean by that is when we employ large language models, we use them on small subproblems for which we have a lot of context. We have tools that we can install for the large language model to use to ensure that it creates valid outputs and outputs that can carry on to the next stage with a high degree of confidence that they’re correct.
CRob (07:30.076)
Mm-hmm.
CRob (07:43.912)
Mm-hmm.
Michael Brown (ToB) (07:45.934)
And then in the places where we have to create or sorry, in one of the places where we have to use conventional software analysis tools, those areas are very amenable to the conventional analysis. So, you what I mean by this? good example is, for example, we needed to produce a proof of vulnerability. We have to have a crashing test case to show that when we claim a vulnerability exists in a system, we can prove through reproduction that it actually exists. Large language models aren’t great.
at finding these crashing test cases just by asking it to look at the code and say, hey, what’s going to crash this? They don’t do very well at that. They also don’t do well at generating an input that will even get you to a particular point in a program. But fuzzers do a great job of this. So we use the fuzzer to do this. But one of the things about fuzzers is they kind of take a long time. They’re also more generally aimed at finding bugs, not necessarily vulnerabilities.
CRob (08:36.808)
Mm-hmm.
Michael Brown (ToB) (08:42.702)
So we use an AIML, or Large Language Model based accelerator, or a C generator, to help us generate inputs that were going to guide the fuzzer to either saturate the fuzzing harnesses that existed for these programs more quickly. They would help us find and shake loose more crashing inputs that correspond to vulnerabilities as opposed to bugs. And those things really, really helped us deal with some of the short analysis and short.
processing windows that we encountered in the AI cyber challenge. So was really a matter of using conventional tools, but making them work better with AI or using AI for problems like generating software patches for which there really aren’t great conventional software analysis tools to do that.
CRob (09:28.018)
So as you were going through the competition, which went through multiple rounds, are there anything that surprised you or that you learned that, again, you said your opinion changed on using AI? What were maybe some of the moments that generated that?
Michael Brown (ToB) (09:45.226)
Yeah, so there I mean there were a couple of them. I’ll start with one where I can pat myself on the back and I’ll finish with one where I was kind of surprised. So first, we had a couple of moments that were really kind of vindicating as we went through this. Our opinion going into this was that large language models, couldn’t just throw the whole problem at it and expect it to be successful. So going into this, there was a lot of things that we did that we did
CRob (09:49.405)
hehe
Michael Brown (ToB) (10:14.774)
two years ago when we first started out that, you know, that’s like two years ago, it’s like five lifetimes when it comes to the development of AI systems now. So there were some things that we did that didn’t exist before that became industry standard by the time we finished the competition. So things like putting your LLM queries or your LLM prompts in a workflow that includes like validation with tools or the ability to use tools.
CRob (10:29.298)
Mm-hmm.
Michael Brown (ToB) (10:43.062)
That was something that was not mainstream when we first started out, but that was something that we kind of built custom into Buttercup when it came particularly to patching. And then also using a multi-agent approach. know, a lot of the, you know, I don’t know, a lot of the hype around AI is that, know, you just ask it anything and it gives you the answer. You know, we’re asking a lot of AI when we say, here’s a program, tell me what vulnerabilities exist, prove they exist, and then fix them for me.
And also don’t make a mistake anywhere along the way. It’s way too much to ask. we found particularly with patching, were, back then, multi-agent systems or even agentic systems or multi-agentic systems were unheard of. were still just, we were still using chat GPT 3.5, still very much like chatbot interactions, web browser interactions.
CRob (11:16.564)
Yeah.
Michael Brown (ToB) (11:36.438)
integration into tools was certainly less widespread. So we had seen some very early work on archive about solving complex problems with multiple agents. So breaking the problem down for it. And we used this and our patcher ended up being incredibly good. was our most important and our biggest success on the project. Really want to shout out Ricardo Charon, who’s our leader, the lead developer for our patching agent.
CRob (11:47.976)
Mm-hmm.
Michael Brown (ToB) (12:06.414)
or for a patching system within for both the semi finals and finals in the ICC. He did an incredible job and we really built something that I, like I said, I regard as our biggest success. So, know, sure enough, and as we go through this two year competition, now all of a sudden, you know, multi agentic systems, multi agentic tool enabled systems, they’re all the rage. This is how we’re solving these challenging problems. And also a lot of this problem breakdown stuff that has made its way baked into the models now, the newer thinking and reasoning models from
Anthropic and open AI respectively. They you you can give it these large complicated problems and will do it will first try to break it down before trying to solve it. So you know we were building all that stuff into our system before. It came about so that that’s an area where you know, like I said, we learned along the way that we had the right approach from the beginning and it’s really easy to go back and say that that’s what we learned that we were right. So on the other side of this I will have to say, you know, I’ll reiterate I was really surprised at how well.
CRob (12:53.639)
Mm-hmm.
Michael Brown (ToB) (13:04.11)
language models were able to do some of the tasks we asked it to do. Part of it’s how we approach the problem. We didn’t ask too much of it. And I think that’s part of the reason why the large language models were successful. an area that I thought where it was going to be much more challenging was patching. But it turned out to be an area where, a certain degree, this is kind of an easier version of the problem in general because open source software, which are the targets of the AI cyber challenge, they’re ingested into the training.
CRob (13:08.924)
Mm.
Michael Brown (ToB) (13:31.404)
data for all of these large language models. the models do have some a priori familiarity with the targets. So when we give it a chunk of vulnerable code from a given program, it’s not the first time it’s seen the code. But still, they did an amazing job actually generating useful patches. The patch rate that I expected personally to see was much lower than the actual patch rate that we had, both in the semifinals and in the finals. So even in that first year window,
CRob (13:33.64)
Mm.
Michael Brown (ToB) (13:58.63)
I was really kind of blown away with how well the models were doing at code generating, code generation tasks, particularly small focused code generation tasks. So I think, think large language models are kind of getting a bad rap right now when it comes to like, you know, trying to vibe code entire applications. They’re like, gosh, this, this code is slop. It’s terrible. It’s full of bugs and stuff. Well, well, you did also ask you to build the whole thing. You know, if I asked a junior developer to build a whole thing, they probably also put together some.
CRob (14:07.366)
Yeah.
CRob (14:17.233)
Yeah.
CRob (14:26.258)
Yeah.
Michael Brown (ToB) (14:26.71)
and gross stuff. But when I ask a junior developer to give me a bug fix, much like the large language model, when I ask it for a more constrained version of problem, they tend to do a better job because there’s just fewer moving parts. So yeah, those are are two things I took away. One that, you know, like I said, I get to pat myself on the back for another that was actually surprising.
CRob (14:46.012)
That’s awesome. That’s amazing. So now that the competition is over and what does the team plan to do next beyond this competition?
Michael Brown (ToB) (14:57.098)
Yeah, so I mean, look, we spent a lot of our time over the last two years. A lot of I wouldn’t quite say blood. I don’t think anyone would bled over this, but we certainly had some tears. We certainly had a lot of anxiety. you know, we put a lot of we put a lot of ourselves into Buttercup. And so, you we want people to use it. So to that end, Buttercup is fully available and fully open source. know, DARPA made it a contingent, a contingency of participating in the competition that
CRob (15:09.917)
Mm-hmm.
Michael Brown (ToB) (15:26.892)
you had to make the code that you submitted to the semi finals and the finals open source. So we did that along with all of our other competitors, but we actually took it one step further. So the code that we submitted to the finals is great. It’s awesome, but it runs that scale. It used $40,000 of a $130,000, I think, total budget. And it ran across like an Azure subscription that had multiple nodes.
countless replicated containers. This is not something that everyone can use. We want everyone to use it. So actually in the month after we submitted our final version of the CRS, but before DEF CON occurred, where we figured out that we won, we spent a month making a version of Buttercup that’s decoupled from DARPA’s competition infrastructure. So it runs entirely standalone on its own, but more importantly, we scaled it down so it’ll run on a laptop.
CRob (16:18.696)
Mm-hmm.
Michael Brown (ToB) (16:25.154)
We left all of the hooks. We left all of the infrastructure to scale it back up if you want. So the idea now is that if you go to trail of bits.com slash buttercup, you can learn about the tool. have links to our GitHub repositories where it’s available and you can go, you can go download buttercup on your laptop and run it right now. And if you’ve got an API key, that’ll let you spend a hundred dollars. We can run a demo to show you that, we can find and patch a vulnerability and live.
CRob (16:51.496)
That’s easy.
Michael Brown (ToB) (16:53.164)
Yeah, so anyone can do this right now. So if you’re an organization that wants to use Buttercup, you can also use the hooks that we left back in to scale it up to the size of your organization, the budget that you have, and you run it at scale on your own software targets. So even users beyond the open source community, we want this to be used on closed source code too. So yeah, our goal is for, you you asked what we’re gonna do with it afterward. We made it open source, we want people to use it.
And even on top of that, we don’t want it to bit rot. So we actually are going to retain a pretty significant portion of our winnings of our $3 million prize. We’re going to retain a pretty significant portion of that. And we’re going to use it for ongoing maintenance. So we’re maintaining it. We’ve had people submit PRs that we’ve accepted. They’re tiny, know, it’s only been out for like a month, but you know, and then we’ve also made quite a few updates to the public version of Buttercup afterwards. So it’s actively maintained.
There’s money from the company’s putting its money where its mouth is. We’re actively maintaining it. The people who built it are part of the people who are maintaining it. We are taking contributions from the community. We hope they help us maintain it as well. And yeah, we’ve made it so anyone can use it. I think we’ve taken it about as far as we can possibly go in terms of reducing the barriers to adoption to the absolute minimum level for people to use Buttercup and leverage AI to help them find and patch vulnerabilities at scale.
CRob (18:16.716)
I love that approach. Thank you for doing that. How did you fit the giant check through the teller window?
Michael Brown (ToB) (18:22.574)
Fortunately, that check was a novelty and we did not actually have a larger problem than the AICC itself to solve afterward, which was getting paid. So yeah, we did have the comically large check, you know, taped up in our booth at the AICC village at DEF CON and it certainly attracted quite a few photographs from passersby.
CRob (18:26.716)
Ha ha ha!
CRob (18:31.964)
Yeah.
CRob (18:37.864)
Mm-hmm.
Michael Brown (ToB) (18:47.736)
I don’t know. think if you get on like social media or whatever and you look up AICC, if you see anything, there’s probably lots of pictures of me throwing a big smile up and you two thumbs up underneath the check that random people took.
CRob (18:59.464)
So you mentioned that Buttercup is all open source now. So if someone was interested in checking it out or possibly even contributing, where would they go do that?
Michael Brown (ToB) (19:07.564)
Yeah, so we have a GitHub organization. We’re Trail of Bits. You can find Buttercup there. You can also find our public archives of the old versions of Buttercup. So if you’re interested in what, like, the code that actually won the competitions, you can see what got us from the semifinals to the finals. You can see what won us second place in the finals. And you can also download and use the version that’s actively maintained that’ll run on your laptop. And all three of them are there. Their repository name is just Buttercup.
We are not the only people who love the princess bride. So there are other repositories named butter. Yeah. There are other, there are other repositories named butter cup on GitHub. So you might have to sift it a little bit, but yeah, basically it github.com slash trailer bits slash butter cup, I think is like 85 % of the URL there. don’t have it memorized, but, but yeah, you can find it publicly available and, along with a lot of other tools that, trailer bits has made over the years. So we encourage you to check some of those out as well. A lot of those are still actively maintained.
CRob (19:39.036)
That’s what it was.
Michael Brown (ToB) (20:03.72)
have a lot of community support. believe it or not at last, I counted like something like 1250 stars. buttercup is only like our fifth most popular tool that, that trail of bits has created. So, you know, we, we were quite, we were quite notable for creating some binary lifting tools that are up there. we have also some other tools that we’ve created recently for, parser security, analysis like that, like graftage.
And then also some kind of more conventional security tools like algo VPN, like still rank above buttercup. So as awesome as buttercup is, it’s like, it’s like only the fifth coolest tool that we’ve made as voted on by the community. So check out the other stuff while you’re there too. believe it or not, buttercup isn’t, isn’t, isn’t our most popular, offering.
CRob (20:51.56)
Pretty awesome statement to be able to say. That’s only our fifth most important tool.
Michael Brown (ToB) (20:53.966)
Yeah.
Michael Brown (ToB) (20:58.444)
I don’t know, you know, like personally, I’m kind of hoping that maybe we move up a few notches after people get time to like go find it and, you know, and start it. But, you know, we, we’ve, we’ve made some other really significant and really awesome contributions to the community, even outside of the AI cyber challenge. So I want to really, really stress all of that stuff is open source. We, you know, we, we aren’t just doing this because we have to, we actually care about the open source community. We want to secure the software infrastructure. We want people to use the tool and secure the software for, you know, before they, before they, you know,
Get it out there so that you we we we tackle this like kind of untackable problem of securing this massive ecosystem of code.
CRob (21:37.606)
Michael, thank you to Trey Labitz and your whole team for all the work you do, including the competition runner-up Buttercup, which did an amazing job by itself. Thank you for all your work, and thank you for joining us today.
Michael Brown (ToB) (21:52.802)
Yeah, thanks for having me. You one last thing to shout out there. If you’re an organization, you’re looking to employ Buttercup within your organization. Don’t be bashful about reaching out to us and asking about use cases for deploying within your organization. We know we’re happy to help out there. That’s probably an area that we focus on a little bit less in terms of getting this out the door for average folks to use or for individuals to use. So we’re definitely interested in helping to make sure Buttercup gets used.
Like I said, reach out to us, talk to us if you’re interested in Buttercup, we want to hear.
CRob (22:23.44)
Love it. All right. Have a great day.
Michael Brown (ToB) (22:25.678)
All right, thanks a lot.