
What’s in the SOSS? Podcast #2 – Christoph Kern and the Challenge of Keeping Google Secure

April 23, 2024 | Podcast

Summary

In this episode, Omkhar talks to Christoph Kern, Principal Software Engineer in Google’s Information Security Engineering organization. Christoph helps to keep Google’s products secure and users safe. His main focus is on developing scalable, principled approaches to software security.

Conversation Highlights

  • 00:42 – Christoph offers a rundown of his duties at Google
  • 01:38 – Google’s general approach to security
  • 03:02 – What Christoph describes as “stubborn vulnerabilities” and how to stop them
  • 06:42 – An overview of Google’s security ecosystem
  • 10:00 – Why memory safety is so important
  • 12:23 – Solving memory safety problems via languages
  • 16:23 – Omkhar’s rapid-fire questions
  • 18:28 – Why Christoph thinks this may be a great time for young professionals to enter the cybersecurity industry

Transcript

Christoph Kern soundbite (00:01)
The White House just put out a memo talking about memory safety and formal methods for security. I would have never believed this a couple of years ago, right? It’s becoming a more important, table-stakes part of the conversation. It might be actually a very interesting time to get into this space without having to sort of swim upstream the whole time.

Omkhar Arasaratnam (00:17)
Hi everyone, it’s Omkhar Arasaratnam. I am the general manager of the OpenSSF and the host of the What’s in the SOSS? podcast. With us this week, we have Christoph Kern. Christoph, welcome.

Christoph Kern (00:30)
Thank you, Omkhar, for having me. It’s an honor to be here, and I’m looking forward to this conversation.

Omkhar Arasaratnam (00:34)
It’s a pleasure, Christoph. So, background. Tell us a little bit about where you work and what you do.

Christoph Kern (00:42)
I’m a principal engineer at Google. I’ve been there about 20 years and a bit, so quite a long while. I work in our information security engineering team, which is basically product security. So we look after the security posture of all the services and applications that Google offers to our users and customers. And a lot of that time, I focused on essentially trying to figure out scalable ways of providing security posture across hundreds of applications and to a high degree of assurance at that.

Omkhar Arasaratnam (01:13)
Well, I think if memory serves, we spoke a couple of times when I was at Google, a couple of times after Google. I mean, securing Google full stop, no caveat, no asterisk. That’s a lot of stuff. So what are some of the ways that y’all have thought about securing all the things within Google? I presume you just don’t have a fleet of security engineers that descend upon every project.

Christoph Kern (01:38)
Right, exactly. To make this scale, you really have to think about invariants that you want to hold for every application, and also classes of common defects that you want to avoid having in any of these hundreds of applications. And the traditional way of doing this has been to try to educate developers and to use, sort of, after-the-fact code reviews and testing and penetration testing. And, you know, in our experience, this has not actually worked all that well. And we, over the years, sort of realized that we really need to think about the environments in which these applications are being built. And so usually there’s like many applications that are fairly similar, right? Like we have hundreds of web front ends and they have many aspects of their threat model that are actually the same for all of them, right?

Cross-site scripting, for instance, is an issue for every web app, irrespective of whether it’s a photo editor or a banking app or an online email system. And so we can kind of take advantage of this observation to scale the prevention of these types of problems by actually building that into the application framework and the underlying libraries and the entire developer ecosystem, really, that developers use to build applications. And that has turned out to work really quite well.

Omkhar Arasaratnam (02:53)
Now, in the past, you’ve referred to this class of stubborn vulnerabilities. Can you say a little bit more about stubborn vulnerabilities and what makes them so stubborn and hard to eliminate?

Christoph Kern (03:02)
Yeah, there’s a list of vulnerabilities that the folks who make this Common Weakness Enumeration, the CWE, put out. So they’ve been putting out the, sort of, top 25 most dangerous vulnerabilities list for years. And recently, they started also making a list of the ones that consistently appear near the top of these lists over many years. And those are then, evidently, classes of problems that are in principle well understood.

We know exactly why they happen and what the, sort of, individual root cause is, and yet it turns out to be extremely difficult to actually get rid of them at scale and consistently. And this is then evidenced in the fact that they just keep reappearing, even though there’s been guidance on how to, in principle, in theory, avoid them for years, right? And it’s well understood what you, in principle, need to do. But applying that consistently is very difficult.

Omkhar Arasaratnam (03:52)
Software engineer to software engineer: what’s the right way of fixing these vulnerabilities? I mean, we’ve thrown WAFs at them, we’ve tried all kinds of input validation techniques. What would you recommend? Like, how does Google stop those?

Christoph Kern (04:06)
I think the systemic root cause for these vulnerabilities being so prevalent is that there is an underlying API that developers use that puts the burden on developers to use it correctly and safely. Essentially, all of these APIs that are in this class of injection vulnerabilities consume a string, a sequence of characters, that is then interpreted in some language. It could be SQL in the case of SQL APIs, leading to SQL injection, or JavaScript embedded in HTML in the case of XSS, right?

And the burden is on developers to make sure that when they write code, they assemble strings that are then passed to one of those APIs, that the way they’re assembled follows secure coding guidelines. In this case, that means questions of how you would escape or sanitize an untrusted string that’s embedded in HTML markup, for instance. And you have to do this hundreds of times over in a large application because there’s lots of code in a typical web app that assembles smaller strings into HTML markup that is then shipped to a browser and rendered. And it’s extremely difficult to not forget in one of those places or apply the wrong rule or apply it inconsistently. And this is just really, really difficult, right? And this is why those vulnerabilities keep appearing.

Now, to get rid of them, what we found, the only thing that actually works, is to really rethink the design of the API and change it. And so we just went ahead effectively and changed the API so it no longer consumes a string, but rather consumes a specific type that is dedicated to that API and essentially holds the type contract, the promise, that its value is actually safe to use in that context. And then we provide libraries of builders and constructors that are written by experts, by security engineers, that actually follow safe coding rules.

And then as an application developer, you really don’t have the opportunity to incorrectly use that API anymore, because the only way to make a value that will be accepted by the API is to use those expert-built libraries, right? And then effectively the type system of the language just glues everything together. Say a value is constructed in one module, maybe even in a backend, that makes HTML markup or a snippet of HTML markup that’s shipped to a browser and then embedded into the DOM in the browser. Those two places are otherwise very difficult to reason about because they’re very far apart; they might be written by different teams. The type system ties those two things together and actually makes sure that the underlying coding rules are actually followed consistently and always.
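
To make the pattern concrete, here is a minimal sketch in Rust of the kind of typed API Christoph describes. It is illustrative only, not Google’s actual libraries: SafeHtml, from_text and render are made-up names, and a real implementation would cover many more contexts than simple text escaping. The design choice is the one he describes: the burden of applying the escaping rule moves from every call site into one expert-reviewed constructor.

```rust
// A minimal sketch of the safe-type pattern, not Google's actual API. The
// field is private, so code outside this module can only obtain a SafeHtml
// through constructors that apply the escaping rules.
mod safe_html {
    pub struct SafeHtml(String);

    impl SafeHtml {
        /// Builds markup from untrusted text by escaping HTML metacharacters.
        pub fn from_text(untrusted: &str) -> SafeHtml {
            let escaped = untrusted
                .replace('&', "&amp;")
                .replace('<', "&lt;")
                .replace('>', "&gt;")
                .replace('"', "&quot;");
            SafeHtml(escaped)
        }

        pub fn as_str(&self) -> &str {
            &self.0
        }
    }

    /// The sink accepts only SafeHtml, never a raw string, so the type
    /// checker enforces the coding rule at every call site.
    pub fn render(html: &SafeHtml) {
        println!("{}", html.as_str());
    }
}

fn main() {
    let user_input = "<img src=x onerror=alert(1)>";
    safe_html::render(&safe_html::SafeHtml::from_text(user_input));
    // safe_html::render(user_input); // won't compile: expected &SafeHtml, found &str
}
```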

Omkhar Arasaratnam (06:42)
Other than SQL injection and cross-site scripting, can you provide any other practical examples or maybe just to reflect back on how this has shown up in the security properties of Google products? Has this been broadly adopted by Google developers? Has there been some resistance? Can you talk a little bit about that from a developer experience perspective?

Christoph Kern (07:06)
The way Google’s developer ecosystem evolved, for different reasons, really for productivity and quality reasons, actually helped us greatly, right? So Google has this single monorepo where all the common infrastructure, including compilers and toolchains and libraries, is provided by central teams to all the application developers. And there’s really no practical way for somebody to build code without using that. It would be just very expensive and outlandish to even think of. And so if we build these things into those centrally-provided components, and we do it in a way that doesn’t cause undue friction, most people just don’t even notice.

They’ll just use a different API to make strings that get sent to a SQL query, and it just works, right? If it doesn’t work, then they’ll read the document and say, “Oh, this API wants a trusted SQL string instead of a string, so I’ll have to figure out how to make that and here’s the docs.” And once they figure this out once, they’re on their way. And so we’ve actually seen fairly little resistance to that. And of course, we’ve designed it so that it’s easy to use, right, otherwise we would see complaints.

One interesting thing we’ve done, which I think in hindsight helped a lot, is that we’ve chosen to make the maintainers and developers of these APIs, the security engineers, the first-line customer support, so to speak, for developers using them. So we have this internal, sort of, equivalent of Stack Overflow, where people can ask questions. And our team actually monitors the questions about these APIs. And that inherently requires us, so we don’t get drowned in questions or problems, to design them and iterate on them on an ongoing basis to make them easier to use. So that in almost all use cases, developers can just be on their way by themselves without needing any help. And so that’s really helped to sort of tune these APIs, figure out the corner cases in their usability, and make them both easy to use and secure at the same time.

Omkhar Arasaratnam (09:00)
That’s a wonderful overview. And just to summarize, by baking these kind of protections right into the tooling that the developers use, they don’t have to waste mental effort on trying to figure out how to sanitize a string. It’s already there. It’s already present. If you have a new developer coming in from the outside who maybe doesn’t have experience with using these trusted types, the actual API that they would call won’t accept a raw string. So they’re forced into it.

And I guess the counterbalance to ensure that you have a usable API for your tens of thousands of developers within Google is that essentially the people that write this also have to support it. So it’s in their best interest to make it as friction-free as possible for the average developer inside of Google. I think that’s, that’s excellent.

We’re going to switch gears. Google recently published a paper on memory safety, of which you were one of the co-authors. So let’s talk about memory safety a little bit. Can you explain to the listeners why it is important?

Christoph Kern (10:00)
Yes, I think memory safety essentially, or the memory safety problem, is essentially an instance of this sort of problem we just talked about, right? It is due to the design, in this case, of a programming language or programming languages that have language primitives that are inherently unsafe to use, where the burden is on the developer to make sure that the surrounding code ensures the safety precondition for that primitive. So for instance, if you’re dereferencing a pointer in C or C++, it’s your responsibility as a programmer to be sure that anytime execution gets to that point, that pointer still points to validly allocated memory, and it hasn’t been deallocated by some other part of the code before you got here, right?

And if it has been deallocated, that leads to a temporal safety violation, because you have, like, a use-after-free vulnerability, for instance. Similarly, when you’re indexing into an array, it’s your responsibility to make sure that the index is in bounds and you’re not reading off the end of the array or before the beginning. Otherwise, you have a spatial safety issue.

And I think what makes memory safety particularly stubborn is that the density of uses of potentially unsafe APIs or language features for memory safety is orders of magnitude higher than for some of these other vulnerability classes. So if you look at SQL injection, in a large program you might maybe have tens of places where the code is making a SQL query, versus in a large C program or C++ program, you’ll have, you know, thousands or tens of thousands of places that dereference pointers; like, every other line of code literally is a potential safety violation, right? And so with that density of potential mistakes, there will be mistakes.

There’s absolutely no way around it. And that sort of is borne out by experience in that code that is written in an unsafe language tends to have its vulnerabilities be memory safety vulnerabilities.
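
The two hazard classes Christoph mentions are C and C++ hazards; the sketch below expresses them in Rust’s unsafe subset only because its raw pointers carry the same programmer-enforced preconditions. Everything outside the comments compiles as written; the dereferences are commented out because executing them would be undefined behavior.

```rust
fn main() {
    // Temporal safety: the raw pointer outlives the allocation it points into.
    let dangling: *const i32 = {
        let boxed = Box::new(42);
        &*boxed as *const i32
        // `boxed` is freed at the end of this block, so the pointer now dangles.
    };
    // unsafe { println!("{}", *dangling); } // use-after-free if uncommented

    // Spatial safety: nothing checks that an offset stays inside the array.
    let data = [1, 2, 3];
    let out_of_bounds = data.as_ptr().wrapping_add(10);
    // unsafe { println!("{}", *out_of_bounds); } // out-of-bounds read if uncommented

    let _ = (dangling, out_of_bounds); // silence unused-variable warnings
}
```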

Omkhar Arasaratnam (11:50)
Many languages nowadays, be it Python, JavaScript, or, for lower-level software development, Golang or Rust, all proclaim these memory safety properties and are often referred to with absolutes, as in solving an entire class of problems. I think you and I have been around software engineering for long enough that such bold claims are often met with a bit of cynicism. Can you talk about how these languages are actually solving these entire classes of memory safety problems?

Christoph Kern (12:23)
Yes, I think the key to that is that if you use them to good effect, you can design your overall program to enable modular reasoning about the safety of the whole thing. And in particular, design the small fragments of your code that do need to use unsafe features. So in Rust, it might be that you need a module that uses unsafe. For instance, if you want to implement a linked list, a doubly linked list, you need unsafe, right? Or you need reference counters.

But what you can do is write this one module that uses potentially unsafe features so that it is self-contained and its correctness and safety can be reasoned about without having to think about the rest of the program. So basically when you write this linked list implementation, for instance, in Rust, you will write it in a way such that the assumptions it needs to make about the rest of the program are entirely captured in the type signatures of its API.

And you can then validate, by really thinking about it hard, but it’s a small piece of code, and you might get your experts in unsafe Rust to look at it, that that module will behave safely for any well-typed program it is embedded in, right? And once you are in that kind of a place, then you will actually get a very high assurance of safety of the whole program, just out of the fact that it type checks, right?

Because the components that use unsafe features are safe for any well-typed caller, and the rest of it is inherently safe due to the design of the language, there really is very little opportunity for a potential mistake. And that’s, I think, again, borne out in practice in that, like, in Java or JVM-based languages, memory safety really is a very rare problem. We’ve had some buffer overflows in, I don’t know, like, image parsers that use native code and stuff like that.

But it’s otherwise a relatively rare problem compared to the density of this type of bug in code that’s written in a language where unsafety is basically everywhere across the entire code base.
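
As a deliberately tiny illustration of the encapsulation pattern Christoph describes (a real example would be something like a doubly linked list), the sketch below confines an unsafe access to one module, justifies it with a local invariant, and exposes a signature that is safe for any well-typed caller. The module and function names are invented for this example.

```rust
mod skipper {
    /// Returns every other element of `xs`, starting with the first.
    pub fn every_other(xs: &[i32]) -> Vec<i32> {
        let mut out = Vec::new();
        let mut i = 0;
        while i < xs.len() {
            // SAFETY: the loop condition above guarantees `i < xs.len()`,
            // so the unchecked access cannot go out of bounds.
            out.push(unsafe { *xs.get_unchecked(i) });
            i += 2;
        }
        out
    }
}

fn main() {
    // Callers never write `unsafe`; the type signature is the whole contract.
    assert_eq!(skipper::every_other(&[10, 20, 30, 40, 50]), vec![10, 30, 50]);
    println!("the unsafe detail stayed inside the module");
}
```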

Omkhar Arasaratnam (14:23)
So, I mean, the obvious thing seems to be, OK, let’s wave our magic wands and rewrite everything in a memory-safe language. Obviously, things aren’t that simple. So what are the challenges with simply shifting languages, and how do you address large legacy code bases?

Christoph Kern (14:39)
Unless there is some breakthrough in, like, ML-based automated rewriting, I think we have to live with the assumption that the vast majority of C++ code that is in existence now will keep running as C++ until it reaches its natural end of life. And so we have to really think about, as we make this transition to memory safety, probably over a span of, like, decades, really, where do we put our energy to get the best benefit for our investments in terms of increased security posture?

And so I think there’s a couple of areas we can look at, right? So for instance, there are some types of code that are particularly risky; it’s most likely very valuable to focus on those and replace them with a memory-safe implementation. So we might replace an image parser that’s written in C or C++ with one that’s written in Rust, for instance.

And then beyond that, if we have a large C++ code base that we can’t really rewrite feasibly and we can’t just stop using it because we need it, we’ll have to look at incremental ways we can improve its security posture. And there are some interesting approaches. For instance, I think we are somewhat confident that it’s possible to achieve a reasonable assurance of spatial safety in C++ through approaches like safe buffers, basically by adding runtime checks.

For temporal safety, it’s much more difficult. There are some ideas, you know, there’s like some work in Chrome for instance, using these wrapper types for pointers called MiraclePtr. There might be some hardware mechanisms like MTE. And there’s a lot of trade-offs between cost and performance impact and achievable security improvement that will really probably take some time to shake out. But you know, we’ll get there at some point.
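
The hardening approaches Christoph lists here are C++ mechanisms, so the Rust sketch below is only a rough analogue of what a runtime bounds check buys you: an out-of-bounds access becomes a recoverable error or a deterministic panic instead of a silent read past the end of the buffer.

```rust
fn main() {
    let buf = [1, 2, 3];
    let i = 10;

    // Checked lookup: the failure is visible and can be handled.
    match buf.get(i) {
        Some(v) => println!("buf[{i}] = {v}"),
        None => println!("index {i} is out of bounds for a buffer of length {}", buf.len()),
    }

    // Plain indexing is also checked: this line would panic at runtime
    // rather than read out of bounds.
    // let _ = buf[i];
}
```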

Omkhar Arasaratnam (16:23)
I’m glad to hear that the problem is at least tractable now. Moving over to the next part of our podcast, Christoph, we’re gonna go through a series of rapid-fire questions. Some of these will have one or two options, but the last option is always, “No, Omkhar. Actually, I think it’s this.” So we’re gonna start off with one that’s quite binary, which is spicy or mild food.

Christoph Kern (16:45)
I don’t think it’s actually that binary. In the winter, when it’s cold out, I tend to gravitate to more, like, sort of, you know, savory German-type cooking. That’s my cultural background. And then in the summer, I’m leaning more towards the, like, zesty, more spicy flavors, you know, so it maybe varies throughout the year.

Omkhar Arasaratnam (17:02)
Interesting. For me, I tend to gravitate to spicy as a default, but then when the weather gets cooler, I find that spicy is an even higher priority for me, as it helps me to feel a bit warm. OK, the next one’s a bit of a controversial one based on some previous guests: VI, VS Code, Emacs?

Christoph Kern (17:22)
For me, it really depends on what I’m working on. I’ll use whatever code editor is best supported for the language. So for, like, say, Rust, it might be VS Code, but then in Google we have our own thing that’s supported by a central team. My muscle memory is definitely VI key bindings, but actually, at the age of, like, 45 or something, I decided to finally learn Emacs so I could use org mode, but I do use it with VI key bindings, so.

Omkhar Arasaratnam (17:48)
Excellent. Tabs or spaces?

Christoph Kern (17:51)
You know, I haven’t thought about that in a long time. I think many years ago, the language platform teams at Google basically decided that all code needs to be automatically formatted. And so basically, you do whatever the thing does and what’s built into the editors. And it never really occurs even as a question anymore.

Omkhar Arasaratnam (18:09)
Makes it easier. One less thing to worry about. To close it out, what advice do you have for somebody entering our field today, somebody who’s just graduating with their undergrad in comp sci, or maybe just transitioning into an engineering field from another field, who’s interested in tackling this problem of security?

Christoph Kern (18:28)
You know, maybe it’s actually a particularly good time to get into this field. I think I’ve been very fortunate to have worked in an organization that really does make security a priority. And so usually when you approach somebody you want to work with on improving security posture of a product, it’s rarely a question of whether or not this should be done at all.

You don’t have to justify your existence, right? It’s really usually questions about the engineering of exactly how to do it. And that’s a very nice place to be, right? You’re not constantly arguing to even be there, right? At the table. And I think maybe I’m a little hopeful that this is now changing for other organizations where that’s not so obvious, right? Like, you hear a lot more talk about security and security design.

I mean, the White House just put out a memo talking about memory safety and formal methods for security. It was like, I would have never believed this if you’d told me this a couple of years ago, right? So I think it’s becoming a more important and sort of obvious table-stakes part of the conversation. And so it might be actually a very interesting time to get into this space without having to sort of swim upstream the whole time.

Omkhar Arasaratnam (19:34)
We’ve talked about stubborn vulnerabilities, safe coding, memory safety. What is your call to action for our listeners, having absorbed all this new information?

Christoph Kern (19:44)
I think it is well past time to no longer put the burden on developers and to really view these problems as systemic outcomes of the design of the frameworks and application frameworks and production environments, the entire developer ecosystem. Like in the article we recently published, we kind of put it this way: the security posture is an emergent property of this entire developer ecosystem, and you really can’t change the outcome by not focusing on that and only blaming developers. It’s not going to work.

Omkhar Arasaratnam (20:18)
Christoph, thank you for joining us, and it was a pleasure to have you. Look forward to speaking to you again soon.

Christoph Kern (20:23)
Thank you, yeah, it was a pleasure to be here.

Announcer (20:25)
Thank you for listening to What’s in the SOSS? An OpenSSF podcast. Be sure to subscribe to our series of conversations on Spotify, Apple, Amazon or wherever you get your podcasts. And to keep up to date on the Open Source Security Foundation community, join us online at OpenSSF.org/getinvolved. We’ll talk to you next time on What’s in the SOSS?