Running Sigstore as a Managed Service: A Tour of Sigstore’s Public Good Instance

By Evan Anderson, Stacklok

Note: This post is an extended version of the talk at OpenSSF Day Europe 2023.

One of the challenging problems in distributing software is verifying that the bits you’re about to install are the same as the ones the author created. If an attacker can change or corrupt those bits, this can lead to arbitrary code execution with the current user’s credentials. Signing binary outputs is a common defense against this attack, but introduces new problems in key management (how to store and protect the keys) and distribution (how to find the keys).

Sigstore, an open source project started as a collaboration between Red Hat and the Google Open Source Security Team, aims to solve both of these problems. When you sign and verify artifacts using Sigstore (for example, when signing containers with cosign), your local signing tool generates short-lived public and private keys, but it needs a trusted authority to sign those keys and the associated identity in a durable way, so that clients can trust the result later. That authority is Sigstore: by default, the public good Sigstore instance run by the OpenSSF as a community-operated service for all software developers.

In some ways, the Sigstore public good instance is a bit like the CA service offered by Let’s Encrypt: Sigstore provides a CA (Fulcio) for developers to generate short-lived signing certificates, a binary transparency log (Rekor) to record artifact signing events, and a timestamp service (part of Rekor) to ensure that the artifact was signed within the certificate’s validity window. The Sigstore security model is described in more detail on the Sigstore website. While individual organizations may choose to run their own Fulcio and Rekor instances, the public good instance provides an easy onboarding path for both open-source communities and organizations that do not want to maintain a separate Sigstore stack.
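The validity-window check described above can be sketched in a few lines. This is a minimal illustration of the idea, not the logic of any Sigstore client: the function name, the dates, and the ten-minute certificate lifetime in the example are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def signed_within_validity(not_before: datetime, not_after: datetime,
                           signed_at: datetime) -> bool:
    """Return True if the trusted signing time falls inside the
    certificate's validity window (both bounds inclusive)."""
    return not_before <= signed_at <= not_after


# Hypothetical example values: a certificate that is valid for ten minutes.
issued = datetime(2023, 9, 19, 12, 0, tzinfo=timezone.utc)
expires = issued + timedelta(minutes=10)

# A signature recorded two minutes after issuance is acceptable.
print(signed_within_validity(issued, expires, issued + timedelta(minutes=2)))  # True
# A signature recorded an hour later is outside the window.
print(signed_within_validity(issued, expires, issued + timedelta(hours=1)))    # False
```

The point of the trusted timestamp is that the certificate can expire long before anyone verifies the artifact; what matters is that the recorded signing time falls inside the window.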

While several articles have been published about how to run your own Sigstore instance, it’s useful to understand how the public good instance is administered – both in terms of configuration and also policies and best practices.

Who Runs the Public Good Instance

The public good instance is run and managed by the OpenSSF; in particular, several member companies have volunteered engineers to participate in the on call rotation and maintain the public good instance. These companies include:

The on call rotation itself is managed by PagerDuty; in addition, the team coordinates on Slack and uses a private GitHub infrastructure as code (IaC) repository to manage the configuration and respond to alerts. While the repository is part of the Sigstore organization, the repo contents are private as an additional security measure to protect operational details. Other operational tasks include monthly “game days” and periodic maintenance (such as sharding the certificate transparency log across MySQL instances). The on call team is responsible for both the production public good instance and a “staging” instance, using GitOps processes with code review and monitored rollout of changes.

Infrastructure Setup

The public good infrastructure is set up in two stages: first, the base cloud infrastructure (which runs on Google Cloud Platform) provides a Kubernetes cluster and other managed cloud resources; then the Sigstore application is deployed through Kubernetes resources using ArgoCD and Helm.

Base Cloud Infrastructure

The base cloud infrastructure (e.g., databases, key management services, Kubernetes clusters) is provisioned using Terraform on Google Cloud Platform. This configuration is defined as Terraform modules in the https://github.com/sigstore/scaffolding repository as a set of GCP resources. The actual Terraform values are managed in the private IaC repository, which contains settings for both staging and production. The staging and production infrastructure is provisioned in separate GCP projects, which allows for validating changes in staging before applying them to production.

In addition to the Terraform infrastructure defined in the scaffolding repo, the private IaC repository also defines a number of resource configurations that are specific to the public good instance, such as IAM roles, DNS configuration, and specific ingress settings like CDN publishing of the TUF roots. Much like the values for the scaffolding modules, these are deployed with the same tools and code review processes; they’re separate from the scaffolding project because they reflect the specifics of the on call team and public service configuration.

Lastly, the cloud infrastructure components also deploy ArgoCD and External Secrets onto the Kubernetes cluster (again, from configuration in the scaffolding repo). These two components enable bootstrapping the rest of the Kubernetes infrastructure, including the configuration of Fulcio, Rekor, and all the supporting services.

Kubernetes Infrastructure

The Sigstore Kubernetes configuration is deployed using Helm charts defined in the https://github.com/sigstore/helm-charts repository, applied via ArgoCD. By using ArgoCD rather than simply applying the Helm charts, the manifests are continuously reconciled: any changes to the Kubernetes cluster made outside of source control will be automatically reverted. Additionally, ArgoCD provides a useful web UI for understanding the current sync status of resources. Much like the scaffolding repo, the Helm configuration templates are public, while the values.yaml is stored in the ArgoCD configuration in the private IaC repo. Among other things, this configuration includes the definition of the supported OIDC providers which can be used with the “keyless” identity-provider based flow.
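Continuous reconciliation boils down to repeatedly comparing the desired state in source control with the live state of the cluster and correcting any drift. The sketch below illustrates that comparison with plain dictionaries. It is a conceptual model only, not ArgoCD's implementation: the function name, the resource names, and the replica counts are hypothetical, and whether extra resources are actually deleted depends on how a given ArgoCD application is configured.

```python
def plan_sync(desired: dict, live: dict) -> list:
    """Compare desired state (from source control) with live state
    and return the corrective actions as (verb, resource_name) pairs."""
    actions = []
    # Anything declared in source control must exist and match.
    for name, spec in sorted(desired.items()):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))
    # Anything in the cluster that is not declared is drift.
    for name in sorted(live):
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Hypothetical states: someone scaled fulcio down by hand and left a debug pod.
desired = {"fulcio": {"replicas": 3}, "rekor": {"replicas": 3}}
live = {"fulcio": {"replicas": 1}, "debug-pod": {}}

print(plan_sync(desired, live))
# [('update', 'fulcio'), ('create', 'rekor'), ('delete', 'debug-pod')]
```

Running this comparison on a loop, rather than once at deploy time, is what distinguishes a reconciling tool from a one-shot `helm install`.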

In addition to the Sigstore Helm charts, the Argo configuration installs a number of other utilities on the cluster, such as Prometheus and the reloader tool, which restarts Deployments whose volume mount or environment variable values have changed. The Kubernetes cluster also runs jobs which continuously probe the availability of Sigstore.

Probes

In addition to metrics-based alerting around capacity limits, the on call team also measures Sigstore availability using the probes defined in https://github.com/sigstore/sigstore-probers. These tools verify the correct operation of Sigstore by acquiring a certificate from Fulcio, signing an artifact, and then ensuring that the signing event has been recorded correctly in Rekor. These probes are run both from the Kubernetes cluster (as described above) and from GitHub Actions, whose infrastructure runs in the Azure cloud. Running probes from two different cloud providers makes it easier to distinguish network connectivity issues from failures in the Sigstore application itself.
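The reasoning behind probing from two vantage points can be written down as a small decision table. The following is a hypothetical sketch of that triage logic, not code from the sigstore-probers repository; the function name and the wording of the results are assumptions.

```python
def diagnose(cluster_probe_ok: bool, external_probe_ok: bool) -> str:
    """Suggest where to look first, given probe results from the
    in-cluster prober and from an external prober in another cloud."""
    if cluster_probe_ok and external_probe_ok:
        return "healthy"
    if not cluster_probe_ok and not external_probe_ok:
        # Both vantage points fail: the service itself is the common factor.
        return "likely application failure"
    if cluster_probe_ok:
        # Only the external view fails: suspect the path into the service.
        return "likely network issue on the external path"
    # Only the in-cluster view fails: suspect the prober or its local network.
    return "investigate the in-cluster prober or its network path"


print(diagnose(True, False))   # likely network issue on the external path
print(diagnose(False, False))  # likely application failure
```

With a single prober, the last two cases would be indistinguishable from an application outage.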

Additionally, Sigstore uses BetterUptime to track the uptime of various simpler probes; the team targets a 99.5% availability SLO.
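To make the 99.5% target concrete, it helps to convert it into an allowed amount of downtime. The arithmetic below assumes a 30-day measurement window, which is an illustrative choice; the window the team actually uses is not stated here.

```python
def allowed_downtime_minutes(slo: float, days: int = 30) -> float:
    """Return the downtime budget, in minutes, for an availability
    SLO (e.g. 0.995) over a window of the given number of days."""
    total_minutes = days * 24 * 60
    return (1.0 - slo) * total_minutes


# 0.5% of a 30-day window (43,200 minutes).
print(round(allowed_downtime_minutes(0.995), 1))  # 216.0
```

Under that assumption, the budget is 216 minutes, or about three and a half hours of unavailability per 30 days.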

Summary

Part of the magic of Sigstore is the fact that application artifacts can be signed without needing extensive infrastructure or key management for developers. In order to support this flow, the public good instance provides a reliable mechanism for provisioning signing certificates and recording signatures in the transparency log.

About the Author

Evan Anderson is a Principal Software Engineer at Stacklok, securing software supply chains using open source technologies. He has been working in cloud for almost 20 years, starting at Google’s private cloud and then building Google Compute Engine and various serverless offerings: Cloud Functions, Cloud Run, and Knative. About 4 years ago, he joined VMware and was part of the Tanzu Application Platform team until June 2023.