Know Your Regular Expressions: Securing Input Validation Across Languages

June 18, 2024
By David A. Wheeler

The Open Source Security Foundation (OpenSSF) Best Practices Working Group (WG) has just released a short guide, Correctly Using Regular Expressions for Secure Input Validation! Here’s why it’s important.

When developing secure software it’s important to do input validation, that is, to check all untrusted input so that only valid data is accepted. For example, if a value is supposed to be an integer, then the software should validate that the input really is an integer and reject anything else. Strict input validation can often prevent vulnerabilities, make them harder to exploit, or reduce the damage they can inflict. Sometimes a data type is so common that libraries and frameworks include a pre-written validator for it, but many applications have application-specific patterns that also need input validation.
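As a minimal sketch of the integer example above, the Python function below accepts a value only if it is a short run of ASCII digits within an allowed range, and rejects everything else. The function name `parse_quantity`, the 9-digit cap, and the default range of 1 to 1000 are assumptions made for illustration; a real application would choose its own limits.

```python
def parse_quantity(text: str, minimum: int = 1, maximum: int = 1000) -> int:
    """Return text as an int, or raise ValueError if it is not a valid quantity.

    Hypothetical helper: the name, digit cap, and default range are illustrative.
    """
    # Allow only 1 to 9 ASCII digits: no sign, whitespace, or trailing newline.
    if not (1 <= len(text) <= 9 and text.isascii() and text.isdigit()):
        raise ValueError("quantity must be 1 to 9 ASCII digits")
    value = int(text)
    # Enforce the application-specific range after the format check.
    if not minimum <= value <= maximum:
        raise ValueError(f"quantity must be between {minimum} and {maximum}")
    return value


print(parse_quantity("42"))  # 42
```

Rejecting by default and allowing only a narrow, known-good format is the design choice here: anything the check does not explicitly recognize raises `ValueError`.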

Software developers often use regular expressions (also known as regexes) to validate input against specialized patterns. Regular expressions are a general mechanism that allows developers to quickly specify text patterns for various purposes, including security. These regular expression notations, and the libraries that implement them, have been around since the late 1960s. Regular expressions can be a great way to validate specialized patterns because they’re widely available, widely understood, flexible, and efficient. However, if regular expressions are used to secure a program, they must be used correctly… and many developers have misconceptions about regular expressions.

One of the most common misconceptions about regular expressions is that the notation is exactly the same everywhere, across all programming languages. This is not true, and it is not new information: Mastering Regular Expressions by Jeffrey E.F. Friedl (editions published 1997-2006) specifically noted many differences between the regular expression notations in various programming languages. Even the original POSIX standard, released in 1988, defined two different regular expression notations, and those were differences within the same specification! However, many of today’s developers have no idea that there are differences. A 2019 paper by Davis et al. found that, of surveyed developers, 94% reuse regular expressions, 50% reuse them at least half the time, and 47% incorrectly believe that this notation is a “lingua franca” (that is, that it’s the same everywhere).

More recently, Seth Larson’s 2024 blog post Regex character “$” doesn’t mean “end-of-string” expressly noted that many developers incorrectly think that the symbol “$” always means “end of string” in all regular expression notations. It does have that meaning by default in POSIX, JavaScript, and Go. However, in many other languages, such as Python and PHP, the symbol “$” has a different default meaning: it matches at the end of the string and also just before a newline character (“\n”) that is the last character of the string. As a result, a pattern ending in “$” can accept a value that carries a trailing newline. In short, the same symbol has different meanings in different languages. This is vitally important for security, because “$” is often used to ensure that no “extra” characters can slip through input validation. If a developer believes that “$” does one thing, but it really does something else, this could lead to a vulnerability.
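The Python behavior described above can be seen in a short sketch. The pattern (one or more lowercase ASCII letters) and the function names are hypothetical and chosen only for illustration. In Python’s `re` module, `\A` and `\Z` anchor to the very start and very end of the string, while `$` also matches just before a final newline.

```python
import re

# Hypothetical validation pattern: one or more lowercase ASCII letters.
LOOSE = re.compile(r"^[a-z]+$")     # "$" also matches just before a final "\n"
STRICT = re.compile(r"\A[a-z]+\Z")  # "\Z" matches only at the very end


def is_valid_loose(value: str) -> bool:
    """Validate with "^...$"; accepts a trailing newline in Python."""
    return LOOSE.match(value) is not None


def is_valid_strict(value: str) -> bool:
    """Validate with "\\A...\\Z"; rejects a trailing newline."""
    return STRICT.match(value) is not None


print(is_valid_loose("abc\n"))   # True: the trailing newline slips through
print(is_valid_strict("abc\n"))  # False
print(is_valid_strict("abc"))    # True
```

Other languages spell the strict end-of-string anchor differently, so the `\Z` used here should be treated as specific to Python rather than as a portable fix.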

This isn’t the only problem with using regular expressions, either. It’s all too easy to match on part of a value instead of validating the entire value. Some regular expression implementations are also vulnerable to certain denial-of-service attacks if used incorrectly. These problems are typically easy to solve, if the software developer knows about them.
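The partial-match pitfall can be sketched in Python as follows. The pattern (an 8-character lowercase hexadecimal identifier) and the function names are assumptions for illustration. `re.search` succeeds if the pattern appears anywhere in the value, whereas `re.fullmatch` requires the whole value to match.

```python
import re

# Hypothetical format: exactly 8 lowercase hexadecimal characters.
HEX_ID = r"[0-9a-f]{8}"


def contains_hex_id(value: str) -> bool:
    """Partial match: true if an ID appears anywhere in the value."""
    return re.search(HEX_ID, value) is not None


def is_hex_id(value: str) -> bool:
    """Full match: true only if the entire value is an ID."""
    return re.fullmatch(HEX_ID, value) is not None


untrusted = "deadbeef; rm -rf /"
print(contains_hex_id(untrusted))  # True: extra characters are accepted
print(is_hex_id(untrusted))        # False
print(is_hex_id("deadbeef"))       # True
```

Using a whole-string match for validation means the check fails closed when unexpected characters surround an otherwise valid value.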

The OpenSSF Best Practices Working Group (WG) has just released a short guide, Correctly Using Regular Expressions for Secure Input Validation. This guide helps software developers understand what they should do. It’s intentionally short, sharing just what a developer needs to know. If you want details, full rationale is publicly available in Correctly Using Regular Expressions for Secure Input Validation – Rationale.

We want to ensure that today’s software developers have the guidance they need to develop secure software. This short guide on correctly using regular expressions is a step in that direction.