Add fast line rejection optimization leveraging bridges in the HIR graph#3272

Open
inicula wants to merge 7 commits into BurntSushi:master from inicula:nicula/fast-line-rejections

Conversation

@inicula inicula commented Feb 5, 2026

This patch introduces a (currently tentative) optimization for fast rejection of line candidates.

The optimization is based on the following observations:

  • The HIR structure essentially contains the steps for doing a Thompson construction by recursing into the various HIR sub-components.

  • After we obtain the Thompson NFA, we treat it as an undirected graph. The 'weight' on an undirected edge A-B is whatever the A->B transition needs to consume in the Thompson NFA.

  • In this undirected Thompson NFA graph, the edges that are bridges and consume a single character belong to string literals that must necessarily appear in any line matching the target regular expression.

    For example, we can consider the regular expression "a(b|c)d". The Thompson NFA for this expression will be:

                         b
                   (2) -----> (4)
                    ^          |
                    |          | λ
                  λ |          |
              a     |          v    d
      START -----> (1)        (6) -----> END
                    |          ^
                  λ |          |
                    |          | λ
                    v          |
                   (3) -----> (5)
                         c
    

    When we disregard the direction of the edges, we notice that the edges (START, 1) and (6, END) are bridges, which means, by definition, that in order to get from START to END, they must be traversed. Consequently, if they must be traversed, then characters "a" and "d" must appear in this exact order in any line that matches the given regular expression. This property gives us a way of quickly rejecting lines that don't match the given regular expression (or, alternatively, a way of quickly finding lines that possibly match it).

    The key difference between this approach and the string literal filtering that ripgrep already does (with Aho-Corasick) is that this approach preserves the order in which the literals must appear.

  • The way we extract the bridge edges, and thus the way in which we extract the final literal sequence, is as follows:

    Let NCS(RegularExpression) -> [Character|Break] denote the Necessary Character Sequence of a regular expression.

    NCS is defined as follows:

    • NCS(<EMPTY STRING>) = []

    • NCS("<CHARACTER>") = ["<CHARACTER>"]

    • NCS(<R1><R2>) = NCS(<R1>) concat NCS(<R2>)

      i.e. the NCS of a concatenation is the concatenation of the separate NCS applications.

    • NCS(<R1>|<R2>) = [BREAK]

      i.e. the NCS of an alternation disregards the intermediate character sequences of the two possible expressions, and introduces a special BREAK character in the sequence which is discussed more below.

    You can also think about it like this:

    • bridges are only created by NCS("<CHARACTER>");
    • concatenations of regular expressions (i.e. NCS(<R1><R2>)) concatenate the bridges of their sub-expressions;
    • alternations discard the bridges of their sub-expressions and simply return a BREAK symbol.

    Example:

    • NCS("a(b|c)d") = ["a", BREAK, "d"]

    To transform this [Character|Break] sequence into the final literal sequence (i.e. [Literal], where a Literal is a sequence of Characters), we just iterate over it appending each Character to a temporary Literal result. When we encounter a BREAK, it means the current literal has ended and we need to begin a new one.

    To better understand the purpose of the BREAK symbol, consider the same example "a(b|c)d": since NCS("a") is ["a"] and NCS("d") is ["d"], we need to have a result in the middle (i.e. NCS("b|c")) which lets us know that we cannot concatenate ["a"] with ["d"] as if the literal "ad" was necessary for matches. In other words, we need to know when to BREAK a literal. When do we need BREAK in the sequence? Whenever we have a regular (sub)expression that has non-deterministic λ-transitions in its Thompson NFA!

    These non-deterministic λ-transitions can only be introduced by:

    • alternations (e.g. "a|b");
    • character sets, since they're equivalent to one or more alternations (e.g. "[ab]", "[A-Z]");
    • repetitions of the form <R1>{<min>, <max>} where <max> is strictly greater than <min> (this includes the special cases "a?", "a+", and "a*").
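The NCS rules above can be sketched on a stripped-down HIR. The type and function names here (`Hir`, `Ncs`, `ncs`, `literals`) are illustrative only, not the actual types used in regex-syntax or in this patch:

```rust
// A stripped-down HIR covering just the cases the NCS rules distinguish.
#[derive(Debug, PartialEq)]
enum Hir {
    Empty,
    Char(char),
    Concat(Vec<Hir>),
    Alt(Vec<Hir>),
}

#[derive(Debug, PartialEq)]
enum Ncs {
    Char(char),
    Break,
}

// NCS(<EMPTY>) = [], NCS(c) = [c], NCS(R1 R2) = NCS(R1) ++ NCS(R2),
// NCS(R1|R2) = [BREAK].
fn ncs(hir: &Hir) -> Vec<Ncs> {
    match hir {
        Hir::Empty => vec![],
        Hir::Char(c) => vec![Ncs::Char(*c)],
        Hir::Concat(parts) => parts.iter().flat_map(ncs).collect(),
        Hir::Alt(_) => vec![Ncs::Break],
    }
}

// Collapse the [Character|Break] sequence into the literal sequence:
// a BREAK ends the current literal and starts a new one.
fn literals(seq: &[Ncs]) -> Vec<String> {
    let mut out = Vec::new();
    let mut cur = String::new();
    for item in seq {
        match item {
            Ncs::Char(c) => cur.push(*c),
            Ncs::Break if !cur.is_empty() => out.push(std::mem::take(&mut cur)),
            Ncs::Break => {}
        }
    }
    if !cur.is_empty() {
        out.push(cur);
    }
    out
}

fn main() {
    // "a(b|c)d" as a stripped-down HIR.
    let re = Hir::Concat(vec![
        Hir::Char('a'),
        Hir::Alt(vec![Hir::Char('b'), Hir::Char('c')]),
        Hir::Char('d'),
    ]);
    assert_eq!(ncs(&re), vec![Ncs::Char('a'), Ncs::Break, Ncs::Char('d')]);
    assert_eq!(literals(&ncs(&re)), vec!["a".to_string(), "d".to_string()]);
}
```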

Finally, given a haystack, the way in which we use the resulting literal sequence of the regular expression is by searching for each literal, in order, with memmem(). If one of the literals does not exist, then the haystack has no candidates.
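A minimal sketch of this filtering step, with `str::find` standing in for memmem() for illustration (the function name is hypothetical):

```rust
/// Returns false if the ordered literal sequence cannot be found in the
/// haystack, in which case the haystack contains no line candidates.
/// Each literal is searched for only after the end of the previous one,
/// preserving the required order.
fn may_contain_match(haystack: &str, ordered_lits: &[&str]) -> bool {
    let mut offset = 0;
    for lit in ordered_lits {
        match haystack[offset..].find(lit) {
            Some(pos) => offset += pos + lit.len(),
            None => return false,
        }
    }
    true
}

fn main() {
    // From "a(b|c)d" we extracted ["a", "d"]: "a" must precede "d".
    assert!(may_contain_match("xx abd xx", &["a", "d"]));
    assert!(!may_contain_match("xx d b a", &["a", "d"]));
}
```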

Benchmark example (hyperfine):

Benchmark 1: target/release/rg "[A-Z]+_SUSPEND.*A" ../linux-stable
  Time (mean ± σ):      83.5 ms ±   4.7 ms    [User: 372.5 ms, System: 521.2 ms]
  Range (min … max):    74.0 ms …  92.6 ms    30 runs

Benchmark 2: rg "[A-Z]+_SUSPEND.*A" ../linux-stable
  Time (mean ± σ):     269.9 ms ±   9.4 ms    [User: 2540.1 ms, System: 398.2 ms]
  Range (min … max):   255.3 ms … 297.2 ms    30 runs

Summary
    target/release/rg "[A-Z]+_SUSPEND.*A" ../linux-stable ran
    3.23 ± 0.22 times faster than rg "[A-Z]+_SUSPEND.*A" ../linux-stable

TODO / I'd appreciate help with:

  • proving that this approach/algorithm is mathematically sound (i.e. it doesn't lead to valid line candidates being skipped);
  • thorough benchmarking such that we're sure this is actually an optimization overall and not a pessimization.

Notes:

  • Since the approach that I'm trying to introduce will sometimes extract fewer literals than the other literal-extraction techniques, perhaps we could explore a heuristic that chooses between or combines those different approaches. For example: whenever my approach has a literal to search for but doesn't find it in the haystack, we know for sure that the entire haystack can be discarded; but if the required literals are found, that doesn't mean my approach necessarily filters out lines faster than the other techniques. So we could try doing offset = my_technique(haystack[...]) and then resume with other_techniques(haystack[offset...]).
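One possible shape for that combination heuristic, as a sketch under assumptions: `combined_prefilter` is a hypothetical name, `str::find` stands in for memmem(), and the choice of resume point (the start of the line containing the first necessary literal) is my reading of the note above, not code from this patch:

```rust
/// Returns None if the ordered literal sequence is missing, meaning the
/// whole haystack can be discarded. Otherwise returns an offset from
/// which the existing prefilter techniques could resume scanning.
fn combined_prefilter(haystack: &str, ordered_lits: &[&str]) -> Option<usize> {
    let mut offset = 0;
    let mut first_hit = None;
    for lit in ordered_lits {
        // `?`: any missing literal means the haystack has no candidates.
        let pos = haystack[offset..].find(lit)?;
        if first_hit.is_none() {
            first_hit = Some(offset + pos);
        }
        offset += pos + lit.len();
    }
    // A matching line must contain the first necessary literal, so the
    // other techniques need not look at anything before that line.
    let start = first_hit
        .map_or(0, |p| haystack[..p].rfind('\n').map_or(0, |i| i + 1));
    Some(start)
}

fn main() {
    // "d" never follows "a": reject the whole haystack.
    assert_eq!(combined_prefilter("xx d xx a", &["a", "d"]), None);
    // Both found in order: resume from the line containing the first "a".
    assert_eq!(combined_prefilter("foo\nbar a d\n", &["a", "d"]), Some(4));
}
```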

Improvements:

* extract literals from alternations by taking the common prefixes and
  suffixes out of the alternation itself;
* use the minimum required length from the HIR if it's bigger than the
  one this new approach calculates;
* make use of the new literal-extraction method, but still fall back to
  the current methods in order to use *all* extracted literals, not
  just the necessary, order-aware ones.