Add fast line rejection optimization leveraging bridges in the HIR graph #3272
Open
inicula wants to merge 7 commits into BurntSushi:master
Improvements:
* extract literals from alternations by taking the common prefixes and suffixes out of the alternation itself;
* use the minimum required length from the HIR if it's bigger than the one this new approach calculates;
Make use of the new literal-extraction method, but still fall back to the current methods in order to get to use *all* extracted literals, not just the necessary, order-aware ones.
This patch introduces a (currently tentative) optimization for fast rejection of line candidates.

The optimization is based on the following observations:

* The HIR structure essentially contains the steps for doing a Thompson construction by recursing into the various HIR sub-components.
* After we obtain the Thompson NFA, we consider it as a graph but make its edges undirected. The 'weight' on an undirected edge `A-B` is whatever the `A->B` transition needs to consume in the Thompson NFA.
* In this undirected Thompson NFA graph, the edges which are bridges and consume a single character are part of string literals that must appear in any line that matches the target regular expression.

For example, consider the regular expression `"a(b|c)d"`. The Thompson NFA for this expression will be:

[figure: Thompson NFA for `"a(b|c)d"`]

When we disregard the direction of the edges, we notice that the edges `(START, 1)` and `(6, END)` are bridges: removing either one disconnects `START` from `END`, so every path from `START` to `END` must traverse them. Consequently, the characters `"a"` and `"d"` must appear, in this exact order, in any line that matches the given regular expression. This property gives us a way of quickly rejecting lines that don't match the given regular expression (or, alternatively, a way of quickly finding lines that possibly match it).

The key difference between this approach and the string literal filtering that `ripgrep` already does (with Aho-Corasick) is that this approach preserves the order in which literals must appear.

The way we extract the bridge edges, and thus the final literal sequence, is as follows:
Let `NCS(RegularExpression) -> [Character|Break]` denote the Necessary Character Sequence of a regular expression. `NCS` is defined as follows:

* `NCS(<EMPTY STRING>) = []`
* `NCS("<CHARACTER>") = ["<CHARACTER>"]`
* `NCS(<R1><R2>) = NCS(<R1>) concat NCS(<R2>)`, i.e. the NCS of a concatenation is the concatenation of the separate NCS applications.
* `NCS(<R1>|<R2>) = [BREAK]`, i.e. the NCS of an alternation disregards the intermediate character sequences of the two possible expressions and introduces a special `BREAK` character in the sequence, which is discussed more below.

You can also think about it like this:

* single characters are bridges (i.e. `NCS("<CHARACTER>")`);
* concatenations (i.e. `NCS(<R1><R2>)`) concatenate the bridges of their sub-expressions;
* alternations introduce a `BREAK` symbol.

Example: `NCS("a(b|c)d") = ["a", BREAK, "d"]`

To transform this `[Character|Break]` sequence into the final literal sequence (i.e. `[Literal]`, where a `Literal` is a sequence of `Character`s), we just iterate over it, appending each `Character` to a temporary `Literal` result. When we encounter a `BREAK`, it means the current literal has ended and we need to begin a new one.

To better understand the purpose of the `BREAK` symbol, consider the same example `"a(b|c)d"`: since `NCS("a")` is `["a"]` and `NCS("d")` is `["d"]`, we need a result in the middle (i.e. `NCS("b|c")`) which lets us know that we cannot concatenate `["a"]` with `["d"]` as if the literal `"ad"` were necessary for matches. In other words, we need to know when to `BREAK` a literal. When do we need `BREAK` in the sequence? Whenever we have a regular (sub)expression that has non-deterministic λ-transitions in its Thompson NFA!

These non-deterministic λ-transitions can only be introduced by:
* alternations (e.g. `"a|b"`);
* character classes (e.g. `"[ab]"`, `"[A-Z]"`);
* repetitions of the form `<R1>{<min>, <max>}` where `<max>` is strictly greater than `<min>` (this includes special cases like `"a?"`, `"a+"`, and `"a*"`).

Finally, given a haystack, we use the resulting literal sequence of the regular expression by searching for each literal, in order, with `memmem()`. If one of the literals does not exist, then the haystack has no candidates.

Benchmark example (hyperfine):
TODO / I'd appreciate help with:
Notes:
`offset = my_technique(haystack[...])` and then resume with `other_techniques(haystack[offset...])`.
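
To make the `NCS` rules and the in-order literal search described in this PR more concrete, here is a minimal sketch in Rust. It is not the code from this patch: the toy `Re` AST stands in for the real HIR, and `Sym`, `ncs`, `literals`, `find`, and `may_match` are hypothetical names chosen for illustration. `find` is a naive stand-in for `memmem()`. Treating an exact repetition (`<min>` equal to `<max>`) as a repeated concatenation is an assumption read off the "strictly greater" condition above.

```rust
/// Toy regex AST standing in for the HIR (hypothetical, for illustration only).
enum Re {
    Empty,
    Char(u8),
    Concat(Vec<Re>),
    Alt(Vec<Re>),
    /// A character class such as `[ab]`; its members do not matter for NCS.
    Class,
    Repeat { sub: Box<Re>, min: u32, max: Option<u32> },
}

/// One element of the Necessary Character Sequence.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Sym {
    Char(u8),
    Break,
}

/// Compute NCS following the rules listed in the description.
fn ncs(re: &Re) -> Vec<Sym> {
    match re {
        Re::Empty => vec![],
        Re::Char(c) => vec![Sym::Char(*c)],
        // Concatenation: concatenate the NCS of each sub-expression.
        Re::Concat(subs) => subs.iter().flat_map(ncs).collect(),
        // Alternations and classes introduce non-determinism, so they break literals.
        Re::Alt(_) | Re::Class => vec![Sym::Break],
        // Assumption: an exact repetition behaves like `min` concatenated copies.
        Re::Repeat { sub, min, max } if *max == Some(*min) => {
            let inner = ncs(sub);
            (0..*min).flat_map(|_| inner.clone()).collect()
        }
        // Any other repetition (`a?`, `a+`, `a*`, `a{2,5}`) breaks the literal.
        Re::Repeat { .. } => vec![Sym::Break],
    }
}

/// Split the `[Character|Break]` sequence into the ordered list of literals.
fn literals(seq: &[Sym]) -> Vec<Vec<u8>> {
    let mut out = Vec::new();
    let mut cur = Vec::new();
    for sym in seq {
        match sym {
            Sym::Char(c) => cur.push(*c),
            Sym::Break => {
                if !cur.is_empty() {
                    out.push(std::mem::take(&mut cur));
                }
            }
        }
    }
    if !cur.is_empty() {
        out.push(cur);
    }
    out
}

/// Naive substring search, used here only as a stand-in for `memmem()`.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Returns `false` if some literal cannot be found, in order, in the haystack.
fn may_match(haystack: &[u8], lits: &[Vec<u8>]) -> bool {
    let mut pos = 0;
    for lit in lits {
        match find(&haystack[pos..], lit) {
            // Continue searching after the end of this literal.
            Some(i) => pos += i + lit.len(),
            None => return false,
        }
    }
    true
}
```

For the `"a(b|c)d"` example, `ncs` yields `["a", BREAK, "d"]`, `literals` turns that into `["a", "d"]`, and `may_match` rejects a haystack such as `"da"` because `"d"` never occurs after `"a"`. A `true` result only marks a candidate; the full regex engine still has to confirm the match.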