Defines a Thompson NFA and provides the PikeVM and
BoundedBacktracker regex engines.
A Thompson NFA (non-deterministic finite automaton) is arguably the central data type in this library. It is the result of what is commonly referred to as “regex compilation.” That is, turning a regex pattern from its concrete syntax string into something that can run a search looks roughly like this:
- A &str is parsed into a regex-syntax::ast::Ast.
- An Ast is translated into a regex-syntax::hir::Hir.
- An Hir is compiled into an NFA.
- The NFA is then used to build one of a few different regex engines:
  - An NFA is used directly in the PikeVM and BoundedBacktracker engines.
  - An NFA is used by a hybrid NFA/DFA to build out a DFA’s transition table at search time.
  - An NFA, assuming it is one-pass, is used to build a full one-pass DFA ahead of time.
  - An NFA is used to build a full DFA ahead of time.
The meta regex engine makes all of these choices for you based
on various criteria. However, if you have a lower level use case, you can
build any of the above regex engines and use them directly. But you must start
here by building an NFA.
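To make "an NFA is used directly" a little more concrete, here is a minimal sketch of a byte-oriented NFA being simulated by tracking a set of active states. It uses only the standard library. The names `ToyState`, `add_state`, `toy_is_match` and `toy_nfa_ab_star_c` are hypothetical and are not this crate's types or API. The sketch also ignores captures, unanchored search and match offsets, which the real engines handle.

```rust
/// Hypothetical instruction set for a toy byte-oriented NFA.
/// These are illustration-only types, not the crate's `State` enum.
#[derive(Clone, Copy, Debug)]
enum ToyState {
    /// Consume one byte in `lo..=hi`, then move to state `next`.
    ByteRange { lo: u8, hi: u8, next: usize },
    /// Epsilon transition: follow both alternatives without consuming input.
    Split { first: usize, second: usize },
    /// Accepting state.
    Match,
}

/// Add `id`, and everything reachable from it through epsilon transitions,
/// to the set of active states. `seen` prevents visiting a state twice.
fn add_state(nfa: &[ToyState], id: usize, set: &mut Vec<usize>, seen: &mut [bool]) {
    if seen[id] {
        return;
    }
    seen[id] = true;
    match nfa[id] {
        ToyState::Split { first, second } => {
            add_state(nfa, first, set, seen);
            add_state(nfa, second, set, seen);
        }
        _ => set.push(id),
    }
}

/// Returns true if the entire haystack matches (anchored at both ends).
fn toy_is_match(nfa: &[ToyState], start_id: usize, haystack: &[u8]) -> bool {
    let mut current = Vec::new();
    let mut seen = vec![false; nfa.len()];
    add_state(nfa, start_id, &mut current, &mut seen);
    for &byte in haystack {
        let mut next_set = Vec::new();
        let mut seen = vec![false; nfa.len()];
        for &id in &current {
            if let ToyState::ByteRange { lo, hi, next } = nfa[id] {
                if lo <= byte && byte <= hi {
                    add_state(nfa, next, &mut next_set, &mut seen);
                }
            }
        }
        current = next_set;
    }
    current.iter().any(|&id| matches!(nfa[id], ToyState::Match))
}

/// A hand-assembled toy NFA for the pattern `ab*c`.
fn toy_nfa_ab_star_c() -> Vec<ToyState> {
    vec![
        ToyState::ByteRange { lo: b'a', hi: b'a', next: 1 }, // 0: 'a'
        ToyState::Split { first: 2, second: 3 },             // 1: loop or exit
        ToyState::ByteRange { lo: b'b', hi: b'b', next: 1 }, // 2: 'b'
        ToyState::ByteRange { lo: b'c', hi: b'c', next: 4 }, // 3: 'c'
        ToyState::Match,                                     // 4: accept
    ]
}
```

With this toy machine, `toy_is_match(&toy_nfa_ab_star_c(), 0, b"abbbc")` returns `true` and `toy_is_match(&toy_nfa_ab_star_c(), 0, b"ab")` returns `false`. Each input byte is examined once per active state, which is the general shape of an NFA simulation that never backtracks.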
§Details
It is perhaps worth expanding a bit more on what it means to go through the
&str -> Ast -> Hir -> NFA process.
- Parsing a string into an Ast gives it a structured representation. Crucially, the size and amount of work done in this step is proportional to the size of the original string. No optimization or Unicode handling is done at this point. This means that parsing into an Ast has very predictable costs. Moreover, an Ast can be round-tripped back to its original pattern string as written.
- Translating an Ast into an Hir is a process by which the structured representation is simplified down to its most fundamental components. Translation deals with flags such as case insensitivity by converting things like (?i:a) to [Aa]. Translation is also where Unicode tables are consulted to resolve things like \p{Emoji} and \p{Greek}. It also flattens each character class, regardless of how deeply nested it is, into a single sequence of non-overlapping ranges. All the various literal forms are thrown out in favor of one common representation. Overall, the Hir is small enough to fit into your head and makes analysis and other tasks much simpler.
- Compiling an Hir into an NFA formulates the regex into a finite state machine whose transitions are defined over bytes. For example, an Hir might have a Unicode character class corresponding to a sequence of ranges defined in terms of char. Compilation is then responsible for turning those ranges into a UTF-8 automaton. That is, an automaton that matches the UTF-8 encoding of just the codepoints specified by those ranges. Otherwise, the main job of an NFA is to serve as a byte-code of sorts for a virtual machine. It can be seen as a sequence of instructions for how to match a regex.
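Two of the steps above can be sketched with the standard library alone: flattening a character class into sorted, non-overlapping ranges, and turning a range of char values into byte-level ranges over its UTF-8 encoding. The functions `canonicalize` and `two_byte_sequences` are hypothetical helpers written for this illustration, not the routines that regex-syntax or this crate use. The second is deliberately restricted to codepoints in U+0080..=U+07FF (two-byte UTF-8) and makes no attempt to minimize its output.

```rust
/// Merge overlapping or adjacent ranges into a sorted, non-overlapping list.
/// Assumes `lo <= hi` for every input range. Adjacency is purely numeric here,
/// so the surrogate gap is not treated specially (a simplification).
fn canonicalize(mut ranges: Vec<(char, char)>) -> Vec<(char, char)> {
    ranges.sort();
    let mut merged: Vec<(char, char)> = Vec::new();
    for (lo, hi) in ranges {
        if let Some(last) = merged.last_mut() {
            // Overlapping or directly adjacent: extend the previous range.
            if (lo as u32) <= (last.1 as u32) + 1 {
                if hi > last.1 {
                    last.1 = hi;
                }
                continue;
            }
        }
        merged.push((lo, hi));
    }
    merged
}

/// For a char range lying entirely within U+0080..=U+07FF, return a list of
/// byte-range sequences `[lead byte range, continuation byte range]` that
/// together match exactly the UTF-8 encodings of the chars in `lo..=hi`.
fn two_byte_sequences(lo: char, hi: char) -> Vec<[(u8, u8); 2]> {
    let two_byte = '\u{80}'..='\u{7FF}';
    assert!(two_byte.contains(&lo) && two_byte.contains(&hi) && lo <= hi);
    let (mut lb, mut hb) = ([0u8; 4], [0u8; 4]);
    lo.encode_utf8(&mut lb);
    hi.encode_utf8(&mut hb);
    let mut seqs = Vec::new();
    if lb[0] == hb[0] {
        // Same lead byte: a single sequence covers the whole range.
        seqs.push([(lb[0], lb[0]), (lb[1], hb[1])]);
    } else {
        // Partial first lead byte, then full lead bytes, then partial last.
        seqs.push([(lb[0], lb[0]), (lb[1], 0xBF)]);
        if lb[0] + 1 < hb[0] {
            seqs.push([(lb[0] + 1, hb[0] - 1), (0x80, 0xBF)]);
        }
        seqs.push([(hb[0], hb[0]), (0x80, hb[1])]);
    }
    seqs
}
```

For example, the class made of `m-z`, `a-f`, `d-n` and `0-9` canonicalizes to the two ranges `0-9` and `a-z`. The range from U+03B1 to U+03C9 (Greek small alpha through omega) becomes two byte-level sequences: lead byte 0xCE followed by 0xB1..=0xBF, and lead byte 0xCF followed by 0x80..=0x89. Each sequence maps naturally onto a chain of byte-range transitions in an NFA.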
Modules§
- backtrack
- An NFA backed bounded backtracker for executing regex searches with capturing groups.
- pikevm
- An NFA backed Pike VM for executing regex searches with capturing groups.
Structs§
- BuildError
- An error that can occur during the construction of a Thompson NFA.
- Builder
- An abstraction for building Thompson NFAs by hand.
- Compiler
- A builder for compiling an NFA from a regex’s high-level intermediate representation (HIR).
- Config
- The configuration used for a Thompson NFA compiler.
- DenseTransitions
- A sequence of transitions used to represent a dense state.
- NFA
- A byte oriented Thompson non-deterministic finite automaton (NFA).
- PatternIter
- An iterator over all pattern IDs in an NFA.
- SparseTransitions
- A sequence of transitions used to represent a sparse state.
- Transition
- A single transition to another state.
Enums§
- State
- A state in an NFA.
- WhichCaptures
- A configuration indicating which kinds of State::Capture states to include.