I come from the world of "network intrusion detection", where we search network traffic for patterns indicating hacker activity. In many cases, this means solving the same problem as grep, matching complex regexes, but doing so very fast, at 10 gbps on desktop-class hardware (quad-core Core i7). We in the intrusion-detection world have seen every possible variation of the problem. Concepts like "Boyer-Moore" and "Aho-Corasick" may seem new to you, but they are old hat to us.
Your first problem is getting the raw data from the filesystem into memory. As the thread suggests, one way of doing this is "memory-mapping" the file. Another option would be "asynchronous I/O". When done right, either solution gets you "zero-copy" performance. On modern Intel CPUs, the disk controller will DMA the block directly into the CPU's L3 cache. Network cards work the same way, which is why getting 10 gbps from the network card is trivial, even on slow desktop systems.
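A minimal sketch of the memory-mapping approach on POSIX systems (the helper name `map_whole_file` is mine, and error handling is trimmed to the essentials):

```c
/* Sketch: map an entire file read-only so the matcher can scan it in
 * place, with no read() copies into a user buffer.  POSIX-only. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static const char *map_whole_file(const char *path, size_t *len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }
    *len = (size_t)st.st_size;
    /* The kernel pages data in on demand; nothing is copied into a
     * separate user-space buffer. */
    const char *p = mmap(NULL, *len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after close() */
    return p == MAP_FAILED ? NULL : p;
}
```

On Linux you could additionally hint the access pattern with `madvise(..., MADV_SEQUENTIAL)`; the sketch above sticks to the portable core.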
Your next problem: stop with the line parsing, idiots. All these command-line tools first scan to the end-of-line, either explicitly (with memchr()) or implicitly (by reading input with fgets()). This double-parses the data -- and even memchr() is likely slower than the regex algorithm itself, unless it uses the SSE4.2 "string and text" (STTNI) instructions that can process 16 bytes per clock cycle.
I mention this because all the command-line tools, from grep to awk to wc, suffer from this problem. Consider wc, "word count", as the baseline for a simple command-line, text-processing utility. What can be simpler than counting the number of words in an input file? In fact, wc is needlessly complex and slow -- it double-parses end-of-lines, among other things. It therefore represents a floor: you can almost always parse text files faster than wc can. Benchmark your text parser, and if it fails to beat wc, go back and fix it.
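As a sketch of what parsing without line-splitting looks like, here is a single-pass word counter in the spirit of `wc -w`: one bit of state, no end-of-line handling at all, with '\n' treated as just another whitespace byte (the function name `count_words` is my own):

```c
/* Single-pass word counter: a two-state machine (in-word vs. between
 * words).  No line parsing -- newline is simply whitespace. */
#include <ctype.h>
#include <stddef.h>

static size_t count_words(const unsigned char *buf, size_t len) {
    size_t words = 0;
    int in_word = 0;
    for (size_t i = 0; i < len; i++) {
        int is_space = isspace(buf[i]);
        words += (!is_space && !in_word);  /* count transitions into a word */
        in_word = !is_space;
    }
    return words;
}
```

This is the shape to benchmark against the real wc: one pass, one branch-light update per byte.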
I've created a DNS server that must parse the 8-gigabyte ".com" zone file. It does so several times faster than wc, even though the parsing (and building an in-memory database) is a much harder task. This demonstrates the cost that end-of-line parsing imposes on code.
NFAs and DFAs
A regex gets converted into a finite automaton: either an NFA (nondeterministic finite automaton) or a DFA (deterministic finite automaton).
An NFA uses a small amount of memory, but a lot more CPU power. Some complicated regexes will cause unbounded CPU time to be used. This is an actual vulnerability in systems that allow users to submit regexes -- they can submit patterns that send the CPU into a nearly infinite loop.
A DFA uses a tiny amount of CPU power, but a correspondingly large amount of memory. Some complicated regexes cause the amount of memory to explode -- though these are typically different expressions than the ones that cause NFA problems. Again, this is a security problem, as hackers can submit regexes that consume all memory.
The perfect regex system combines DFA and NFA. The DFA handles those patterns that encode compactly in a table, plus the leading portions of patterns that will eventually be matched by the NFA. Such a system is very fast in the normal case while also being memory-efficient. It should also detect and reject hostile patterns that would cause memory or CPU to explode.
A DFA is essentially just a big table, with a state variable pointing to the current row. Each new byte of input is looked up in the current row to find the next row to point the state at.
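That inner loop can be sketched in a few lines of C. Here the table is hand-built for the fixed string "abc" -- a toy stand-in for a compiled regex table -- so that matching is exactly one table lookup per input byte:

```c
/* Table-driven DFA that counts occurrences of the fixed string "abc".
 * dfa[state][byte] gives the next state; one L1 load per input byte. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { STATES = 4, ACCEPT = 3 };
static uint8_t dfa[STATES][256];

static void build_dfa(void) {
    memset(dfa, 0, sizeof dfa);           /* default: fall back to state 0 */
    for (int s = 0; s < STATES; s++)
        dfa[s]['a'] = 1;                  /* 'a' always restarts a match */
    dfa[1]['b'] = 2;                      /* saw "a", then 'b'  -> "ab"  */
    dfa[2]['c'] = ACCEPT;                 /* saw "ab", then 'c' -> "abc" */
}

static size_t count_matches(const uint8_t *buf, size_t len) {
    size_t hits = 0;
    uint8_t state = 0;
    for (size_t i = 0; i < len; i++) {
        state = dfa[state][buf[i]];       /* the entire matching step */
        hits += (state == ACCEPT);
    }
    return hits;
}
```

A real regex compiler generates the table automatically, but the scan loop looks the same: no branching on the pattern, just a load per byte.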
The speed of a DFA is about 9 instructions per byte of input, regardless of the size of the table. Since Intel CPUs can easily execute 3 instructions per clock cycle, that's roughly 3 clocks per byte of input.
However, the limit is not the number of instructions executed, but the speed of the L1 cache. Each new byte of input requires reading a table row from memory. For typical input, the hot rows stay resident in L1 cache. Modern x86 processors take 4 clock cycles to load from L1 cache, so each byte of input costs 4 clock cycles.
Consider Intel's latest "Skylake" Core i7 CPU, a quad-core running at 4.0 GHz. At 4 clocks per byte, that translates to a DFA running at 1 gigabyte per second per core, or 8 gbps. On four cores, that's essentially a theoretical speed of 32 gbps. That's why modern desktop CPUs are easily fast enough for something like a 10-gbps network intrusion detection system.
Note that much the same logic applies to low-end ARM CPUs, such as those found in cellphones and microservers like the Raspberry Pi.
The original grep thread, however, started with the Boyer-Moore algorithm. Suppose you are searching for the ten-character pattern "xxxxxxxxxx". Instead of looking at each byte of input, you can jump forward 10 characters at a time and test for an 'x'. If there's no 'x', then the pattern can't fit within the previous 10 bytes, and you just skip forward again. But if there is an 'x', then you have to stop, search backwards for the start of a possible match, and test for a full match.
More complicated patterns like "abracadabra" mean skipping forward, then testing whether the character is one of "abcdr". Testing for multiple words at the same time works the same way: each skip is the length of the shortest word, and the characters tested are the combined character sets of all the words.
Thus, for a 10-character pattern, Boyer-Moore is essentially 10x faster than a regex DFA. On the other hand, the scheme quickly breaks down when there are lots of patterns or any short patterns. As soon as a 3-byte pattern enters the mix, or there are enough tested characters that they start matching on random input, the entire system becomes much slower than a DFA.
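The skip-ahead idea can be sketched with the Horspool simplification of Boyer-Moore, which keeps only the "bad character" shift table (the function name `bmh_search` is my own):

```c
/* Boyer-Moore-Horspool sketch: precompute, for every byte value, how
 * far the search window can shift when that byte appears at the end
 * of the window.  For a long pattern of rare characters, most input
 * bytes are never examined at all. */
#include <stddef.h>
#include <string.h>

static const char *bmh_search(const char *hay, size_t hlen,
                              const char *pat, size_t plen) {
    if (plen == 0 || plen > hlen) return NULL;
    size_t shift[256];
    for (size_t i = 0; i < 256; i++)
        shift[i] = plen;                              /* default: full skip */
    for (size_t i = 0; i + 1 < plen; i++)
        shift[(unsigned char)pat[i]] = plen - 1 - i;  /* distance from end */
    size_t pos = 0;
    while (pos + plen <= hlen) {
        if (memcmp(hay + pos, pat, plen) == 0)        /* verify candidate */
            return hay + pos;
        pos += shift[(unsigned char)hay[pos + plen - 1]];
    }
    return NULL;
}
```

Note how the breakdown described above is visible in the table: a short pattern or a pattern set with many distinct characters drives the shift values toward 1, at which point every byte gets examined and the skip advantage evaporates.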
The perfect grep would therefore:
- get data into memory with zero-copy (either memory-map or async I/O)
- skip the newline parsing entirely
- use a mixed DFA and NFA for regexes
- use Boyer-Moore instead for simple patterns