
Thursday, February 21, 2013

Multi-core scaling: it’s not multi-threaded


I’m writing a series of posts based on my Shmoocon talk. In this post, I’m going to discuss “multi-core scaling”.

In the decade leading up to 2001, Intel CPUs went from 33-MHz to 3-GHz, roughly a hundred-fold increase in speed. In the decade since, they’ve been stuck at 3-GHz. Instead of faster clock speeds, they’ve been getting more logic. Instead of one instruction per clock cycle, they now execute four (“superscalar”). Instead of one computation per instruction, they now do eight (“SIMD”). Instead of a single CPU on a chip, they now put four (“multi-core”).
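To make the “eight computations per instruction” claim concrete, here’s a minimal sketch (my own illustration, not benchmark code) of SIMD: with AVX, a single vaddps instruction adds eight single-precision floats at once. It assumes an AVX-capable CPU and a compiler flag like -mavx.

    #include <immintrin.h>   /* AVX intrinsics */
    #include <stdio.h>

    int main(void)
    {
        /* Two vectors of eight floats each. */
        __m256 a = _mm256_set_ps(8, 7, 6, 5, 4, 3, 2, 1);
        __m256 b = _mm256_set1_ps(10.0f);

        /* One instruction (vaddps) performs eight additions at once. */
        __m256 sum = _mm256_add_ps(a, b);

        float out[8];
        _mm256_storeu_ps(out, sum);
        for (int i = 0; i < 8; i++)
            printf("%.0f ", out[i]);   /* prints 11 12 13 14 15 16 17 18 */
        printf("\n");
        return 0;
    }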

However, desktop processors have been stuck at four cores for several years now. That’s because the software is lagging. Multi-threaded software goes up to about four cores, but past that point, it fails to get any benefit from additional cores. Worse, adding cores past four often makes software go slower.
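A quick sketch of why that happens (my illustration, not code from the talk): if every thread funnels through one lock and one shared cache line, adding cores just adds contention; give each core its own data and the same work scales. Build with -pthread and time scalable() against contended().

    #include <pthread.h>
    #include <stdio.h>

    #define THREADS 8
    #define ITERS   10000000UL

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static unsigned long shared_count;              /* contended by all cores */
    static unsigned long local_count[THREADS * 16]; /* padded: one cache line per thread */

    void *contended(void *arg)
    {
        (void)arg;
        for (unsigned long i = 0; i < ITERS; i++) {
            pthread_mutex_lock(&lock);      /* every core fights for this lock */
            shared_count++;                 /* ...and for this cache line */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    void *scalable(void *arg)
    {
        long id = (long)arg;
        for (unsigned long i = 0; i < ITERS; i++)
            local_count[id * 16]++;         /* private cache line, no sharing */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[THREADS];
        /* Swap scalable for contended here to compare how they scale. */
        for (long i = 0; i < THREADS; i++)
            pthread_create(&t[i], NULL, scalable, (void *)i);
        for (int i = 0; i < THREADS; i++)
            pthread_join(t[i], NULL);

        unsigned long total = shared_count;
        for (int i = 0; i < THREADS; i++)
            total += local_count[i * 16];
        printf("total = %lu\n", total);
        return 0;
    }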

Wednesday, February 20, 2013

Custom stack: it goes to 11

I’m writing up my Shmoocon preso as a series of blog posts. Today I’m going to talk about custom network stacks.

The way network stacks work today is to let the kernel do all the heavy lifting. It starts with kernel drivers for Ethernet cards, which pass packets to the kernel’s TCP/IP stack. Upon reception, a packet must make an arduous climb up the network stack until it finally escapes to user-mode. (User-mode is where applications run).
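For contrast, here’s what the conventional model looks like from the application’s side, as a minimal UDP sketch of my own (not from the talk): the driver, IP, and UDP processing all happen inside the kernel, and the payload only crosses into user-mode at the recvfrom() system call.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* a kernel object, not ours */

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(9999);               /* arbitrary example port */
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        char buf[2048];
        for (;;) {
            /* The Ethernet driver, IP, and UDP layers already ran inside the
             * kernel; this call copies the payload up into user-mode. */
            ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
            if (n < 0)
                break;
            printf("got %zd bytes\n", n);
        }
        close(fd);
        return 0;
    }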

Tuesday, February 19, 2013

Unlearning College


I’m writing up my Shmoocon talk as a series of blog posts. In this post, I’m going to talk about the pernicious problem of college indoctrination. What colleges teach about networking is bad. A lot of the material is out of date. A lot is just plain inaccurate (having never been “in date”). A lot is aspirational: networks don’t work that way, but your professor wishes they would. As a consequence, students leaving college are helpless, failing at real-world problems like portability, reliability, cybersecurity, and scalability.

Sunday, February 17, 2013

Scalability: it's the question that drives us

In order to grok the concept of scalability, I've drawn a series of graphs. Talking about "scalability" is hard because we instinctively translate the numbers into "performance". But the two concepts are unrelated. We say things like "NodeJS is 85% as fast as Nginx", but speed doesn't matter here, scalability does. The important difference between the two is how they scale, not how they perform. I'm going to show these graphs in this post.

Monday, February 11, 2013

1996 was the year scalability changed

I'm doing a presentation at Shmoocon this weekend on scalability. I've now realized that it's an 8-hour presentation I'm trying to compress into 50 minutes, so I'm throwing huge gobs of stuff out. One of the things I want to discuss is the history of scalability. In particular, I want to go back to 1996. That was the year everything changed.

Back then, dot-coms were buying up Solaris SPARC and SGI MIPS servers as fast as they could. That's because everyone knew that "Wintel" personal computers were toys that couldn't keep up with large problems.

Then, in 1996 Intel shipped the "Pentium Pro" processor (aka. the P6). In addition, Microsoft shipped WinNT 4.0. The combination was faster and more scalable than any competing RISC/UNIX combination. They were also a heck of a lot cheaper.


What made the Pentium Pro different was that it was a radically new design. Intel completely discarded the design of the old Pentium. By translating x86 instructions into internal RISC-like "micro-ops", it got rid of most of the problems of CISC. At the same time, it had numerous architectural improvements that put it years ahead of RISC processors in things like superscalar out-of-order execution and caching. The consequence was that the Pentium Pro was clearly faster than all competing RISC processors on pretty much every benchmark.

In much the same way, Windows NT was a completely new operating system design. What we call "Windows" was just a backwards compatibility layer, like WINE is on Linux. This new operating system had many futuristic features, like multi-core capabilities, multi-threading, and "IO completion ports". Moreover, Microsoft's web server software that used these capabilities, IIS 4.0, came with the operating system.

(Linux also added SMP support in 1996, but with things like the big kernel lock, it was far behind Windows in actually being useful. The scalable epoll wasn't added until 2002).
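For the curious, the "scalable polling" referred to above looks roughly like the sketch below; this is my own sketch, not part of the original post. With epoll, the kernel hands back only the descriptors that are ready, so the cost is per-event rather than per-socket, something select() and poll() could never do.

    /* Sketch of an epoll event loop (Linux-only, error handling omitted). */
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define MAX_EVENTS 64

    void event_loop(int listen_fd)
    {
        int epfd = epoll_create1(0);

        struct epoll_event ev = {0};
        ev.events = EPOLLIN;
        ev.data.fd = listen_fd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

        struct epoll_event events[MAX_EVENTS];
        for (;;) {
            /* Blocks until something is ready; cost is per-event, not per-socket. */
            int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
            for (int i = 0; i < n; i++) {
                if (events[i].data.fd == listen_fd) {
                    int client = accept(listen_fd, NULL, NULL);
                    struct epoll_event cev = {0};
                    cev.events = EPOLLIN;
                    cev.data.fd = client;
                    epoll_ctl(epfd, EPOLL_CTL_ADD, client, &cev);
                } else {
                    char buf[4096];
                    ssize_t len = read(events[i].data.fd, buf, sizeof(buf));
                    if (len <= 0)
                        close(events[i].data.fd);  /* closing removes it from epoll */
                    /* ...otherwise handle len bytes... */
                }
            }
        }
    }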


I mention this because of the powerlessness of hard numbers. Unix people thought of Windows and Intel in terms of Windows 95 and the old Pentium processor. This blinded them to the new reality of WinNT and the Pentium Pro, which were complete ground-up redesigns, unrelated to their predecessors in anything but name (and backwards compatibility). The Windows people were unhappy as well. The Pentium Pro was designed for 32-bit software and ran the older 16-bit Windows software poorly. Likewise, WinNT wasn't fully backwards compatible with old Windows, especially with games.

Thus, the groundbreaking event of the Pentium Pro plus WinNT 4.0 went largely unnoticed. The performance was astronomical and the price cheap, yet nobody cared. Dotcoms continued to invest in hugely expensive but underperforming hardware like Solaris SPARC.


The lesson here is about future history. Looking back, it's obvious why Intel won the competition against RISC, but it wasn't obvious back in 1996. Likewise, the superiority of SMP, threading, and scalable polling looks obvious, but it wasn't so back in 1996.

That's what my presentation is about: future-obvious ideas that are presently-obscured. Operating systems like Linux need a fast-path around the kernel for data-plane processing. This is obvious to engineers working on the bleeding edge, but it's still a bit obscure for the mainstream.

Monday, August 20, 2012

Software networks: commodity x86 vs. network processors

“when Alexander saw the breadth of his domain he wept for there were no more worlds to conquer”

The website http://extremetech.com has a great graph showing how commodity Intel x86 processors have overtaken the world, first desktops in the 1980s, then the data center in the 1990s, then supercomputers in the 2000s. So what’s next for Intel to conquer?

There are two answers: mobile (phones, pads) and network appliances. You are probably aware of the first, with the battle raging between Windows 8/Android on Intel “Atom” processors and ARM processors, but you might not have heard of the second fight.

Friday, May 25, 2012

DNS vs. large memory pages (technical)

As everyone knows, an important threat against the Internet is that of a coordinated DDoS attack against the root and TLD DNS servers. The way I'd solve it is with a simple inline device that both blocks simple attacks from hitting the DNS server and answers simple queries itself, offloading the main server even if it has failed. This can be done for $2000: half for the desktop machine, and the other half for the dual-port 10-gig Ethernet card.

Tuesday, February 20, 2007

Network Coding, Part 2

[ Part 1 ]

A vuln was discovered in Snort's DCE-RPC reassembly, similar to last year's bug in their SunRPC reassembly. These problems stem from Snort's core architecture. There are two ways of constructing a network application like intrusion-detection: streaming and backtracking. Snort uses the backtracking model, which is more prone to such mistakes than the streaming model.

In a streaming system, once a byte of input is analyzed, it will no longer be re-analyzed. In a backtracking system like Snort, the technology may go back and re-analyze previous bytes, requiring more complicated reassembly architecture to store those bytes. Streaming models are inherently faster, more reliable, and more secure - but much harder to program.

An intrusion-detection system has a choice whether to use backtracking or streaming technologies. The well-known pattern-matching algorithm Boyer-Moore works by skipping ahead, then backtracking, and would be inappropriate for a streaming system. On the other hand, the Aho-Corasick algorithm searches for patterns one byte at a time, and would work well in a streaming system.

The same applies to more complex pattern-matching using regular-expressions (regex). A regex represents a finite automaton. There are two basic ways that a finite automaton might work. Using an NFA, all possible combinations of the regex are tested at runtime using backtracking. Using a DFA, all possible combinations are put into a big table, and each streaming byte of input causes a transition to a new state in the table.
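To illustrate the DFA side, here is a sketch of my own (a single literal pattern rather than a full regex): the pattern is compiled into a transition table once, and matching is then one table lookup per input byte with no backtracking. Since the entire matcher state is one small integer, it can be carried across TCP segments.

    #include <stdio.h>
    #include <string.h>

    #define MAXPAT 32

    static unsigned char dfa[MAXPAT + 1][256];   /* dfa[state][byte] -> next state */

    /* Compile a literal pattern into a DFA (KMP-style construction). */
    static void build_dfa(const unsigned char *pat, int m)
    {
        memset(dfa, 0, sizeof(dfa));
        dfa[0][pat[0]] = 1;
        int restart = 0;                         /* fallback state on mismatch */
        for (int s = 1; s < m; s++) {
            for (int c = 0; c < 256; c++)
                dfa[s][c] = dfa[restart][c];     /* mismatch: act like the fallback state */
            dfa[s][pat[s]] = s + 1;              /* match: advance */
            restart = dfa[restart][pat[s]];
        }
    }

    /* Feed a fragment; returns the state to pass in with the next fragment. */
    static int dfa_feed(int state, const unsigned char *buf, size_t len, int m)
    {
        for (size_t i = 0; i < len; i++) {
            state = dfa[state][buf[i]];
            if (state == m) {                    /* full pattern seen */
                printf("match ending at offset %zu of this fragment\n", i);
                state = 0;                       /* demo: ignore overlapping matches */
            }
        }
        return state;
    }

    int main(void)
    {
        const unsigned char *pat = (const unsigned char *)"attack";
        int m = (int)strlen((const char *)pat);
        build_dfa(pat, m);

        /* Pattern split across two "TCP segments": the integer state is all
         * the reassembly we need. */
        int state = 0;
        state = dfa_feed(state, (const unsigned char *)"xxxatt", 6, m);
        state = dfa_feed(state, (const unsigned char *)"ackyyy", 6, m);
        return 0;
    }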

Both backtracking and streaming IDSes need to take care when writing regexes to avoid an explosion of possible states. When compiled as an NFA, a hacker can attack the system by causing all states to be traversed. A recent paper shows that a backtracking system like Snort can be DoSed with as little as 4-kbps of traffic by causing all backtracking states to be traversed. When compiled as a DFA, the explosion of states can cause all memory to be consumed when compiling the regex - what looks like a simple regex can, in fact, require a DFA of 5-gigabytes to store all the combinations.

The streaming model can be used for protocol-analysis as well as pattern-matching. There are not many examples in the open-source community, but a good one can be found in Mozilla's GIF parser (function gif_write() in GIF2.cpp). This code parses the GIF format one byte at a time as the image is streamed from the web-server, so that it can render it on the screen before the file has been completely downloaded. Since each byte is processed individually, each incoming fragment of data is processed by itself rather than being reassembled.

The Mozilla GIF parser looks almost identical to the GIF parser I wrote for the Proventia IDS/IPS. Its structure is similar to all the other 200-odd protocol decodes in Proventia, including the SMB and DCE-RPC parsers. These parsers decode the protocols as a stream of bytes.

Since all the logic in Proventia is stream oriented, it does not actually "reassemble" fragments, it just "reorders" them. When one fragment ends and the other starts, it continues where it left off as if there were no fragment break. The TCP protocol delivers a series of ordered fragments to the NetBIOS/SMB decode, which itself delivers a series of ordered fragments to the DCE-RPC decode, which delivers a series of ordered fragments to the application decodes on top of DCE-RPC. The simplicity of this approach is why Proventia has had SMB and DCE-RPC "reassembly" in the core engine as far back as 2000, even though the major DCE-RPC vulnerabilities weren't discovered until 2003 (in contrast, Snort added DCE-RPC reassembly in 2006).

I talked about ASICs in Part 1 of this series. As Chief Scientist of ISS, I had ASIC vendors come to me with proposals to accelerate TCP reassembly and regex pattern-matching. Not only were their proposals slower than our shipping products, but they had a hard time grasping the concepts that (a) TCP reassembly isn't really needed, and (b) their methods of accelerating regex by converting to a DFA can be done in software without their ASIC.

I have talked to engineers at Ironport (an e-mail appliance) and Sidewinder (a firewall). They have indicated that they use the same approach in their products. Like Proventia, they are the fastest in their class of products. Even Microsoft's IIS uses a streaming model. For example, when sending a "GET /index.html HTTP/1.0", you can send 5-billion spaces between the "GET" and the "/index.html". This is because Microsoft is using a state-machine to parse the incoming bytes from TCP. In contrast, Apache reads in a block of 16k bytes, then backtracks to re-parse the boundary between "GET" and "/index.html".
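Here's what that state-machine approach looks like in miniature; this is a sketch of my own, not IIS's or Proventia's actual code. One state transition per byte, the parser's state lives in a small struct, any number of spaces after the method costs O(1) each, and a fragment boundary in the middle of a token is a non-event.

    #include <stdio.h>
    #include <string.h>

    enum http_state { S_METHOD, S_SPACES, S_URI, S_DONE };

    struct http_parser {
        enum http_state state;
        char method[16];
        char uri[256];
        size_t mlen, ulen;
    };

    static void http_feed(struct http_parser *p, const char *buf, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            char c = buf[i];
            switch (p->state) {
            case S_METHOD:
                if (c == ' ')
                    p->state = S_SPACES;
                else if (p->mlen + 1 < sizeof(p->method))
                    p->method[p->mlen++] = c;
                break;
            case S_SPACES:
                if (c == ' ')
                    break;                   /* swallow spaces, one byte at a time */
                p->state = S_URI;
                /* fall through */
            case S_URI:
                if (c == ' ' || c == '\r' || c == '\n')
                    p->state = S_DONE;
                else if (p->ulen + 1 < sizeof(p->uri))
                    p->uri[p->ulen++] = c;
                break;
            case S_DONE:
                return;
            }
        }
    }

    int main(void)
    {
        struct http_parser p = { S_METHOD, "", "", 0, 0 };

        /* Two "TCP segments" splitting the request mid-URI: no reassembly
         * buffer, the parser simply resumes where it left off. */
        const char *seg1 = "GET        /ind";
        const char *seg2 = "ex.html HTTP/1.0\r\n";
        http_feed(&p, seg1, strlen(seg1));
        http_feed(&p, seg2, strlen(seg2));

        printf("method=%s uri=%s\n", p.method, p.uri);   /* GET /index.html */
        return 0;
    }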

Monday, February 19, 2007

High-performance security appliances

See [ Part 2 ] of this series.

I was reading this blog post about ASICs. ASICs are like herbs. While scientists cannot find any benefits to having herbs in shampoo, the public widely believes they make a difference. Therefore, it's impossible to find shampoo that doesn't have herbs in it.

There are lots of firewalls, e-mail appliances, and intrusion-prevention products that don't have herbs^H^H^H^H ASICs, yet manage to have good performance. For example, my (former) Proventia product could handle 5-gbps of real-world traffic with 30-microseconds of latency. And it does this without taking shortcuts. When you look under the hood of ASIC-based systems, you'll find that it's not the ASIC that made them fast, but some sort of sacrifice they've made (such as not analyzing HTTP responses).

To make code run as fast as ASICs, we have to use special techniques. For example, imagine writing a high-performance DNS server. When a packet arrives in a buffer (via DMA), the corresponding cache lines are invalidated. Reading the first bytes of the packet will therefore cause a cache-miss, which stalls the processor for roughly 300 cycles. Likewise, when resolving a random name, the table entry for that name is unlikely to be in the cache, which is another 300-cycle hit. On a multi-processor system, threading locks require locked bus transactions, which can be as much as a 600-cycle hit.

When trying to process 3-million requests per second on a 3-GHz x86 processor, you have only 1000 cycles per request. The above costs would seem to indicate that you need a minimum of 1200 cycles, but you can use tricks to get past this. When processing an incoming request, you can execute a cache-prefetch instruction on the next packet. That packet will already be in the cache by the time you get to it, avoiding the cache miss when you start processing it. Likewise, when you get to the DNS name, instead of reading the table entry, you can execute a prefetch on it and continue working on another packet while the data arrives. Lastly, instead of using normal synchronization primitives that lock the bus, you can construct the code with producer-consumer queues that don't require bus locking.
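Here's a sketch of what those two tricks look like in code (assumptions mine, not the actual Proventia source): the first function hides the cache miss on the next packet behind the work done on the current one, and the ring below it passes packets from exactly one producer core to one consumer core using only acquire/release ordering, so there is no lock and no bus-locked read-modify-write.

    #include <stdatomic.h>
    #include <stddef.h>

    struct packet { unsigned char data[1500]; size_t len; };

    static void process(struct packet *p) { (void)p; /* application logic here */ }

    /* Trick 1: prefetch packet N+1 while the CPU is busy with packet N. */
    void process_batch(struct packet **pkts, size_t count)
    {
        for (size_t i = 0; i < count; i++) {
            if (i + 1 < count)
                __builtin_prefetch(pkts[i + 1]->data, 0, 1);  /* gcc/clang builtin */
            process(pkts[i]);        /* by now pkts[i] is warm in the cache */
        }
    }

    /* Trick 2: single-producer/single-consumer ring. The producer writes only
     * 'head', the consumer writes only 'tail', so no locked instructions. */
    #define RING_SIZE 1024           /* must be a power of two */

    struct ring {
        _Atomic size_t head;         /* written by the producer only */
        _Atomic size_t tail;         /* written by the consumer only */
        struct packet *slot[RING_SIZE];
    };

    int ring_push(struct ring *r, struct packet *p)
    {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head - tail == RING_SIZE)
            return 0;                               /* full */
        r->slot[head & (RING_SIZE - 1)] = p;
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return 1;
    }

    struct packet *ring_pop(struct ring *r)
    {
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (head == tail)
            return NULL;                            /* empty */
        struct packet *p = r->slot[tail & (RING_SIZE - 1)];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return p;
    }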

Thus, with careful coding, you can get rid of all the processor stalls.

Network ASICs from Cavium, RMI, Consentry, etc. solve the problem of processor stalls a different way. They are aggressively multi-threaded, so that when the processor stalls on one thread, they continue executing a different thread. They will help network applications that have frequent stalls (e.g. Snort), but would have no benefit on code that has engineered around the stalls (e.g. Proventia). At ISS, we jokingly referred to all the "hardware accelerators" as "decelerators".

These chips are useful for their power consumption, but even there Intel has almost caught up with the Core 2 Duo, and is likely to surpass them later this year with its 45nm process and hafnium-based dielectric.