Sunday, February 17, 2013

Scalability: it's the question that drives us

In order to grok the concept of scalability, I've drawn a series of graphs. Talking about "scalability" is hard because we tend to translate it into "performance", but the two concepts are unrelated. We say things like "NodeJS is 85% as fast as Nginx", but speed doesn't matter here; scalability does. The important difference between the two is how they scale, not how they perform. I'm going to show this with graphs in this post.

Consider the classic Apache web server. We typically benchmark its speed in terms of "requests per second". But there is another metric: "concurrent connections". Consider a web server handling 1,000 requests-per-second, where it takes 5 seconds to complete a request. That means at any point in time, on average, there will be 5,000 pending requests, or 5,000 concurrent connections. But Apache has a problem. It creates a thread to handle each connection, and the operating system can't handle a lot of threads. So, once there are too many connections, performance falls off a cliff.
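That relationship between throughput and latency is just Little's Law. As a quick sanity check, using the hypothetical numbers above (not benchmark data):

```javascript
// Little's Law: average concurrency = arrival rate × average time in system.
// The figures below are the hypothetical numbers from the text above.
const requestsPerSecond = 1000;  // arrival rate
const secondsPerRequest = 5;     // average latency per request
const concurrentConnections = requestsPerSecond * secondsPerRequest;
console.log(concurrentConnections); // 5000
```

So a server doesn't need a huge request rate to end up with thousands of simultaneous connections; a little latency is enough.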

This is shown in the graph below. It's a hypothetical graph rather than a real benchmark, but it's about what you'll see in the real world: around the 5,000-connection point, performance falls off a cliff.

Let's say that you are happy with the performance at 5000 connections, but you need the server to support up to 10,000 connections. What do you do? The naive approach is to simply double the speed of the server, or buy a dual-core server.

But this naive approach doesn't work. As shown in the graph below, doubling performance doesn't double scalability.

While this graph shows a clear doubling of performance, it shows only about a 20% increase in terms of scalability, handling about 6,000 connections.

The same is true if we increase performance 4 times, 8 times, or even 16 times, as shown in the graph below. Even using a server 16 times as fast as the original, we still haven't even doubled scalability.

The solution to this problem isn't faster hardware, but changing the software to scale. Consider some server software that is a lot slower than Apache, but whose performance doesn't drop off so quickly when there are lots of connections to the server. The graph would look like the orange line in the following graph:

This orange line could be a server running NodeJS on my laptop computer. Even though it's slower than the real beefy 32-core server you spent $50,000 on, it'll perform better when you've got a situation that needs to handle 10,000 concurrent connections.

The point I'm trying to make is that "performance" and "scalability" are orthogonal problems. When I try to teach engineers how to fix scalability, their most common question is "but won't that hurt performance?". The answer is that it almost doesn't matter how much you hurt performance, as long as it makes the application scale.

I've encapsulated all this text into the following picture. When I have a scalability problem, I can't solve it by increasing performance, as shown by all the curvy lines. Instead, I have to fix the problem itself -- even if it means lower performance, as shown by the orange line. The moral of the story is that performance and scalability are orthogonal. This is the one picture you need to remember when dealing with scalability.

I use the "notebook computer running NodeJS" as an example because it's frankly unbelievable to many people. When somebody has invested huge amounts of money in big solutions, they flat out refuse to believe that a tiny notebook computer can vastly outperform them at scale. In the dumber market (government agencies, big corporations) the word "scale" is used to refer to big hardware, not smarter software.

In recent years, interest in scalable servers has started to grow rapidly. This is shown in the following Netcraft graph of the most popular web servers. Apache still dominates, but due to its scalability problems, it's dropping in popularity. Nginx, the most popular scalable alternative, has been growing rapidly.

Nginx is represented by the green line in the above graph, and you can see how it's grown from nothing to become the second most popular web server in the last 5 years.

But even Nginx has limits. It scales to about 100,000 concurrent connections before operating system limits prevent further scalability. This is far below what the hardware is capable of. Today's hardware, even cheap desktop computers, can scale to 10 million connections, or 100 times what Nginx is practically capable of. I call this the "C10M Problem", referring to how software is still far below the capability of the hardware.


Talking about scalability is hard. It's easier to understand if you can visualize it. That's why I put together these graphs. I want to use these graphs in future presentations. But, if I store them on my hard disk, I'm likely to lose them. Therefore, I'm posting these to the intertubes so that I can just use the google to find them later. You are free to use these graphs too, if you want, with no strings attached, though I wouldn't mind credit when it's not too much trouble.


Wes Felter said...

I find it easier to read with requests/s on the x axis and non-error responses/s on the y axis. 99th percentile latency vs. requests/s is another good metric. I don't care about concurrency per se, just supportable load.

EJ campbell said...

It seems wrong to be talking about scalability absent from looking at the work actually being done for each request. Node can be scalable up the wazoo, but if there's 5 seconds of CPU time per request, the solution is not scalable. The only time a solution like that makes sense is if your clients are mostly idle.

Anonymous said...

Did you actually try NodeJS in a real environment, or do you believe their marketing crap?

Robert Graham said...

In response to the post above:

Have I tried NodeJS? Yes. I've written an app on NodeJS and tested its scalability vs. Apache. It scales vastly better.

The only people I've found that dislike NodeJS are those who don't have scalability problems.

Of course, NodeJS has lots of other problems. It'll happily run out of memory and then fall over. I'm not sure that I would use it for most problems. I just use it as an example of something that, unlike Apache, does scale.

Anonymous said...

"It'll happily run out of memory and then fall over."
Seriously? Then it does not scale.

The technology behind NodeJS is not new, it exists for tons of other languages. Why not talk about them, instead?

NodeJS is reinventing the wheel, except badly.

mike said...

It would be interesting to see how the libdispatch version of Apache fares in similar tests.
It currently only runs on OS X or FreeBSD and scoreboard doesn't work but it might make a huge difference to the overhead.

Tom said...

The concept is correct, but in some places you should probably use the word capacity. For example: "Even using a server 16 times as fast as the original, we still haven't even doubled scalability". It should end with "haven't even doubled capacity". This is a scalability problem. Scalability isn't a quantity you can double. It's a property (i.e., a quality). The definition I use is: The ability to add resources to a system in order to attain or sustain a desired performance level. I'm sure that requires modification in some other contexts. But your point is still correct.
