Tuesday, March 12, 2013

Code, philosophy, integer overflows

A college professor has created a cool contest to teach people about “integer overflows”. Integer overflows are a common class of bug that lets hackers break into computers. Even without hackers, they’re subtle bugs that most C programmers don’t know about – but should.

This post discusses my entry to the contest. It’s only of interest to C coding geeks who write parsers.

The contest

The contest requirements are in a blogpost at http://blog.regehr.org/archives/909. In summary, the rules are to parse a valid “long” integer, reject invalid input, and NOT EVER do any integer overflow, even in an intermediate result. All the submissions and a test program are located on GitHub https://github.com/regehr/str2long_contest. Among submissions by other people, you'll find my own “robert_2.c”.

High-level C

My submission is very different from the others. It doesn’t have a #define, (cast), or goto. Instead of incrementing the string pointer *(s++), my code increments an index into the byte array, s[i++]. In fact, it’s not so much C as language-neutral code. With minor changes, it could be Java, JavaScript, Lua, VisualBasic, and so on. It’s not just language syntax but mathematics and logic – most submissions parse a negative number by first parsing the “unsigned” version, then changing it to signed at the end. That wouldn’t work in Java, because Java has no “unsigned” type. And since the most negative value has a larger magnitude than the most positive one (e.g. 8-bit values range from -128 to +127), an integer overflow can occur when negating.
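To make that concrete, here is a hypothetical sketch (not my actual submission) of the language-neutral approach: accumulate the result as a negative value, since the negative range is the larger one, and negate at the end only for positive input. No unsigned type is needed, so the same logic works in Java.

```c
#include <stddef.h>
#include <limits.h>

/* Hypothetical sketch: parse by accumulating in the NEGATIVE range,
 * because |LONG_MIN| > LONG_MAX, then negate for positive input. */
long parse_long(const char *s, int *error)
{
    long result = 0;
    int negative = 0;
    size_t i = 0;

    *error = 0;
    if (s[i] == '-') {
        negative = 1;
        i++;
    }
    if (s[i] < '0' || s[i] > '9') {   /* must have at least one digit */
        *error = 1;
        return 0;
    }
    for (; s[i] != '\0'; i++) {
        int digit = s[i] - '0';
        if (digit < 0 || digit > 9) {
            *error = 1;
            return 0;
        }
        /* reject before the multiply could underflow LONG_MIN */
        if (result < (LONG_MIN + digit) / 10) {
            *error = 1;
            return 0;
        }
        result = result * 10 - digit; /* stays in the negative range */
    }
    if (!negative) {
        if (result == LONG_MIN) {     /* e.g. +9223372036854775808 */
            *error = 1;
            return 0;
        }
        result = -result;
    }
    return result;
}
```

Note the overflow check happens before the multiply, on an expression that cannot itself overflow.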

The problem with software is that coders start the low-level implementation before thinking of the high-level design. That’s what we see here: low-level implementations without a clear abstract, high-level design.

To be fair, the reason has more to do with the contest than the coders. The contest was defined in terms of low-level C, so it’s natural that people submitted low-level C in response.

The point I’m trying to make with my code is that coders should stop being so specific to C. Code should be as language-neutral as possible. They need to stop casting things so much. They need to stop doing so much pointer math; they should index arrays instead. The only time low-level C weirdness is needed is for spot optimizations, and even in those cases, coders should consider assembly language or “intrinsics” to get at an even lower level of code.

Low-level code

Since my code is high-level, does it perform worse than the low-level implementations? The rough answer is that it should be just as fast.

There are two things to be aware of. The first is how CPUs execute machine code. The second is how C compilers translate high-level code into assembly.

Virtually all submissions have the same inner calculation: multiply the result variable by 10, then add the next digit. This translates to the “imul” instruction, which stands for “integer multiplication”. Whereas addition instructions take only 1 clock cycle, multiplication is more expensive and takes 3 clock cycles to complete. But CPUs can have 3 multiplications in flight at the same time. Thus, while the “latency” of any single multiplication is 3 clock cycles, the “throughput” is one completed multiplication per clock cycle.

That’s for the “Sandy Bridge” class processors. Intel has a low-power x86 variant called Atom where multiplications are much more expensive, completing a multiplication only once every 11 clock cycles.

I point out the Atom case because that’s actually what today’s compilers target. Code compiled for the Atom tends to run well on Sandy Bridge, but the reverse isn’t true. The “imul” instruction is one example.

Compilers can change a “multiply by 10” into a “multiply by 8, then add twice”. In other words, instead of “x = x*10”, you get “x = x*8 + x + x”. Since 8 is a power of 2, the multiplication can be replaced with a logical shift instruction, so the expression becomes “x = (x<<3) + x + x”.
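In C terms, the decomposition looks like this. You’d never write the second form by hand – the compiler performs the rewrite automatically – and the sketch sticks to non-negative inputs, since left-shifting a negative value is undefined in C:

```c
/* Both functions compute x*10 for non-negative x; a compiler may
 * lower the first into shift/lea form where imul is expensive. */
long times10_mul(long x)   { return x * 10; }
long times10_shift(long x) { return (x << 3) + x + x; }  /* x*8 + x + x */
```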
Intel x86 has the totally awesome instruction called “lea” that combines a shift/multiplication plus an addition for precisely this situation. This allows us to execute the following two assembly language instructions (in Intel assembler format ‘cause I hate GCC assembler format):
lea x, [x*8 + x]
add x,x
Each of these takes only a single clock cycle, so combined they take 2 clock cycles. This is slightly faster than the 3 clock cycle “imul” on Sandy Bridge, and much faster than the 11 clock cycles on Atom.

Moreover, it doesn’t really even take 2 clock cycles. Sandy Bridge can execute 4 instructions at once, and Atom can execute 2 instructions at once. The only caveat is that instructions must not depend on each other. Thus, these two instructions that are dependent on each other can be interleaved with other instructions that aren’t.

This is precisely what happens when Microsoft’s C compiler optimizes the “robert_2.c” code. It decomposes the multiplication into LEA and ADD instructions, then interleaves them with other instructions.

The point I’m trying to make here is that both compilers and CPUs are very good at dealing with high-level code. As a coder, you know that multiplication is slower than addition, but you should largely ignore this fact and let the compiler optimize things for you.

The same applies to indexing arrays. Incrementing a pointer is a single operation, while incrementing an index and adding it to a pointer is two. That’s true at the source level, but compilers routinely turn indexed loops into the same machine code as pointer loops. Therefore, as a coder, you should generally just focus on the high-level code. The only time you should start doing low-level optimizations in C is when you’ve identified hot-spots in your code.
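Here are the two styles side by side, in a hypothetical digit-counting helper; modern compilers typically emit equivalent machine code for both:

```c
#include <stddef.h>

/* Pointer-arithmetic style: advance the pointer itself. */
size_t count_digits_ptr(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (*s >= '0' && *s <= '9')
            n++;
    }
    return n;
}

/* Indexed style: the array stays put, the index moves. */
size_t count_digits_idx(const char *s)
{
    size_t n = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (s[i] >= '0' && s[i] <= '9')
            n++;
    }
    return n;
}
```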

Benchmarking code

So, I’ve discussed how code should be written at a “high-level” and that you should let the compiler and CPU worry about how best to execute at the “low-level”. In this section, I’m going to attempt to “prove” this with benchmarks, showing how my high-level code performs at essentially the same speed as low-level code.

I downloaded the entire contest project, compiled it, and ran it on my laptop. As it turns out, my “robert_2.c” code was twice as slow as the fastest result. This is within the range of meaningless variation – simply changing compiler versions or optimization flags might easily reverse the results. But still, the slowness bothered me.

Upon further investigation, I found there might be reasons for the speed differential. The test program did two things at once: test for “correctness” and test for “speed”. The consequence is that most of the input was abnormal. For example, almost all the numbers had leading zeroes.

My “robert_2.c” code doesn’t check for abnormalities explicitly. Instead, things like leading zeroes are handled implicitly – and slightly slower. Therefore, I made an update to the code called “robert_3.c” (https://gist.github.com/robertdavidgraham/5136591) that adds explicit checking. I don’t like it as much, but it’ll test my claim that it’s the test input that is at fault.
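The kind of explicit up-front check added can be sketched like this (a hypothetical helper, not the actual “robert_3.c” code):

```c
#include <stddef.h>

/* Hypothetical helper: reject leading zeroes before the main parse
 * loop, instead of letting the loop handle them implicitly. */
int has_leading_zero(const char *s)
{
    size_t i = 0;
    if (s[i] == '-')
        i++;
    /* "0" alone is valid; "0123" and "-007" have leading zeroes */
    return s[i] == '0' && s[i + 1] != '\0';
}
```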

And indeed, whereas “robert_2.c” was twice as slow as the leading contender, “robert_3.c” was only 10% slower. Clearly, optimizing the code to target the test input is a win.

But instead of changing the parsing code, we should change the benchmark program. It shouldn’t be testing speed based on erroneous input. It should test speed using expected, normal input. Therefore, I changed the benchmark. Instead of using the sprintf() format string of “%030ld” to format the input integers, I changed it to “%ld”, getting rid of the leading zeroes. I also got rid of all the erroneous input.
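The difference between the two format strings amounts to this (a hypothetical helper showing both forms):

```c
#include <stdio.h>

/* "%030ld" zero-pads the number to 30 characters, which is why
 * almost every test input had leading zeroes; plain "%ld" produces
 * normal decimal input. */
void format_test_input(long value, char *padded, char *normal, size_t size)
{
    snprintf(padded, size, "%030ld", value); /* "00...00123" (30 chars) */
    snprintf(normal, size, "%ld", value);    /* "123" */
}
```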

Things didn’t turn out as I expected, though. My error-checking “robert_3.c” was still faster than “robert_2.c” even though I’d removed all errors from the input. Indeed, the “robert_3.c” file was the second fastest of all the submissions. (Screenshot to the right)

This is impossible. If there are no leading zeroes and no errors in the input in my new benchmark, then “robert_2.c” should be faster than “robert_3.c” in every case. But the reverse is true, with “robert_3.c” being 30% quicker than “robert_2.c”.

This proves my point at the top: otherwise meaningless changes in the build/runtime environment make more difference to the benchmarks than the code itself. Even when there is a 2:1 difference in speed, this is within the range of compiler weirdness, where different compilers, or meaningless changes to the code, can completely flip performance around. Also, if you look at the "stefan" algorithm, you'll find that it's essentially the same as my algorithm, except it uses pointer-arithmetic instead of indexed arrays, thus proving that indexing isn't slower.

There are indeed a few submissions that have clear performance problems, but all the rest perform the same, within the margin of error of compiler weirdness. Even when there is a 2 to 1 difference, it’s still the “same” performance. The only way to truly differentiate the code by speed is to do a wide range of tests using many compilers and many different CPUs, using many kinds of inputs.

Problems with the contest

While the contest is an awesome idea, I do want to point out some problems with it.

The classic problem in parsing is the confusion between “internal” and “external” formats. This contest extends that confusion. Internal types like “long” are abstract and should change from platform to platform. External formats like strings are constant, and don’t change depending on platform. A PDF file is a PDF file regardless of whether it’s being parsed on a 32-bit or 64-bit computer. It shouldn’t suddenly fail to parse simply because you’ve copied it to your 32-bit Android phone.

Thus, the contest should clearly define the external format without depending upon abstract internal concepts. This specification could be something like “a two’s complement signed 64-bit number” or “a number between -9223372036854775808 and 9223372036854775807, inclusive”. This means the function prototype changes to something like “int64_t str2long()”.

Everyone assumes the ASCII character set, where we check whether a character is between ‘0’ and ‘9’, relying on the fact that in ASCII, ‘0’ is less than ‘9’. Whereas ASCII digits go “0123456789”, you can imagine some other character set ordering them “1234567890”. Since in that character set the value of ‘0’ is greater than ‘9’, almost all the submitted code would fail. One submission, mats.c, tries to fix this. It actually doesn’t. The problem here is again “internal” vs. “external” format. The internal character set used to compile the code may differ from the external character set used to express the number, so Mats’s code still breaks if the compiler expects EBCDIC and the input is ASCII. The absolutely correct way to handle this is therefore to specify the external format as “ASCII” and for everyone to change their code to use integer values like 48 instead of character constants like ‘0’. My code depends upon ASCII: even though the official standard allows any character set, the standard is wrong – the de facto standard is that there is only one character set, ASCII.
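The integer-constant style suggested here looks like this (hypothetical helper; 48 and 57 are the ASCII codes for ‘0’ and ‘9’):

```c
/* Hypothetical helper using integer values instead of character
 * constants, so the test is tied to the external (ASCII) format
 * rather than to whatever character set the compiler uses. */
int is_ascii_digit(int c)
{
    return c >= 48 && c <= 57;  /* '0'..'9' in ASCII */
}
```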

The contest used a global “errno” type value to signal an error. Globals like this are bad in even single threaded code, and become a nightmare in multi-threaded code. The contest should’ve signaled errors as another parameter, like “int64_t str2long(const char *s, int *error)”.
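A minimal sketch of that signature, with a hypothetical digits-only body for illustration (overflow checks and the minus sign omitted; not actual contest code). The error flag belongs to the caller, so concurrent calls can’t clobber each other’s error state the way a global errno-style variable can:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the error-out-parameter API. */
int64_t str2long_v2(const char *s, int *error)
{
    int64_t result = 0;
    size_t i;

    *error = 0;
    if (s[0] == '\0') {
        *error = 1;
        return 0;
    }
    for (i = 0; s[i] != '\0'; i++) {
        if (s[i] < '0' || s[i] > '9') {
            *error = 1;
            return 0;
        }
        result = result * 10 + (s[i] - '0'); /* overflow check omitted */
    }
    return result;
}
```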

The contest is very Linux-specific. In order to run it on my MacBook, I had to port the clock_gettime() function, because CLOCK_MONOTONIC isn’t supported on the Mac. He should’ve just used the ANSI standard clock() function, and the code would work everywhere. Sure, there is a good debate to be had about “wall-clock” vs. “system-clock” time (another common problem in code, along with integer-overflows and external-internal-formats), but he gets the wrong answer anyway: the proper Linux clock for this is CLOCK_MONOTONIC_RAW, since NTP can skew plain CLOCK_MONOTONIC.
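The portable alternative can be sketched with a hypothetical helper built on the ANSI clock() function, which exists on every hosted C implementation (note it measures CPU time, not wall-clock time):

```c
#include <time.h>

/* Convert a pair of clock() readings into seconds. */
double elapsed_seconds(clock_t start, clock_t end)
{
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Typical use:
 *   clock_t t0 = clock();
 *   ... run the parser many times ...
 *   double secs = elapsed_seconds(t0, clock());
 */
```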


It took me 5 minutes to write, test, and submit the actual code. It’s taken me hours to write up the thinking behind it. That’s because I’m more of a philosophical coder, and I’ve spent a lot of time thinking about all these issues, even though it takes only minutes to apply these thoughts.

Software design and language-neutral code are good. Gotos, global variables, #defines, and pointer arithmetic are all bad. Parsing should follow the external/internal pattern. Portability is good; platform-specific code is bad. Except for obvious performance problems, you can’t really compare the speed of code without a lot of work.


Toby said...

I agree with much of your "philosophy". My entry (toby.c) is also "high level" in a similar sense to yours (with similar performance, at least on my box when running Regehr's str2long_test).

That my code should perform similarly to yours again highlights how much this question of performance here depends on the local setup (compiler etc.): my code repeatedly tests the same conditional unnecessarily, which yours avoids.

My code also differs from many of the other solutions because it documents the reasoning behind why it is correct.

This is unsurprising, since I'm (these days) working in software verification -- the comments in my code document the thought process I had to go through to convince myself why it works without overflow etc. (This is probably a more involved thought process than others may require, because this is the first bit of C code I've written since mid-2006.) I then formalised that reasoning by verifying the absence of signed overflow in my submission using our local C verification technology.

On that -- our formal C semantics rejects lots of things that are present in many of the other solutions, such as assignments with side effects, like
i = i++;
because the compiler is free to decide exactly in what order the side-effecting operations occur, so giving such statements a formal semantics that will survive a particular compiler version is non-trivial.

Interestingly, robert_2.c is one of the (few -- only 7) working submissions that can be given a formal C semantics by our local tool. I suspect this is because your code, like mine, was purposefully written to be "high level" rather than exploiting low level C features (like side-effecting assignments etc.) that are more difficult to give formal semantics to.

So writing "high level" C code also assists with formally reasoning about it -- because its semantics are more likely to be independent of the particular compiler you're using and the platform you're compiling for.

Matt said...

Hi Robert, thanks for taking the time to write this up. I have learned a lot reading your blog posts. Your solution to the contest and discussion was interesting, but I was more interested in something more mundane. It looks like you are using OS X on your Mac. I seem to recall you using MS Windows on your Mac, last time I paid attention. Why the change?

Robert Graham said...

Lol, I use both on my Mac. My first submission (robert.c) was written in 'vi' on Mac OS, because that's what I had booted at the time. Then, when I had a bug (it didn't handle the empty string), I was using Win7 on the Mac, so the second submission was edited/compiled in VisualStudio 2010, which is why the blog post mentions how Microsoft's C compiler optimizes things.

These days, I use both Mac and Win equally on the MacBook Air. But, I stress that I use Windows in order to troll Mac lovers :).

Mark said...

You should probably change the destination to y for lea and add, as add x, x =~ x *= 2

KrzaQ said...

Everyone assumes the ASCII character set, where we check things a character being between ‘0’ and ‘9’, because in ASCII, ‘0’ is less than ‘9’. Whereas ASCII digits go “0123456789”, you can imagine some other character set being “1234567890”. Since in this new character set the value of ‘0’ is now greater than ‘9’, most all the submitted code will fail.

I disagree. If you read the C language standard, at 5.2.1/3 you can see the following guarantee (this is actually from N1570):

In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.

Robert Graham said...

Thanks KrzaQ. I vaguely remembered something like that. I quickly scanned the standard to confirm it, but missed that passage.

Though, the standard is wrong. External data has no character set; it's an undifferentiated sequence of bytes.

renoX said...

I find very interesting that your "high-level" implementation is so fast.
That said
1) instead of dividing by 10, you multiply by 10, so your code also has a 'low level' optimization.
2) in C, an array index should be of type size_t, not int, if memory serves (unless you know that the array is small).
3) While the API "int64_t str2long(const char *s, int *error)" is much better than the contest's one (the name should be updated, though), I don't like mixing inputs and outputs in function parameters; how about returning a struct { int error; int64_t result; }?