Tuesday, March 12, 2013
Code, philosophy, integer overflows
A college professor has created a cool contest to teach people about “integer overflows”. Integer overflows are a common bug that allows hackers to break into computers. Even without hackers, it’s a subtle bug that most C programmers don’t know about – but should.
This post discusses my entry to the contest. It’s only of interest to C coding geeks who write parsers.
The contest requirements are in a blog post at http://blog.regehr.org/archives/909. In summary, the rules are to parse a valid “long” integer, reject invalid input, and NOT EVER do any integer overflow, even in an intermediate result. All the submissions and a test program are located on GitHub at https://github.com/regehr/str2long_contest. Among the submissions, you’ll find my own “robert_2.c”.
The problem with software is that coders start the low-level implementation before thinking of the high-level design. That’s what we see here: low-level implementations without a clear abstract, high-level design.
To be fair, the reason has more to do with the contest than the coders. The contest was defined in terms of low-level C, so it’s natural that people submitted low-level C in response.
The point I’m trying to make with my code is that coders should stop being so specific to C. Code should be as language-neutral as possible. They need to stop casting things so much. They need to stop doing so much pointer math; they should index arrays instead. The only time low-level C weirdness is needed is for spot optimizations, and even in those cases, coders should consider assembly language or “intrinsics” to get at an even lower level of code.
Since my code is high-level, does it perform worse than the low-level implementations? The rough answer is that it should be just as fast.
There are two things to be aware of. The first is how CPUs execute machine code. The second is how C compilers translate high-level code into assembly.
Virtually all submissions have the same inner calculation of multiplying the result variable by 10, then adding the next digit. This translates to the “imul” instruction, which stands for “integer multiplication”. Whereas addition instructions take only 1 clock cycle, multiplication is more expensive, and takes 3 clock cycles to complete. But, CPUs can execute 3 multiplications at the same time. Thus, while the “latency” for any single multiplication is 3 clock cycles, the “throughput” is one completed multiplication per clock cycle.
That’s for the “Sandy Bridge” class processors. Intel has a low-power x86 variant called Atom where multiplications are much more expensive, completing a multiplication only once every 11 clock cycles.
I point out the Atom case because that’s actually what today’s compilers target. Code compiled for the Atom tends to run well on Sandy Bridge, but the reverse isn’t true. The “imul” instruction is one example.
Compilers can change a “multiply by 10” instruction into a “multiply by 8 and add twice”. In other words, instead of “x = x*10”, you get “x = x*8 + x + x”. Since 8 is a power of 2, that means you can replace the multiplication with a logical shift instruction. That means the equation now becomes “x = (x&lt;&lt;3) + x + x”.
Intel x86 has the totally awesome instruction called “lea” that combines a shift/multiplication plus an addition for precisely this situation. This allows us to execute the following two assembly language instructions (in Intel assembler format ‘cause I hate GCC assembler format):
lea y, [x*8 + x]   ; y = x*9
add y, x           ; y = x*9 + x = x*10
Each of these takes only a single clock cycle, so combined they take 2 clock cycles. This is slightly faster than the 3 clock cycle “imul” on Sandy Bridge, and much faster than the 11 clock cycles on Atom.
Moreover, it doesn’t really even take 2 clock cycles. Sandy Bridge can execute 4 instructions at once, and Atom can execute 2 instructions at once. The only caveat is that instructions must not depend on each other. Thus, these two instructions that are dependent on each other can be interleaved with other instructions that aren’t.
This is precisely what happens when Microsoft’s C compiler optimizes the “robert_2.c” code. It decomposed the multiplication into an LEA and an ADD instruction, then interleaved them with other instructions.
The point I’m trying to make here is that both compilers and CPUs are very good at dealing with high-level code. As a coder, you know that multiplication is slower than addition, but you should largely ignore this fact and let the compiler optimize things for you.
The same applies to indexing arrays. Incrementing a pointer is a single operation, while incrementing an index and adding it to a pointer is two operations. That’s true at the source level, but things look different once the compiler and CPU are done with the code. Therefore, as a coder, you should generally just focus on the high-level code. The only time you should start doing low-level optimizations in C is when you’ve identified hot-spots in your code.
So, I’ve discussed how code should be written at a “high-level” and that you should let the compiler and CPU worry about how best to execute at the “low-level”. In this section, I’m going to attempt to “prove” this with benchmarks, showing how my high-level code performs at essentially the same speed as low-level code.
I downloaded the entire contest project, compiled it, and ran it on my laptop. As it turns out, my “robert_2.c” code was twice as slow as the fastest result. This is within the range of compiler noise – simply changing compiler options or versions might easily reverse the results. But still, the slowness bothered me.
Upon further investigation, I found there might be reasons for the speed differential. The test program did two things at once: test for “correctness” and test for “speed”. The consequence is that most of the input was abnormal. For example, almost all the numbers had leading zeroes.
My “robert_2.c” code doesn’t check for abnormalities explicitly. Instead, things like leading zeroes are handled implicitly – and slightly slower. Therefore, I made an update to the code called “robert_3.c” (https://gist.github.com/robertdavidgraham/5136591) that adds explicit checking. I don’t like it as much, but it’ll test my claim that it’s the test input that is at fault.
And indeed, whereas “robert_2.c” was twice as slow as the leading contender, “robert_3.c” was only 10% slower. Clearly, optimizing the code to target the test input is a win.
But instead of changing the parsing code, we should change the benchmark program. It shouldn’t be testing speed based on erroneous input. It should test speed using expected, normal input. Therefore, I changed the benchmark. Instead of using the sprintf() format string “%030ld” to format the input integers, I changed it to “%ld”, getting rid of the leading zeroes. I also got rid of all the erroneous input.
The result was impossible. If there are no leading zeroes and no errors in the input of my new benchmark, then “robert_2.c” should be faster in every case than “robert_3.c”. But the reverse was true, with “robert_3.c” being 30% quicker than “robert_2.c”.
This proves my point at the top: otherwise meaningless changes in the build/runtime environment make more difference to the benchmarks than the code itself. Even when there is a 2:1 difference in speed, this is within the range of compiler weirdness, where different compilers, or meaningless changes to the code, can completely flip performance around. Also, if you look at the "stefan" algorithm, you'll find that it's essentially the same as my algorithm, except it uses pointer-arithmetic instead of indexed arrays, thus proving that indexing isn't slower.
There are indeed a few submissions that have clear performance problems, but all the rest perform the same, within the margin of error of compiler weirdness. Even when there is a 2 to 1 difference, it’s still the “same” performance. The only way to truly differentiate the code by speed is to do a wide range of tests using many compilers and many different CPUs, using many kinds of inputs.
Problems with the contest
While an awesome idea, I do want to point out problems with the contest.
The classic problem in parsing is the confusion between “internal” and “external” formats. This contest extends that confusion. Internal types like “long” are abstract and change from platform to platform. External formats like strings are constant, and don’t change depending on platform. A PDF file is a PDF file regardless of whether it’s being parsed on a 32-bit or 64-bit computer. It shouldn’t suddenly fail to parse simply because you’ve copied it to your 32-bit Android phone.
Thus, the contest should clearly define the external format without depending upon abstract internal concepts. This specification could be something like “a two’s complement signed 64-bit number” or “a number between -9223372036854775808 and 9223372036854775807, inclusive”. This means the internal function prototype changes to something like “int64_t str2long()”.
Everyone assumes the ASCII character set, where we check things like a character being between ‘0’ and ‘9’, because in ASCII, ‘0’ is less than ‘9’ and the digits are contiguous. Whereas ASCII digits go “0123456789”, you can imagine some other character set going “1234567890”. Since in this new character set the value of ‘0’ is greater than ‘9’, almost all the submitted code would fail. One submission, mats.c, tries to fix this. It actually doesn’t. The problem here is again “internal” vs. “external” format: the internal character set used to compile the code may differ from the external character set used to express the number, so that code still breaks if the compiler expects EBCDIC but the input is ASCII. The absolutely correct way to handle this is to specify the external format as “ASCII” and for everyone to change their code to use integer values like 48 instead of character constants like ‘0’. My code depends upon ASCII anyway because, even though the official standard allows any character set, the de facto standard is that there is only one character set: ASCII.
The contest used a global “errno” type value to signal an error. Globals like this are bad in even single threaded code, and become a nightmare in multi-threaded code. The contest should’ve signaled errors as another parameter, like “int64_t str2long(const char *s, int *error)”.
The contest is very Linux-specific. In order to run on my MacBook, I had to port the clock_gettime() function, because CLOCK_MONOTONIC isn’t supported on the Mac. He should’ve just used the ANSI standard “clock()” function, and the code would work everywhere. Sure, there is a good debate to be had about “wall-clock” vs. “system-clock” (another common problem in code along with integer-overflows and external-internal-formats), but he gets the wrong answer anyway. The proper Linux clock for this is “CLOCK_MONOTONIC_RAW”, as NTP can cause the clock to skew for mere “CLOCK_MONOTONIC”.
It took me 5 minutes to write, test, and submit the actual code. It’s taken me hours to write up the thinking behind it. That’s because I’m more of a philosophical coder, and I’ve spent a lot of time thinking about all these issues, even though it takes only minutes to apply these thoughts.
Software design and language-neutral code is good. Gotos, global variables, #defines, pointer-arithmetic are all bad. Parsing should follow the external/internal pattern. Portability is good, platform specific code is bad. Except for obvious performance problems, you can’t really compare the speed of code without a lot of work.