Thursday, February 27, 2014

C programming: you are teaching it wrong

It's been three decades. There is no longer an excuse for the failed way colleges teach "C programming". Let me help.


Chapter 1: the debugger


C programming starts and ends with the debugger. Before they write a line of code of their own, students need to be comfortable single stepping line-by-line through source, viewing local variables, and dumping memory.

A good first assignment is to run the following program and have the student report the values of 'a' and 'b', which can only be found by stepping through the code in a debugger.

#include <stdio.h>
#include <stdlib.h>

int main() {
    int a = rand();
    int b = rand();
    printf("a + b = %d\n", a + b);
    return 0;
}
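To complete the assignment, the student compiles with debug symbols and steps through. Here is a rough sketch of the session using command-line GDB, assuming the program above was saved as "rand.c" (the same steps are just buttons in any IDE debugger):

    $ gcc -g rand.c -o rand     # -g keeps the debug symbols the debugger needs
    $ gdb ./rand
    (gdb) break main            # stop at the top of main()
    (gdb) run
    (gdb) next                  # step over "int a = rand();"
    (gdb) next                  # step over "int b = rand();"
    (gdb) print a               # inspect the local variables
    (gdb) print b
    (gdb) continue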

The "printf()" function is not a debugger. If that's how you debug your code most of the time, you are doing it wrong.

GDB is not an adequate debugger. The reason people rely upon "printf()" is because GDB is too difficult. Even the "TUI" interface is inadequate.

The debugger is not your "last resort", the thing you struggle with when there is no other way to fix a bug. Instead, the debugger is the "first resort": even when your program works correctly, you still use the debugger to step through your code line-by-line, double checking variables, making sure that it's behaving the way you expect.


Microsoft's Visual Studio and Apple's Xcode both have excellent debuggers. I haven't used Eclipse much, but it looks adequate. The problem with these is that they also require you to create a complicated "project" that manages everything. That's a big hurdle for small one-file programs (like the example above), but students will have to learn to deal with the overhead of these "projects" anyway.

To repeat: unless you have an adequate, easy-to-use source-level debugger, you really shouldn't be programming in C. Any course on C needs to start with the debugger, before even teaching students to code.

Chapter 2: smashing the stack for fun and profit


C is an inherently dangerous language. When a bug writes outside its buffer, it's not immediately caught, as it would be in Java or other high-level languages. Instead, memory gets corrupted. It's only later in the program, sometimes much later, when that corrupted memory gets used, that the program crashes. This teaches students that bugs happen by magic, and are deep impenetrable mysteries that no mortal can understand.

Students need to be taught that they have no training wheels, that C will happily corrupt memory with no error indication (until later). They need to be told upfront that, unlike Java, when the program crashes, the line-number in the code is usually not the code that's at fault.

In particular, students need to learn the "stack frame" and "heap" structures to the same level of detail as in the document "Smashing the Stack for Fun and Profit" [*]. Before even teaching students the syntax for calling a function, students need to learn that there is a structure of data behind every function call. When a function crashes on return, they need to be able to dump the stack memory and find out what happened. They need to be familiar with how parameters are pushed onto the stack, then the return address, then local variables. They need to watch, in a debugger, as this happens. Students need to learn that memory corruption isn't a mystery, but something deterministic that they can trace back and solve.
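A minimal sketch of the kind of demonstration that works here. The function and buffer names are invented, and the exact behavior depends on compiler, optimization level, and stack-protector settings, but the essential lesson survives: nothing goes wrong on the line with the bug; the crash (or "stack smashing detected" abort) happens when the function returns.

    #include <string.h>
    #include <stdio.h>

    /* Writes 64 bytes into a 16-byte buffer, clobbering the saved
     * registers and return address in smash()'s stack frame. */
    void smash(void) {
        char buf[16];
        memset(buf, 'A', 64);   /* deliberate out-of-bounds write */
    }

    int main(void) {
        smash();                /* the crash is reported here, on return */
        printf("you will probably never see this line\n");
        return 0;
    }

Stepping through this in a debugger, and dumping the stack memory before and after the memset(), shows exactly which bytes got overwritten, and why the saved return address is now a pile of 0x41 ('A') bytes.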

Teachers of C avoid these difficult technical details, but that does a disservice to the student. The student's first bug is going to be stack/heap corruption crashing on the wrong line of code, and they'll learn that solving bugs in C is hopeless.

This would also be a good time to teach students bounds-checkers like Valgrind, which add the training wheels that C lacks and other languages take for granted.
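For example, here is a heap version of the same kind of bug, which Valgrind's default memcheck tool will catch. The file name is made up, and the exact wording of the report varies by version, but the point is that the error gets reported on the line that causes it, not wherever the program later falls over:

    /* heap_bug.c -- a deliberate one-byte heap overflow */
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        char *p = malloc(8);
        strcpy(p, "12345678");  /* 8 characters plus '\0' is 9 bytes into an 8-byte buffer */
        free(p);
        return 0;
    }

Run normally, this program may well appear to work. Compiled with -g and run as "valgrind ./heap_bug", Valgrind flags the invalid one-byte write and prints a stack trace pointing at the strcpy() line.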

Chapter 3: strcpy_s()/strlcpy()


A decade after "buffer-overflow" worms ravaged the Internet, professors are still telling their students to use the functions that caused the worms, like strcpy() and sprintf(). This needs to stop.

We have safer replacement functions. Microsoft has created standard safe functions, like strcpy_s() and sprintf_s(), which have been adopted by the standards committee. Or, since GCC doesn't really support this standard yet, you can teach functions like strlcpy() and snprintf() instead.

Students should be taught that using the old, non-bounds-checking functions shouldn't even be an option when writing code. C is an inherently dangerous language -- the earlier students learn to program as if C were dangerous, the better.
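A rough sketch of the difference, assuming 'dst' is a small local buffer and 'src' is input of unknown length from outside the program. Note that strlcpy() needs a BSD system or libbsd, and strcpy_s() needs a library that actually implements C11's optional Annex K:

    char dst[16];

    /* The old way -- should not even be an option: */
    strcpy(dst, src);                       /* overflows dst whenever src has 16+ characters */
    sprintf(dst, "%s", src);                /* same problem */

    /* Bounds-checked replacements: */
    snprintf(dst, sizeof(dst), "%s", src);  /* C99: truncates, always NUL-terminates */
    strlcpy(dst, src, sizeof(dst));         /* BSD: same guarantee, simpler to read */
    strcpy_s(dst, sizeof(dst), src);        /* C11 Annex K, Microsoft's "_s" family */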

Chapter N+1: internal vs. external data


What makes C different from other languages is that it allows you to access raw memory. Just because you can do this doesn't mean you should.

The typical C paradigm is to have a pointer 'p', then do arithmetic on that pointer, such as incrementing it in order to enumerate the objects in an array, something like this:

    for (p=start; p<end; p++)
        printf("foo = %s\n", p->foo);

This is bad. In general, C should be programmed like any other language, where an index variable enumerates an array:

    for (i=0; i<count; i++)
        printf("foo = %s\n", p[i].foo);
 
When parsing a two-byte integer from external input, a C programmer is taught to do the following:

    x = *(short*)p;

This is bad. Historically, it has meant that RISC processors crash unexpectedly on unaligned data. On Intel processors, teaching this method has led to unending confusion about "byte-order" (a.k.a. "endianness"). The correct method to teach students is the following:

    x = p[0]*256 + p[1];

It's the same way that you'd extract an integer in any other language, and it doesn't need those pesky "ntohs()" macros to swap bytes.
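The same pattern extends to other sizes and byte orders, without casts and without ntohs()/ntohl(). A small self-contained sketch (the four input bytes are made up, standing in for data read off the network):

    #include <stdio.h>

    int main(void) {
        /* pretend these four bytes arrived from the network */
        unsigned char p[4] = { 0x12, 0x34, 0x56, 0x78 };

        unsigned x = p[0]*256 + p[1];                  /* 16-bit big-endian: 0x1234 */
        unsigned y = p[0] + p[1]*256;                  /* 16-bit little-endian: 0x3412 */
        unsigned long z = ((unsigned long)p[0] << 24)  /* 32-bit big-endian: 0x12345678 */
                        | ((unsigned long)p[1] << 16)
                        | ((unsigned long)p[2] <<  8)
                        |  (unsigned long)p[3];

        printf("x=%x y=%x z=%lx\n", x, y, z);
        return 0;
    }

The code produces the same answers no matter which endianness the machine itself uses, which is exactly the point.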

Even though C doesn't enforce it, students still need to learn that "internal" data should be kept separate from "external" data. They have to be aware of what C is doing internally, but they shouldn't mess with it. They shouldn't "cast" data structures on top of external data, and they shouldn't use pointer arithmetic for everything. When quickly glancing at the code, it should look similar to Java or JavaScript.

Conclusion


In other engineering disciplines, you learn failure first. If you are engineering bridges, you learn why the Tacoma Narrows bridge failed. If you are architecting skyscrapers, you learn why the World Trade Center fell. If you are building ships, you learn why the Titanic sank.

Only in the discipline of software engineering is failure completely ignored. Even after more than a decade of failure, students are still taught to write software as if failure were not a threat they will ever face. This is wrong in general, but especially wrong with the C programming language. It's not that students need to "take security seriously" and spend all their time learning every obscure hacking technique, but their education should include the basics, like those outlined above.




By the way, this post comes from me spending some time on a college campus this week. Travis Goodspeed, Sergey Bratus, and I were watching a student struggle with a bug in C. For example, the student knew what line was crashing only because she had put a "printf()" right before that point. I want to strangle whichever professor taught that class without teaching the students to use a debugger. Some of this post reflects their comments in our discussion.

9 comments:

WT said...

I would've appreciated taking your "course" instead of the one I took. My CS program (mid to late 90s) thought -- I think -- it was doing us a service by teaching us C on Unix in an "old school" manner. They required us to use vi for our editing. I didn't use a debugger till my senior year, and only then because a grad student showed me. And it wasn't a good one...

irve said...

I've been teaching C for a while and I have drifted towards similar ideas.

My current observation is that to "get" the debugger, the student has to have some overview of the architecture of a computer. So how it currently goes is that some compilation basics are taught; then I go to stack/heap examples with some assembly, then to debugging and stack smashing, and then some library stuff.

mjw said...

I'm not sure how you decided that GDB is not an adequate debugger. As an undergraduate my programming classes were linux based and gdb was my friend. Having since moved on to security and spending a lot of time in Olly/Immdbg and Windbg I can honestly say that gdb isn't bad and at least it doesn't crash.

Robert Graham said...

MJW: I say GDB is not adequate because people avoid it. They use GDB as the last resort to fix bugs, when they should be using the debugger to step through each line of code as they write it, even when it works.

It's simply my observation: programmers using VisualStudio and XCode frequently step through working code, programmers using GDB don't, and use "printf()" in preference to GDB.

mateor said...

I am in a systems course where all the students are pretty much learning C for the first time (juniors/seniors). It is Spring Break starting today, and we have used no tools at all, except the recommendation of gedit and gcc. We are being taught in lab by a pretty great grad student; the prof has taught no coding. We are learning stuff like the pthread library and solving dining philosophers. You have inspired me to move to loading my C code in a proper IDE. I was somewhat under the impression that was looked at as cheating...

Christopher Jefferson said...

My main issue with students depending on the debugger very early is that it is hard to teach them that what the debugger is displaying is not trustworthy if your program has already invoked undefined behaviour (writing off the end of an array, or reading/writing freed memory being the two most obvious candidates).

Eck! said...

Been writing in C (as in K&R's white book) since about 1979; my personal copy came from K himself. Back then most of the tools mentioned were nonexistent and the usual process was: write sound and complete code, feed it to the compiler, debug the resulting ASM/MACRO code with the machine-language debugger. If you had errant reads or writes there was no hardware to block it (at least not till I got to use the PDP-11s and VAX). About 80% of the code I wrote back then ran on Z80, unprotected and constrained. C was close to assembler but more like a macro language. Either way, if you could be stupid in ASM, you could likely be twice as stupid in C.

Now, with today's computing horsepower and good debuggers, not knowing what's going on inside is unthinkable, and likely common.

Eck!

markoer said...

GDB is simply worthless if you are doing multithreaded programming.
I had this issue multiple times and with several versions of UNIX and Linux; I had to port the code to Windows and use Visual Studio to be able to effectively debug the program.
GDB simply isn't capable of switching between threads. It is OK for occasional usage, but it simply isn't a professional tool.

Anonymous said...

You may have used an older version of GDB. I have used GDB to debug multithreaded programs. It may not be as pretty as Visual Studio, but GDB is a ton better than some other Unix-based debuggers I've used (I'm looking at you, dbx on Solaris!).

I also use the printf() method of debugging. Sometimes, it's just easier to use than trying to get the program running under a debugger (one program at work is both multiprocess *and* multithreaded---don't ask).