Monday, October 22, 2007

Password Cracking vs. Core2/Athlon64

Today's CPU's can "crack" passwords 8 times faster than they can "check" the passwords. Normal software checks a password using roughly 500 CPU cycles, but a brute force password cracking program can check a password in roughly 65 CPU cycles. Brute force programs do this because they check multiple passwords at once using the parallel features of the CPU.

The United States is currently working on new "Advanced Hash Standard" to replace the current SHA-1 (to be used in things such as passwords). Normally, crypto algorithms are designed ignoring brute-force cracking, because it only makes a small difference overall (it only changes a constant on the complexity of the algorithm). However, since we know that software cracking is the primary way of breaking these things, it might be worthwhile tweaking the algorithm to make cracking more difficult.

In other words, if brute-force software can crack passwords eight times faster, we know that that a 160-bit hash algorithm has only 157-bits of effectiveness. If we tweak the algorithm to defeat the brute-force acceleration, then the algorithm would have a full 160-bits worth of effectiveness.

Today's CPUs are "wide", which means they can execute multiple instructions simultaneously. However, this only works if the instructions are INDEPENDENT. If the two instructions operate on the same registers (DEPENDENT), then the second instruction must wait for the first to complete before it can execute.The following is an example:

Dependent (runs in two clock cycles)
add r1 -> r2
add r2 -> r3

Independent (runs in one clock cycle)
add r1 -> r2
add r7 -> r8

The problem with cryptographic code is that it normally contains a long chain of instructions dependent upon each other. Thus, checking a single key takes 500 instructions, and therefore roughly 500 clock cycles. To get around this, brute force password crackers check multiple passwords at the same time, then interleave the instructions. They check eight passwords in the same 500 clock cycles.

The latest processors from AMD and Intel contain 3 execution units. Thus, in theory, they could check 3 passwords at the same time it would normally take to check a single password.

An even more powerful feature is the "SIMD" units. The acronym SIMD means "single-instruction multiple-data". This widens the registers so that they can hold 4 separate values simultaneously. A single instruction that adds two SIMD registers together would in fact do four simultaneous additions.

The modern processors from Intel and AMD contain 2 SIMD execution units. This means they can, in theory, check 8 passwords at the same time (two sets of registers, each holding four values).

The following is a snippet of code from John the Ripper that cracks Windows passwords. Each SIMD instruction is repeated twice for two different sets of passwords, and each register holds four different values for four different passwords. Thus, eight passwords are checked simultaneously.

paddd (512*base)+(x*32)+nt_buffer8x, aa
paddd (512*base)+(x*32)+16+nt_buffer8x, aa3
movdqa cc, t1
movdqa cc3, t13
pxor dd, t1
pxor dd3, t13
pand bb, t1
pand bb3, t13
pxor dd, t1
pxor dd3, t13
paddd t1, aa
paddd t13, aa3
movdqa aa, t2
movdqa aa3, t23
pslld $s, aa
pslld $s, aa3
psrld $(32-s), t2
psrld $(32-s), t23
por t2, aa
por t23, aa3

Right now, the above John the Ripper code doesn't give the eight-fold increase you would expect. It would take a bit more machine-language zen than simply duplicating the instructions in order to get closer to the theoretical performance. The second reason is that John the Ripper isn't fully optimized for this level of speed: while inner loop can check a password in 65 cycles instead of 500 cycles, it still takes 100 cycles for the program to choose a new password. In other words, if we could make checking passwords infinitely fast, it still would only double the speed of the code.

To get more speed, ee'd have to bring new password selection deeper into the code. For example, we could pass in a single password, and the code could add the constants {0,1,2,3, 4, 5,6,7} onto the end, thus choosing the eight sequential passwords in only a single clock cycle.

In order to defeat parallel cracking of hashes, the next hash standard could include parallel elements itself. They would have to break the long dependency chains in the algorithm into 8 independent chains. This means they could also increase the number of logical operations within the algorithm, making it 8 times stronger. Thus, a single hash could be calculated in the same 500 cycles as today's hashes, but with 8 times the number of logic operations. Brute-force crackers would therefore get no benefit from checking multiple keys at the same time.

An alternate design technique would be to include features that make SIMD processing difficult. Raw cryptographic operations consist of "permutations" and "substitutions". A permutation is an operation like "shift", "xor", "add", etc. A substitution is a table lookup of a new, mathematically unrelated value given a current value. For example, give a byte with a value of 0x61, looking the up in a table might replace that with 0xF7. There are SIMD instructions for all the permutation operators, but there aren't any equivalents for substitution operations.

The hashes we use now for passwords, such as MD4 and MD5, do not use substitutions, which is why they can be accelerated with SIMD. Some passwords still use DES, which contains a lot of substitution operations. Luckily (for crackers), a DES substitution can be replaced with multiple permutations, so it still gets accelerated by SIMD, but not as much acceleration as with MD5.

All of this may become a moot point. You can now get cheap FPGAs (for example, from Pico Computing) and open source code that cracks passwords even faster. You can also exploit graphics cards (and their GPUs) for faster cracking. Finally, botnets can herd millions of machines into a password cracking effort. With all this additional power, making software cracking slightly harder may not be worth it.


CONCLUSION: Today's crypto algorithms ignore the practicalities of software brute-force cracking. As a result, they are a few bits less effective than theory would admit. The reason is that carefully designed cracking code can take advantage of the parallelism in modern processors (multiple execution units, SIMD processors). A few algorithmic changes can defeat this, either by incorporating parallelism into the algorithm design, or by including features that are difficult to make parallel. This may be unnecessary: while the majority of brute-force cracking is done in software today, hardware approaches using FPGAs or graphics-cards may be the way of the future.

FAQ: BTW, by "multiple execution units", I mean in a single CPU core. I'm ignoring multiple cores, which do make cracking faster as well, but at the normal Moore's Law progression.

4 comments:

mokum von Amsterdam said...

Funny that you mention cracking pwd's ATM. The user of the GPU seems a good option too:

http://www.theregister.co.uk/2007/10/24/elcomsoft_uses_geforce8_for_password_crack/

Scott said...

Yes I find that running a personal firewall does help to keep threats off of my computer but there’s a lot more to antivirus internet security than just a firewall. As we all know a good firewall keeps something things out and can help protect against some viruses and data theft but many common spyware and malware attacks often find their way through.

The trick today when dealing with pc security is to find one comprehensive program that does it all. You need shouldn’t have to purchase Identity Theft Software in addition to other endpoint security programs, this type of thing should be included in one package. So my advice to you is to spend some time looking around the internet for good a program that will provide you with complete desktop security because a firewall alone just isn’t going to cut it.

Chris Adams said...

At some point doesn't it make more sense to focus on updating common systems to use something like bcrypt? I suspect that everyone will hear that they should use "AHS1() because it's really hacker-proof" and we'll be in the same situation after a few cycles of Moore's Law.

UncleK said...

Is your envisioned attack scenario that of someone getting hold of a machine and attacking the password store?