## Tuesday, December 31, 2013

### Masscan: designing my own crypto

Like many programmers, one of the things I like to do is design my own crypto algorithms. Specifically, at the heart of my port-scanner masscan is a cryptographic algorithm for randomizing IP addresses and port numbers.

This algorithm has flaws. Well, it's good enough for port scanning, but it's not cryptographically secure. In this post, I describe how graph stuff so that these flaws can be detected. Update: I added a second nmap sample to compare against.

One principle of cryptography is that perfect crypto is also a perfect random number generator. I use this trick to randomize my scans. I iterate over all possible targets, then 'encrypt' the index variable, so that it picks the targets randomly. I created a short JavaScript program that describes this in more detail.

The problem is that the number of targets isn't an even binary number, like 32, 64, or 128. Instead, the number of targets being scanned will be a weird number, like 97 or 202. Modern crypto is binary, based on operations like XOR. It can't handle non-binary numbers. Therefore, I had to find an alternative.

There are a number of solutions to this problem, but my current solution uses an algorithm I call blackrock, after the authors of this paper. This algorithm is based on DES, the plain old Data Encryption Standard. It replaces the binary operations like XOR with it's non-binary mathematical equivalents. The upshot is that when I encrypt a non-binary number in the range of [0-96], the result will be another number within the same range.

If you look at my implementation, though, you'll notice some flaws. Among the bad things I do:

Thus, we know that this function isn't cryptographically secure -- at least not yet (until I fix it).

Detecting the flaws

Given just the output, how can we detect these flaws? One quick-and-dirty way is to graph the output -- this allows the human eye to pick out patterns that might otherwise be invisible.

The way I do this simple. I grab the first thousand (1000) IP addresses from a scan, then use the final two bytes of the address as x and y coordinates. Specifically, I run masscan with the following Unix command-line:

masscan 10.0.0.0/8 -p80 -sL --seed 1 | cut -d . -f 3,4 | tr . , | head -n 1000 > plot1.csv

The -sL option lists the IP addresses instead of scanning. The cut and tr commands extract the final two octets of an address and replace the period with a command, thus an IP address like "10.23.7.129" because "7,129". I then save this as a csv file.

I imported this csv file into Excel and graphed the results, shown below:

What you see here is that the output is indeed mostly random. Had I used to a simple mathematical function to randomize the output, instead of a cryptographic function, you'd see clear mathematical artifacts (like straight lines).

While there is no hard artifact, you do see some horizontal banding going on. This shows that I'm not as random as I'd hoped.

For comparison, I did the same thing with nmap using it's '-iR' feature:

nmap -iR 1000 -n -sL | cut -d . -f 3,4 | tr . , | head -n 1000 > plot-n.csv

I get the following output:

Update: The function that generates this is here (h/t @bonsaiviking). It uses a typical random number generator, but with three rounds, and a shifting technique remove patterns from the output. It's based on the fact that the -iR option chooses from the entire 32-bit address space, so it can use a binary technique.

Update:  On twitter, @bonsaiviking suggests that I should use nmap's --randomize-hosts option instead of -iR. I therefore ran the following command:

nmap 10.0.0.0/8 --randomize-hosts -n -sL | cut -d . -f 3,4 | tr . , | head -n 1000 > plot-n2.csv

Graphing this produces the following results:

There are two things to notice. The first is that the x axis only goes to 64, not to 256. That's because nmap is stateful and synchronous. It stores a list of all IP addresses in memory, then rearranges their order in memory. It only does 64 subnets at a time, so when looking at the first 1000 output values, we see only the first 64 subnets. The code nmap uses is here. (The reason I wrote masscan was to avoid this in-memory shuffling).

The second thing to notice is that there are lots of vertical lines in the above pattern. This is confusing, because in theory, nmap uses a hard cryptographic function, RC4, to shuffle things, so no patterns should be visible. I've rerun this a couple times, and I still see those vertical lines. Assuming I'm not imagining these vertical lines, it's a good example of discovering patterns when the original programmers believe no such patterns should exist.

Performance

The reason for masscan's currently non-cryptographic randomization is idiv or integer division operation. On Intel CPUs, the integer division operation takes 90 clock cycles. The blackrock crypto algorithm does one of these per round (actually, two, but Intel processors can execute two divisions in parallel). This is by far the slowest part of the code: everything else my code does to generate packets takes less time than the roughly ~300 clock cycles needed to run the encryption. That's why I don't arbitrarily increase the number of rounds from 3 to 16. There is way around doing integer division, but I haven't implemented it yet. Once I do, I'll probably expand the algorithm out to the full 16 rounds, and make it roughly cryptographically secure.

Conclusion

The point of this post is to discuss the issue of "designing your own crypto". According to mathematical tests, masscan's output is perfectly random. Yet, but using a common trick to visualize the output, I can see that it's not perfectly random. There are artifacts that would not be visible had the encryption been correct.

Now, the output of masscan isn't supposed be cryptographically secure, so this isn't necessarily a "bug", but it would be nice if it were secure. Eventually I'll get around to adding this feature.

Update: Tom Sellers, an nmap contributor (according to his profile), uses the "sort -R" command to randomize hosts (as described here and here). So I graphed this:

I see no artifacts. According to "man sort", the -R option does a "sort by random hash of keys". I don't know what that means. Typical "hash table" style hashes would show artifacts in the above graph. That no artifacts are visible suggests they are using a cryptographic hash function, like MD5.

Update: Paul McMillan sent me the first 1000 addresses from ZMap, another Internet-scale port scanner. They describe their function as a simple mathematical equation which I assume would show up some hard artifacts:

However, I don't see any artifacts. Maybe I need to try a bunch of other visualizations in order to get the at the relationships that are supposed to be in the nmap -iR and ZMap scans.