Tuesday, May 31, 2016

From scratch: why these mass scans are important

The way the Internet works is that "packets" are sent to an "address". It's the same principle as sending envelopes through the mail. Just put an address on it, hand it to the nearest "router", and the packet will get forwarded hop-to-hop through the Internet in the direction of the destination.

What you see as the address at the top of your web browser, like "www.google.com" or "facebook.com" is not the actual address. Instead, the real address is a number. In much the same way a phonebook (or contact list) translates a person's name to their phone number, there is a similar system that translates Internet names to Internet addresses.

There are only 4 billion Internet addresses. Each one is a number between 0 and 4,294,967,295. In binary, it's 32 bits in size, which comes out to 4,294,967,296 (roughly 4 billion) combinations.

For no good reason, early Internet pioneers split up that 32-bit number into four 8-bit numbers, each of which has 256 combinations (256 × 256 × 256 × 256 = 4,294,967,296). That is why we write an Internet address as four numbers separated by dots.
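To make the split concrete, here is a minimal Python sketch that cuts a 32-bit number into four 8-bit pieces and joins them with dots. The function name `to_dotted_quad` and the sample number are made up for illustration; the standard library's `ipaddress` module is used only as a cross-check.

```python
import ipaddress


def to_dotted_quad(n: int) -> str:
    """Split a 32-bit number into four 8-bit numbers joined by dots."""
    if not 0 <= n < 2 ** 32:
        raise ValueError("not a 32-bit address")
    # Take the top 8 bits first, then the next 8, and so on.
    octets = [(n >> shift) & 0xFF for shift in (24, 16, 8, 0)]
    return ".".join(str(octet) for octet in octets)


example = 3232235777  # arbitrary illustrative value, not taken from the post
print(to_dotted_quad(example))

# Cross-check against Python's own address type.
assert to_dotted_quad(example) == str(ipaddress.IPv4Address(example))

# Four 8-bit numbers give the same count as one 32-bit number.
assert 256 ** 4 == 2 ** 32
```

The machine still sees one 32-bit number; the dotted form is purely how it is written down for people.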

Yes, as you astutely point out, there are many more than 4 billion devices on the Internet (the number is closer to 10 billion). What happens is that we use address sharing (also called "network address translation"), so that many devices can share a single Internet address. Each of the devices in your home (laptop, iPad, Nest thermostat, WiFi-enabled Barbie, etc.) has a unique address that only works in the home. When the packets go through your home router to the Internet, they get changed so that they all come from the same Internet address.

This sharing only works when the device is what's called a "client", which consumes stuff on the Internet (like watching video, reading webpages), but which doesn't provide anything to the Internet. Your iPad reaches out to the Internet, but in general nothing on the Internet is trying to reach your iPad. Sure, I can make a FaceTime video call to your iPad, but that's because both of us are clients of Apple's corporate computers.
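Address sharing can be pictured as a lookup table inside the home router. The following is a toy Python model, not how any particular router is implemented: the class name `HomeRouter`, the starting port 40000, and all addresses used with it are assumptions chosen for illustration.

```python
class HomeRouter:
    """Toy model of address sharing (NAT). Real routers track much more state."""

    def __init__(self, public_ip: str, first_port: int = 40000):
        self.public_ip = public_ip
        self.next_port = first_port
        self.outbound = {}  # (private_ip, private_port) -> public_port
        self.inbound = {}   # public_port -> (private_ip, private_port)

    def translate_out(self, private_ip: str, private_port: int):
        """Rewrite an outgoing packet's source to the shared public address."""
        key = (private_ip, private_port)
        if key not in self.outbound:
            self.outbound[key] = self.next_port
            self.inbound[self.next_port] = key
            self.next_port += 1
        return (self.public_ip, self.outbound[key])

    def translate_in(self, public_port: int):
        """Find which home device a reply belongs to, or None if nobody asked."""
        return self.inbound.get(public_port)


router = HomeRouter("203.0.113.7")  # documentation-range address, hypothetical
laptop = router.translate_out("192.168.1.10", 51000)
ipad = router.translate_out("192.168.1.11", 51000)
print(laptop, ipad)                # same public address, different ports
print(router.translate_in(12345))  # unsolicited packet: no mapping exists
```

The last line is the reason sharing only works for clients: a packet arriving from the Internet that no home device asked for has no table entry, so the router has nowhere to send it.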

The opposite of a client is a "server". These are the computers that provide things to the Internet. These are the things you are trying to reach. There are web servers, email servers, chat servers, and so on. When you hear about Apple or Facebook building a huge "data center" somewhere, it's just a big building full of servers.

A single computer can provide many services. They are distinguished by a "port" number between 0 and 65,535 (a 16-bit number). Different services tend to run on "well known" ports. The well known port for encrypted web servers is 443 (no, there's no good reason that number was chosen out of 65,536 combinations; it's not otherwise meaningful). Non-encrypted web servers are at port 80, by the way, but all servers by now should be encrypted.

Web links can contain the port number, as in "https://www.google.com:443". However, if you are using the default, then you can omit it, so "https://www.google.com" is just fine. Any other port must be specified, such as "https://www.robertgraham.com:3774/some/secret.pdf". When you visit such links within your browser, it'll translate the name into an Internet address, then send packets to the combination address:port.
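The default-port rule can be sketched with Python's standard `urllib.parse` module. The helper name `effective_port` and the small `DEFAULT_PORTS` table are assumptions for this example; a real browser knows many more schemes.

```python
from urllib.parse import urlsplit

# Only the two web defaults mentioned in this post.
DEFAULT_PORTS = {"https": 443, "http": 80}


def effective_port(url: str) -> int:
    """Return the port a browser would connect to for this link."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port              # port written explicitly in the link
    return DEFAULT_PORTS[parts.scheme]  # port omitted, fall back to default


for link in (
    "https://www.google.com:443",
    "https://www.google.com",
    "https://www.robertgraham.com:3774/some/secret.pdf",
):
    print(link, "->", effective_port(link))
```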

Normally, when you look for things on the web, you use a search engine like Google. Google works by "spidering" the Internet, reading pages, then following links to other pages. After I post this blog post, Google is going to add "https://www.robertgraham.com:3774/some/secret.pdf" to its index and try to read that webpage. It doesn't exist, but Google will think it does, because it reads this page and follows the link.

There is an idea called the "Dark Internet" which consists of everything Google can't find. Google finds only web pages. It doesn't find all the other services on the Internet. It doesn't find anything not already linked somewhere on the web.

And that's where my program "masscan" comes into play. It searches for "Dark Internet" services that aren't findable in Google. It does this by sending a packet to every machine on the Internet.

In other words, if I wanted to find every (encrypted) web server on the Internet, I would blast out 4 billion packets, one to each address at port 443. I would then listen for reply packets. All valid acknowledgements mean there's a computer with that address running such a service. When I do this, I get about 30 million responses, by the way. Since a single web server can host many websites, the actual number of websites is more like a billion.
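For a single address, the "send a packet and listen for a reply" idea can be approximated in Python. This sketch swaps in an ordinary TCP connection made through the operating system, which is far slower than blasting out packets and is not a description of masscan's internals. The function name `port_is_open` and the one-second timeout are assumptions.

```python
import socket


def port_is_open(address: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something at address:port accepts a TCP connection."""
    try:
        # create_connection performs the full handshake; success means a
        # service acknowledged us on that port.
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat all of these as "closed".
        return False
```

Calling `port_is_open("127.0.0.1", 443)` checks whether your own machine runs an encrypted web server. Only probe machines you have permission to test.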

Such a scan is possible because even though it takes 4 billion packets to do this, networks are really fast. A gigabit network connection, such as the type Google Fiber might provide you, can transmit packets at the rate of 1 million per second. That means, in order to scan the entire Internet, I'd only need 4 thousand seconds, or about an hour.

People get mad when I scan this fast, especially those with large networks who see a flood of packets from me in an hour. Therefore, I usually scan slower, at only 125,000 packets per second, which takes about 10 hours to complete a scan.
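The arithmetic behind those two durations is easy to check. This short Python sketch assumes one packet per address across all 2^32 addresses and ignores retransmissions and excluded ranges; the function name `scan_seconds` is made up for the example.

```python
ADDRESSES = 2 ** 32  # one packet per possible Internet address


def scan_seconds(packets_per_second: int) -> float:
    """Time to send one packet to every address at a steady rate."""
    return ADDRESSES / packets_per_second


fast = scan_seconds(1_000_000)  # gigabit link, about 1 million packets/second
slow = scan_seconds(125_000)    # the politer rate

print(f"fast: {fast:.0f} seconds, about {fast / 3600:.1f} hours")
print(f"slow: about {slow / 3600:.1f} hours")
# fast: 4295 seconds, about 1.2 hours
# slow: about 9.5 hours
```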

Two years ago a bug in encrypted web services was found, called "Heartbleed". How important a bug was it? Well, with masscan, I can easily send a packet to all 4 billion addresses, and test them to see if they are vulnerable. The last time I did this, I found about 300,000 servers still vulnerable to the bug.

Right at the moment, I'm doing a much more expansive scan. Instead of scanning for a single port, I'm scanning for all possible ports (all 65,536 of them). That's a huge scan that would take 50 years at my current rate, or 5 years if I ran at maximum speed on my Internet link. I don't plan on finishing the scan. Instead, I'll stop it after a couple of weeks and treat the results as a sort of random sample of services on the Internet.

One finding I have is a service called "SSH". It's a popular service that administrators (the computer professionals who maintain computers) use to connect to servers to control them. Normally, it uses port 22. Consider the output of my full scan below:

What you see is that I'm finding SSH on all sorts of ports. For every time somebody put SSH on the expected port of 22, roughly 15 people have decided to change the port and put it somewhere else.

There are two reasons they might do so. The first is because of a belief in the fallacy of security through obscurity, that if they choose some random number other than 22, then hackers won't find it. That's likely the case where we see old versions of SSH in the above picture, such as version 1.5 instead of the newer 2.0. That this is a fallacy is demonstrated by the fact that I can so easily find these obscure port numbers.

The other reason, though, is simply to avoid the noise of the Internet. Hackers are constantly scanning the Internet for SSH on port 22, and once they find it, start "grinding" passwords, trying password after password until they find one that works. This fills up log files and annoys people, so they put their services on other ports.

Note in the above picture two entries where Internet addresses starting with 121.209.84.x have SSH running at port 5000. Looking on the Internet, it seems these addresses belong to Telstra. It seems they have some standard policy of putting SSH on port 5000. If you were a hacker wanting to break into Telstra, that sort of information would be useful to you. That's the reason for doing this scan. I'm not going to grab all address:port combinations, but enough that I can start finding patterns.
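The version numbers mentioned above (1.5 versus 2.0) come from the greeting line an SSH server sends when you connect, which has the form "SSH-protoversion-softwareversion" followed by optional comments. Here is a minimal Python sketch of pulling those fields apart. The function names and the sample banners are made up for illustration and are not entries from the scan.

```python
def parse_ssh_banner(banner: str):
    """Return (protocol_version, software) from an SSH greeting line."""
    line = banner.strip()
    if not line.startswith("SSH-"):
        raise ValueError("not an SSH banner")
    # Split on the first two dashes only; the software name may contain more.
    _, proto, rest = line.split("-", 2)
    software = rest.split(" ", 1)[0]  # drop any trailing comment text
    return proto, software


def speaks_only_ssh1(proto: str) -> bool:
    """True for old 1.x servers; "1.99" means the server also speaks 2.0."""
    return proto.startswith("1.") and proto != "1.99"


# Hypothetical sample banners, not real scan results.
samples = [
    "SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2\r\n",
    "SSH-1.5-1.2.27\r\n",
]
for banner in samples:
    proto, software = parse_ssh_banner(banner)
    print(proto, software, "old" if speaks_only_ssh1(proto) else "current")
```

Grouping results by address prefix and port after parsing is how a pattern like a provider-wide port choice would start to stand out.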

Another thing I've found relates to something called VNC. It allows one computer to connect to the screen of another computer, so that you can see their desktop. It normally runs at port 5900. When you masscan the entire Internet for that port, you'll find lots of cases where people have the VNC service installed on their computer and exposed to the Internet, but without a password. This article describes some of the fun things we find in these searches, from toilets, to power plants, to people's Windows desktops, to Korean advertising signs.
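A VNC server identifies itself in much the same way as SSH. Under the RFB protocol that VNC uses, the server first sends a 12-byte greeting such as "RFB 003.008" plus a newline, and in protocol versions 3.7 and 3.8 it then sends a list of security types, where type 1 means "None" (no password). The Python sketch below only parses those bytes and does not connect to anything. The function names and sample bytes are assumptions for illustration.

```python
import re

# The 12-byte greeting: "RFB xxx.yyy\n"
RFB_VERSION = re.compile(rb"RFB (\d{3})\.(\d{3})\n")

SECURITY_NONE = 1  # RFB security type "None": no password required


def parse_rfb_version(greeting: bytes):
    """Return (major, minor) from the server's greeting."""
    match = RFB_VERSION.fullmatch(greeting)
    if match is None:
        raise ValueError("not an RFB greeting")
    return int(match.group(1)), int(match.group(2))


def offers_no_auth(security_message: bytes) -> bool:
    """Interpret the security-types message used by RFB 3.7 and 3.8.

    The first byte is a count, followed by that many one-byte type codes.
    """
    if not security_message:
        return False
    count = security_message[0]
    types = security_message[1:1 + count]
    return SECURITY_NONE in types


print(parse_rfb_version(b"RFB 003.008\n"))
print(offers_no_auth(b"\x02\x01\x02"))  # offers None and password auth
print(offers_no_auth(b"\x01\x02"))      # password auth only
```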

But this full scan finds VNC running at other ports, as shown in the following picture.

For everybody running VNC on the standard port, it appears about 5 to 10 people are running it on some other random port. A full scan of the Internet, on all ports, would find a much richer set of VNC servers.


I tweet my research stuff often, but it's often inscrutable, since you are supposed to know things like VNC, SSH, and random/standard port numbers, which even among techies isn't all that common. In this post, I tried to describe from scratch the implications of the sorts of things I'm finding.


John Thacker said...

There's a couple other reasons why SSH might be on another port, as you know. One reason is that their ISP (or corporate employer) might have a blunt or dumb firewall that blocks port 22. Another reason is that they use port forwarding combined with their NAT to forward to ssh on various devices with private IP addresses.

Anonymous said...

It's quite remarkable that you're finding 15 times as many SSH servers on non-standard ports as on port 22. That's exactly counter to what I would've expected. I realize that it's fairly widespread advice for admins to put SSH on a non-standard port to reduce noise from script kiddies, but I would've expected that there are far more people who simply run "apt-get/yum install openssh-server" and forget about it, than who actually have heard and follow that advice.

Have you by chance spotted any patterns that may suggest why these SSH servers on non-standard ports are so prevalent? A default configuration of some particular system(s) by chance? On the other hand, a large enough proportion of the results in your screenshot appear to be the run-of-the-mill "Ubuntu", "Debian", etc. banners you'd expect from popular desktop/server Linux distros, so maybe it really is true that people are manually following this "rule of thumb" on a widespread basis. It still seems counterintuitive, though. Users are too lazy for that. :-)

Conversely, is there something about your method that might be undersampling port 22 by chance?

Simon said...

"For no good reason, early Internet pioneers split up that 32-bit number into four 8-bit numbers"

For easy human reading, that's why.
The 4 values make sense because of how the addressing is split in powers of two.
Good luck identifying ranges with a 32-bit number.

But I guess you know that, so why do you say that?

Having a human format that doesn't impact the native machine format is only a plus.

nakchak said...

"For no good reason, early Internet pioneers split up that 32-bit number into four 8-bit numbers"

Was always under the impression that as most of the very early donkey work was done in the '70s on 8- and 16-bit microprocessor architectures, being able to address the memory which contains the address is considerably easier and, more importantly, quicker (we are talking of a clock speed measured in the single MHz region) if you have four 8-bit numbers, which can be read serially directly to the register, rather than buffered and accessed via pointer...

As a fringe benefit we get human readable/memorable identifiers, but really it's down to physical engineering limits and optimised memory access.