What you see as the address at the top of your web browser, like "www.google.com" or "facebook.com" is not the actual address. Instead, the real address is a number. In much the same way a phonebook (or contact list) translates a person's name to their phone number, there is a similar system that translates Internet names to Internet addresses.
There are only 4 billion Internet addresses. It's a number between between 0 and 4,294,967,296. In binary, it's 32-bits in size, which comes out to that roughly 4 billion combinations.
For no good reason, early Internet pioneers split up that 32-bit number into four 8-bit numbers, which each has 256 combinations (256 × 256 × 256 × 256 = 4294967296). Thus, why write Internet address like "192.168.38.28" or "10.0.0.1".
Yes, as you astutely point out, there are many more than 4 billion devices on the Internet (the number is closer to around 10 billion). What happens is that we can use address sharing (also called "network address translation"), so that many devices can share a single Internet adress. All the devices in your home (laptop, iPad, Nest thermistat, WiFi enabled Barbie, etc.) has a unique address that only works in the home. When the packets go through your home router to the Internet, they get changed so that they all come from the same Internet address.
This sharing only works when the device is what's called a "client", which consumes stuff on the Internet (like watching video, reading webpages), but which doesn't provide anything to the Internet. Your iPad reaches out to the Internet, but in general nothing on the Internet is trying to reach your iPad. Sure, I can make a Facetime video call to your iPad, but that's because both of us are clients of Apple's corporate computers.
The opposite of a client is a "server". These are the computers that provide things to the Internet. These are the things you are trying to reach. There are web server, email servers, chat servers, and so. When you hear about Apple or Facebook building a huge "data center" somewhere, it's just a big building full of servers.
A single computer can provide many services. They are distinguished by a number between 0 and 65,535 (a 16-bit number). Different services tend to run on "well known" ports. The well known port for encrypted web servers is 443 (no, there's no good reason that number out of 65535 combinations was chosen, it's not otherwise meaningful). Non-encrypted web-servers are at port 80, by the way, but all servers by now should be encrypted.
Web links like "https://www.google.com:443" must contain the port number. However, if you are using the default, then you can omit it, so "https://www.google.com" is just fine. However, any other port must be specified, such as "https://www.robertgraham.com:3774/some/secret.pdf". When you visit such links within your browser, it'll translate the name into an Internet address, then send packets to the combination address:port.
Normally, when you look for things on the web, you use a search engine like Google to find things. Google works by "spidering" the Internet, reading pages, then following links to other pages. After I post this blog post, Google is going to add "https://www.robertgraham.com:3774/some/secret.pdf" to it's index and try to read that webpage. It doesn't exist, but Google will think it does, because it reads this page and follows the link.
There is an idea called the "Dark Internet" which consists of everything Google can't find. Google finds only web pages. It doesn't find all the other services on the Internet. It doesn't find anything not already linked somewhere on the web.
And that's where my program "masscan" comes into play. It searches for "Dark Internet" services that aren't findable in Google. It does this by sending a packet to every machine on the Internet.
In other words, if I wanted to find every (encrypted) web server on the Internet, I would blast out 4 billion packets, one to each address at port 443. I would then listen for reply packets. All valid acknowledgements mean there's a computer with that address running such a service. When I do this, I get about 30 million responses, by the way. A single web server can host many websites, the actual number of websites is more like a billion.
Such a scan is possible because even though it takes 4 billion packets to do this, networks are really fast. A gigabit network connection, such as the type Google Fiber might provide you, can transmit packets at the rate of 1 million per second. That means, in order to scan the entire Internet, I'd only need 4 thousand seconds, or about an hour.
People get mad when I scan this fast, especially those with large networks who see a flood of packets from me in an hour. Therefore usually scan slower, at only 125,000 packets per second, which takes about 10 hours to complete a scan.
Two years ago a bug in encrypted web services was found, called "Heartbleed". How important a bug was it? Well, with masscan, I can easily send a packet to all 4 billion addresses, and test them to see if they are vulnerable. The last time I did this, I found about 300,000 servers still vulnerable to the bug.
Right at the moment, I'm doing a much more expansive scan. Instead of scanning for a single port, I'm scanning for all possible ports (all 65536 of them). That's a huge scan that would take 50 years at my current rate, or 5 years if I run at maximum speed on my Internet link. I don't plan on finishing the scan, but stopping it after a couple weeks, as sort of a random sample of services on the Internet.
One finding I have is a service called "SSH". It a popular service that administrators (the computer professional who maintain computers) use to connect to servers to control them. Normally, it uses port 22. Consider the output of my full scan below:
What you see is that I'm finding SSH on all sorts of ports. For every time somebody put SSH on the expected port of 22, roughly 15 people have decided to change the port and put it somewhere else.
There are two reasons they might do so. The first is because of a belief in the fallacy of security through obscurity, that if they choose some random number other than 22, then hackers won't find it. That's likely the case where we see old versions of SSH in the above picture, such as version 1.5 instead of the newer 2.0. That this is a fallacy is demonstrated by the fact that I can so easily find these obscure port numbers.
The other reason, though, is simply to avoid the noise of the Internet. Hackers are constantly scanning the Internet for SSH on port 22, and once they find it, start "grinding" password, trying password after password until they find one that works. This fills up log files and annoys people, so they put their services on other ports.
Note in the above picture two entries where Internet addresses starting with 121.209.84.x have SSH running at port 5000. Looking on the Internet, it seems these addresses belong to Telstra. It seems they have some standard policy of putting SSH on port 5000. If you were a hacker wanting to break into Telstra, that sort of information would be useful to you. That's the reason for doing this scan. I'm not going to grab all address:port combinations, but enough where I can start finding patterns.
Another thing I've found relates to something called VNC. It allows one computer to connect to the screen of another computer, so that you can see their desktop. It normally runs at port 5900. When you masscan the entire Internet for that port, you'll find lots of cases where people have the VNC service installed on their computer and exposed to the Internet, but without a password. This article describes some of the fun things we find in these searches, from toilets, to power plants, to people's Windows desktops, to Korean advertising signs.
But this full scan finds VNC running at other ports, as shown in the following picture.
For everybody running VNC on the standard port, it appears about 5 to 10 people are running it on some other random port. A full scan of the Internet, on all ports, would find a much richer set of VNC servers.
I tweet my research stuff often, but it's often inscrutable, since you are suppose to know things like VNC, SSH, and random/standard port numbers, which even among techies isn't all that common. In this post, I tried to describe from scratch the implications of the sorts of things I'm finding.