Wednesday, October 08, 2014

Wget off the leash

As we all know, to grab a website with wget, we'll use the "-r" option to "recurse" through all the links. There is also the '-H' option, which means that wget won't restrict itself to just one host. In other words, with '-r -H' together, it'll try to spider the entire Internet. So I did that to see what would happen.

Well, for a 32-bit process, what happened is that after more than a month, it ran out of memory. It maintained an ever-growing list of URLs it still had to visit, which can easily run into the millions. At a hundred bytes per URL and 2 gigabytes of virtual memory, it'll run out of memory after 20 million URLs -- far short of the billions on the net. That's what you see below, where 'wget' has crashed after exhausting memory. Below that I show the command I used to launch the process, starting at cnn.com as the seed with a max timeout of 5 seconds.



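For reference, a command along those lines would look something like the following -- a sketch rather than the exact invocation in the screenshot, with '-T 5' being wget's spelling of a 5-second network timeout:

    # recurse (-r), span hosts (-H), give up on slow connections
    # after 5 seconds (-T 5), seeded at cnn.com
    wget -r -H -T 5 http://www.cnn.com/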
How much data did I download from the Internet? According to 'du', the answer is 18 gigabytes, as seen in the following screenshot:



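If you want to reproduce that measurement, the usual incantation is something like this, assuming the crawl landed under the current directory:

    # total size of everything under the current directory, human-readable
    du -sh .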
It reached 79,425 individual domains, far short of the millions it held in memory. I don't know how many files it grabbed -- there are so many that it takes hours to traverse the entire directory tree.
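Counting them isn't hard, just slow. Something like the following, run from the top of the download tree, gives the domain and file counts -- the second command is the traversal that takes hours:

    # wget creates one directory per host, so this counts domains reached
    ls | wc -l

    # count every file it saved -- this walk is the slow part
    find . -type f | wc -l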

What sorts of domains did it visit? As you can see in the screenshot, all sorts of stuff, like "www.theemporiumbarber.com.au" or "hairymenofcolor.tumblr.com". How all this stuff is reached via "cnn.com", I just don't know.


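If you want to eyeball the list yourself, sampling a few host directories at random works fine (again assuming you're sitting at the top of the download tree):

    # pick ten random per-host directories to see where the crawl wandered
    ls | shuf -n 10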
Note that the point of this experiment wasn't to actually spider the net; there are far better tools for that. Also, there is a nice project on Amazon AWS called the "Common Crawl Corpus" where they crawl the Internet for you (billions of links) and then let you process it with your own EC2 instance.

Instead, the point is what hackers always do. In this case, it's answering the question "I wonder what -H does". I mean, I know what it does, but I still wonder what happens. Now I've got a nice 18 gigabytes of random stuff from the Internet that shows exactly what happens.

You can get better, more rigorous data sets (like the Common Crawl stuff), but if you want a copy of this data set, hit me up at the next hacker/security con. I'll probably have it on a USB 3.0 flash drive (srsly, my flash drives are now 64 gigabytes in size -- for the small ones). It'll be good for various testing projects, like building parsers for things like JPEGs or PDFs.


1 comment:

dre said...

How good are the crawlers? How do they perform in the WIVET benchmarks? How well do they handle execution flow when input parameters include esoteric technology components such as objects, Remoting, GWT, JSON/XML/etc?