Tuesday, September 02, 2008

Spammers like Aardvarks more than Zebras?

In the news recently is this story that suggests how much spam you get may depend on the first letter in your e-mail address. It suggests that if you choose an e-mail address like "zebra@erratasec.com" you will receive a smaller proportion of spam than if you choose an address like "aardvark@erratasec.com".

This paper has impressive, scientific-looking graphs. The problem is that it really isn't scientific at all. It is a lot of guesswork built upon assumptions.

One of the first problems is that they only looked at a single ISP, Demon Internet (a big ISP in Great Britain). The effects they see could be localized to that ISP.

Another problem is that they ignore most spam. Demon Internet blocks connections from "blackholed" IP addresses (Internet addresses that are known to send lots of spam). They also ignore other kinds of spam, such as those pretending to be bounce messages. The spam they are ignoring might change the picture if it were factored in.

Another problem is how they classify spam, which is done by "Cloudmark". What they may be seeing is not so much that "aardvark" receives more spam than "zebra", but that Cloudmark is more likely to identify its mail as spam (possibly even falsely).

The author theorizes that "the root cause is likely to be spammers using 'dictionary' or 'Rumpelstiltskin' attacks to guess valid email addresses". There is not nearly enough data to support that theory.

I would suggest a different cause. The UK has a lot of recent immigrants, especially from places like Poland, who have names that start with letters that are not otherwise common in the UK, such as 'z', 'v', 'o', etc. Other spam studies show that English is the most spammed language. An immigrant speaking another language is therefore likely to receive less spam simply because they aren't giving out their e-mail address to English-speaking places.

My theory is testable by doing the same study using the LAST letters of e-mail addresses instead of the FIRST. I suspect that the letter 'i' is not a common last letter of English surnames, but is more common elsewhere (such as Poland or Italy). If the author's theory is correct, then there should be no significant pattern among the last letters of e-mail names. If my theory is correct, you'll see a pattern similar to the one for the first letter. (Note that I doubt either theory is correct - there is probably more going on than either of us can imagine).
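To make the first-letter versus last-letter comparison concrete, here is a rough sketch in Python. It is only an illustration under my own assumptions: the function name, the (address, is_spam) record format, and the sample addresses are all made up, and none of it comes from the paper or from Demon Internet's data. It tallies the fraction of mail flagged as spam, grouped by one letter of the part before the "@".

```python
from collections import defaultdict

def spam_fraction_by_letter(records, position):
    """Fraction of messages flagged as spam, grouped by one letter of the local part.

    records  -- iterable of (address, is_spam) pairs (hypothetical format)
    position -- 0 for the first letter, -1 for the last letter
    """
    totals = defaultdict(int)
    spam = defaultdict(int)
    for address, is_spam in records:
        # Keep only the part before the "@", lowercased.
        local = address.split("@", 1)[0].lower()
        if not local:
            continue  # skip malformed addresses with an empty local part
        letter = local[position]
        totals[letter] += 1
        if is_spam:
            spam[letter] += 1
    return {letter: spam[letter] / totals[letter] for letter in totals}

# Made-up sample data, purely for illustration.
sample = [
    ("aardvark@example.com", True),
    ("aardvark@example.com", True),
    ("aardvark@example.com", False),
    ("zebra@example.com", True),
    ("zebra@example.com", False),
    ("kowalski@example.com", True),
    ("kowalski@example.com", False),
]

by_first = spam_fraction_by_letter(sample, 0)   # the paper's grouping
by_last = spam_fraction_by_letter(sample, -1)   # the proposed control
```

With real data you would compare the two tables: if the last-letter table shows as much structure as the first-letter table, a dictionary attack that walks the alphabet from 'a' is not a sufficient explanation. A real test would also need enough mail per letter for the differences to be statistically meaningful, which this toy sample does not have.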

In scientific terms, this is a "control". Finding a pattern in spammed e-mail addresses isn't interesting unless you can show that the pattern is unlikely or surprising. I'm not surprised that they found a pattern with the first letter of e-mail addresses. I suggest, however, that instead of the single reason they found "likely", there are probably many reasons.

The reason I'm jaded on this issue is the old paper on Outwitting the Witty Worm. The paper concluded that the worm targeted a "hit-list" of machines on a US Military base. However, the paper was deeply flawed because the authors looked only at the packet headers instead of the packet payload. If they had examined the payloads, they would have found that there was no "hit-list" targeting a military base. Their conclusion that the data "suggests" an "insider" (who knew about the systems) was therefore completely false.

That is the thing we learn over time in our industry. There are a lot of interesting anomalies to be found out there, but the theories explaining the anomalies are usually bogus. I find Dr. Clayton's anomalies interesting, but I believe his conclusions have absolutely no validity.

3 comments:

Marisa Fagan said...

You might be jaded, Rob, but I think your take on this subject is very insightful :)

Unknown said...

I don't think the Witty paper says that. The targets weren't hardcoded in the worm payload, but based on PRNG state they were able to reverse engineer "patient 0". I would say it's speculative to conclude "patient 0" was specifically targeted.

Or am I misunderstanding what you're saying?

Mark Teicher said...

A really interesting benchmark tool that could be used to test either of the two theories presented would be Postal (http://www.coker.com.au/postal/). One can use one of the included utilities, postal-list, to simulate the "aardvarks to zebras" study, and then use the randomizer IP library to simulate random IP addresses for the source.