Friday, July 04, 2014

Validating XKeyScore code

The burning questions about the XKeyScore “source code” are whether it’s real and whether it comes from Snowden. The Grugq (@thegrugq) has some smart insight into this, and I have my own expertise with deep-packet-inspection code. I thought I’d write up our expert analysis of these questions.

TL;DR: we believe the code is partly fake and that it came from the Snowden treasure trove.

A slightly longer summary is:
  1. The signatures are old (2011 to 2012), so it fits within the Snowden timeframe, and is unlikely to be a recent leak.
  2. The code is weird, as if it were made of snippets combined from training manuals rather than operational code (like this bit). That would mean it is “fake”.
  3. The story makes claims about the source that are verifiably false, leading us to believe that its authors may have falsified the origin of this source code.
  4. The code is so domain specific that it probably is, in some fashion, related to real XKeyScore code – if fake, it's not completely so.


As this post to the Tor developer mailing list describes, the signatures in the code are old. The earliest date this file can be valid is 2011-08-08, when the Linux Journal reported on TAILS. The latest date might be 2012-09-21, just before a new server was added to Tor that isn't in the XKeyScore list. Since this is shortly before Snowden first tried to contact Greenwald, the dates sync up.
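The dating argument can be sketched in a few lines: a snapshot of a list can be no older than the newest thing it mentions and no newer than the last date before something it lacks appeared. The two dates are the ones given above; the function and parameter names are illustrative assumptions, not anything taken from the leaked file.

```python
from datetime import date

def validity_window(not_before, not_after):
    """Bound when a snapshot could have been taken.

    not_before: dates the file cannot predate (items it mentions).
    not_after:  dates the file cannot postdate (last day before a
                missing item appeared).
    """
    earliest = max(not_before)  # must postdate everything it mentions
    latest = min(not_after)     # must predate everything it lacks
    if earliest > latest:
        raise ValueError("inconsistent evidence: no valid window")
    return earliest, latest

# Dates from the text: the Linux Journal article on TAILS, and the day
# just before a Tor server missing from the list was added.
earliest, latest = validity_window(
    not_before=[date(2011, 8, 8)],
    not_after=[date(2012, 9, 21)],
)
window_days = (latest - earliest).days
print(earliest, latest, window_days)
```

Adding more dated evidence to either list can only narrow the window, which is why every extra dated signature helps.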

Likewise, the bridges info is over a year out of date, again pointing to an old leak in the Snowden timeframe rather than a new leak.

As many have commented, it looks like disjoint snippets pulled from many files. The code references variables (like $tor_directory) that are missing from this file. Many (myself included) assumed that these snippets were pulled from source files (text files ending in an extension like .xks). However, in a Twitter discussion today among myself, @0xabad1dea, and mostly @thegrugq, we came to the conclusion that they probably come from document files (.ppt, .pdf, .doc, etc.). These document files could be training manuals designed for analysts and engineers, PowerPoints designed to impress others in the intelligence community with how advanced the system is, or a document explaining how the NSA was dealing with the Tor threat. An example would be this presentation on monitoring images from cellphones that contains a small bit of code.
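The "missing variable" observation is easy to check mechanically. Below is a minimal sketch, assuming variables are written as $name and defined with a single "=". The sample text is hypothetical: only the two variable names come from this post, and the rule bodies are made up for illustration.

```python
import re

def undefined_variables(source):
    """Return names that are referenced as $name but never assigned."""
    # An assignment looks like "$name=" or "$name =" (but not "$name==").
    defined = set(re.findall(r"\$(\w+)\s*=(?!=)", source))
    referenced = set(re.findall(r"\$(\w+)", source))
    return sorted(referenced - defined)

# Hypothetical snippet, not the real file contents.
sample = """
$TAILS_terms = word('tails');
rule_one = $TAILS_terms;
rule_two = $tor_directory and something_else;
"""

missing = undefined_variables(sample)
print(missing)
```

A file assembled from excerpts would be expected to produce a non-empty list here, while a complete source tree would not.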

The filename xkeyscorerules100.txt is implausible. Source files do not end in ".txt" and the term "rules" is an odd choice. [Though, as people point out, changing a file extension to .txt may be done just so the hosting webserver delivers the right content-type of text/plain].

In other words, instead of being real operational code running in the field, there is a good chance that these are just samples scattered around various documents within the Snowden trove. That would explain why Bruce Schneier, who has seen Snowden docs, believes there is a second leaker: he doesn't remember seeing this source.

That would also explain the comments, like those mentioning how extremists use TAILS. These comments are unlikely to appear that way in real source files. However, they are precisely the sort of comments you'd expect in a training manual describing how to write XKeyScore fingerprints.

This would also explain why two different regexes in the file use two different techniques for capturing port numbers, and why two different snippets of C++ code use two different techniques for inserting data into a database.

As a deep-packet-inspection (DPI) expert, I can confirm that this code is too "real" to be completely fake. It could be fake in the sense that it's training-manual code or a prototype, but it's definitely related to XKeyScore somehow. If it's completely fake, then only another expert in DPI could've faked it. I just don't think a non-expert is smart enough to fake it this completely.

The original press story makes willful misrepresentations, such as claiming those servers are under surveillance. This isn't true: it's unlikely the NSA has a fulltake sensor monitoring all traffic in/out of the servers. Instead, it has fulltake sensors elsewhere in the world (like Iraq) that capture all sessions, and this code simply annotates/indexes which sessions belong to those servers.
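The distinction between capturing traffic and annotating it can be illustrated with a short sketch. This is an assumption about the general shape of an indexing pass, not XKeyScore's actual design; the Session type, the tag name, and the addresses (reserved documentation ranges from RFC 5737) are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    src: str
    dst: str
    tags: list = field(default_factory=list)

def annotate(sessions, listed_servers, tag):
    """Tag sessions in which either endpoint is a listed server.

    Nothing is captured here: the sessions were already collected
    elsewhere, and this pass only labels them for later lookup.
    """
    index = []
    for s in sessions:
        if s.src in listed_servers or s.dst in listed_servers:
            s.tags.append(tag)
            index.append(s)
    return index

# Hypothetical data: one session touches the listed server, one does not.
servers = {"192.0.2.10"}
captured = [
    Session("198.51.100.7", "192.0.2.10"),
    Session("198.51.100.7", "203.0.113.5"),
]
matches = annotate(captured, servers, "example/server")
print(len(matches), captured[0].tags, captured[1].tags)
```

The server list only decides which already-captured sessions get a label; removing an address from the list changes the index, not what the sensor sees.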

Another misrepresentation in the story is that the source calls the Linux Journal an extremist forum. That's not true.

A comment does say that TAILS is "a comsec mechanism advocated by extremists on extremist forums". This is true, as the picture on the right (from the Grugq) demonstrates: it's a picture from an ISIS/jihadi forum advocating the use of TAILS. But nowhere does it claim that the Linux Journal is one of those extremist forums -- that's something willfully made up by the authors of the story. In other words, they interpret "extremists use TAILS" as meaning "only extremists use TAILS". It's obvious that's not what's meant -- other leaked Snowden documents acknowledge Tor has benign uses, such as protecting dissidents in China.

That the story already misrepresents the meaning of this source code hints that it may already be misrepresenting the provenance.


We believe the file was faked in some fashion. The missing global variables are proof of this.

It could simply be that the snippets were pulled from legitimate source files, pulling together all the pieces that relate to Tor.

Or, it could be majorly faked, in that this isn't operational XKeyScore code at all, but just examples or exercises pulled from training manuals.

But, it's unlikely to be completely fake -- because to fake it to this level, you'd need actual prototype code that would serve XKeyScore needs in the first place.

The story already misrepresents the source; its authors may have misrepresented the validity of this code as well.

We therefore know that, at best, the source code has been altered in some fashion and that, at worst, it is still related to XKeyScore in some fashion, even if it's not operational source code.


The accusation that the journalists willfully misrepresented things is a strong one, so I've copied the text below. The original story starts with the following bullet point:
It also records details about visits to a popular internet journal for Linux operating system users called "the Linux Journal - the Original Magazine of the Linux Community", and calls it an "extremist forum".
The relevant source code says this (bold added by me):

These variables define terms and websites relating to the TAILs (The Amnesic
Incognito Live System) software program, a comsec mechanism advocated by
extremists on extremist forums.

$TAILS_terms=word('tails' or 'Amnesiac Incognito Live System') and word('linux'
or ' USB ' or ' CD ' or 'secure desktop' or ' IRC ' or 'truecrypt' or ' tor ');
$TAILS_websites=('') or ('*');

As you can see, the source is not calling the Linux Journal an extremist forum; the two aren't related. Saying "extremists advocate TAILS" does not imply the NSA believes "only extremists use TAILS".
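Read as plain Boolean logic, the quoted $TAILS_terms definition is just a two-part keyword match. Here is a rough sketch, assuming word() behaves like a case-insensitive substring test; the real semantics of word() are not known from this file, and the function names below are mine.

```python
def contains_any(text, terms):
    """True if any term occurs in text (case-insensitive substring match)."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in terms)

# Term lists copied from the quoted $TAILS_terms definition.
GROUP_A = ["tails", "Amnesiac Incognito Live System"]
GROUP_B = ["linux", " USB ", " CD ", "secure desktop",
           " IRC ", "truecrypt", " tor "]

def tails_terms(text):
    # Both groups must match, mirroring word(...) and word(...).
    return contains_any(text, GROUP_A) and contains_any(text, GROUP_B)

print(tails_terms("Boot Tails from a USB stick"))
print(tails_terms("The comet has two tails"))
print(tails_terms("Linux kernel release notes"))
```

Under these assumptions the rule labels text that mentions TAILS alongside a related term. Nothing in it assigns a category such as "extremist" to any website.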

Also, the code is no more tracking Tor users than this code tracks camera users. This is purely an indexing function.


Gary Myers said...

If it is from documentation, then the leak may have been later as the document may not be as actively maintained as source code.

DarkIye said...

I shall be keeping an eye on how much longer this story is accepted as fact.

skyemarielopez said...

the contents are very informative and useful. i'll be sharing this to my "comrades"...