Friday, July 04, 2014

XKeyScore: regex foo

For those of you rusty on your regex code, I thought I'd explain those found in the alleged XKeyScore source. The first one is:


The first thing to understand is the Perl wrapper around regexes, which is emulated in other areas like Snort. This wraps the regex with slashes, followed by modifiers, such as 'i' for case-insensitive. Thus, /robert/i matches my name in a case insensitive way. Thus, in the above example, we can remove the leading and trailing slashes as being unnecessary.


As you know, plain characters match themselves. Thus, this is matching all strings that start with the letters "bridge". Note that some modifiers/flags make regexes "line-oriented", such that this would only match at the start of the line. We can't know the default flags in this code, but it's probably that it's not line oriented. Thus, "9aosuidghfbridge" will cause a match to happen, because it just doesn't care about what characters appear before 6 characters 'bridge'.


The "backslash" is often an escape character (whose exact behavior can change). Here, "\s" means "match exactly one whitespace" character (tab, space, etc.). Whether it also matches 'newline' whitespace (\n) depends upon flags.


The parenthesis creates a subpattern. There are two subpatterns in this expression.

Subpatterns are used for two reason. One is because it's sometimes necessary, such as when tracking repeated patterns, or different alternate patterns. But, as you see here, it's not necessary. The pattern will match just the same with or without the parentheses.

The second reason for subpattern is capture of the contents. That's what you see later in the code on the line:

std::string address = bridges[i][0] + ":" + bridges[i][1];

The IP address is matched/captured in the first subpattern, the port number is matched/captured in the second subpattern. The parts of the regex that are not within the subpattern clauses are used for matching, but not capturing. By default, subpatterns don't capture -- it's one of those modifiers/flags that has to be set by the underlying system.


Let's look at the first subpattern. It's going to match an IPv4 address (like The entire clause here is [0-9]{1,3}. The part [0-9] means to match any digit. They could've said \d as well, or [9876543210], or any number of other techniques. The {1,3} means that the previous match has to appear at least once and no more than three times. Thus, the entire clause matches all three-digit numbers. This is sorta bad, because that means it can match things above 255. This reflects an attitude among the creators of the system that they really don't anticipate people trying to subvert the system with bad data, since non-IPv4 addresses are never going to appear by purely by accident.


Again we have an escape sequence. The '.' character is meaningful to (some) regex, matching any character (or sometimes any non-whitespace). But in this case, we want to match the actual character, so we escape it. Something like \x2e would always work.

At this point, we repeat this four times for four three digit numbers, with dots in between. This leaves the second clause.


The next clause is :?. At this point, the colon character ':' has no special meaning, it's just matching a colon. Bridge addresses are in the standard IP:port format, with a colon in between. Note that colon characters elsewhere are significant to regex, just not here.

The question mark means "zero or one" of the preceding object. This means the colon may exist, or it may not. In other words "skip a colon if there is one".


Now for the port number in the second capture clause. This again is just the same syntax as for numbers within the IP address, but this time it matches a number with a minimum of 2 digits and a maximum of 4. This seems strange, because that means neither 5 nor 55555 are valid ports as far as they are concerned, despite them being valid. That means bridge addresses without ports below 10 and above 9999 are safe from being included. (I don't know enough about Tor -- maybe this is a limitation within Tor itself and this wouldn't work).


The question mark here means the same thing as it mean in the :? clause. It means the previous match, the port number, is completely optional. In other words, bridges with only an IP address and no port number will match. That the port number is optional is why the colon is optional.


You are familiar with the [0-9] clause from matching digits above. When the circumflex ^ is put at inside these square brackets, it means "not". Thus, this matches everything that is not a digit. Otherwise, 10 digit numbers would cause the first 4 digits to match the port number and dropping everything else. It's a common technique to cause matching to stop.

Actually, now that I think about it, this is a bit odd that this would be included inside the capture pattern.

Also, the bridge line looks like the following, including a fingerprint:

bridge 4352e58420e68f5e40bf7c74faddccd9d1349413
It's a bit odd that this captures the IP address, the port number (plus the trailing space), but doesn't also capture the fingerprint.

So there you have it, a regex that captures the IP address and port.

The second regex is looking for Onion addresses anywhere in TCP streams. Onion addresses like like:


Like the previous regex, we are going to use captures of subpatterns, in this case the scheme (meaning "http://"), the address (meaning the 16 character random string), and the port.

The regex looks like:

^                                                          ^^

Like above, we see slashes used by Perl syntax to set flags. In this case, it's setting the 'c' flag, which is a bit odd. In Perl syntax, it's supposed to only be used if the 'g' flag is also set. Presumably this means that the caller of this code is setting the 'g' flag using the API rather than in the pattern. The 'g' modifier/flag, by the way, means to match multiple times. Thus, assuming its set in the code, it means to match all the Onion addresses in the TCP stream, not just the first one. I can't for the life of me imagine why 'c' is being set on its own. One reason, which you see a lot in Snort signatures, is that regex writers don't quite remember the various modifiers, and simply allows use them, whether they are needed or not.

^^^             ^^^^^^

This is the first match, for the scheme. If you'll remember from above, a parentheses mean "subpattern", which the system will capture. However, in this case, the parenthesese is followed by a question ? mark. This is a common syntax in order to set options for the subpattern. In this case, the option being set is the : operator, which means "don't capture". The reason we want a subpattern, but one that won't capture, it because this is optional, as indicates by the {0,1} clause at the end.

One fun thing to note is that the ?: text here means something totally different from the :? in our previous example. That's what makes regexes impossible to read.


Inside the non-capturing subpattern we have a capturing one. We see the familiar [a-z] syntax, similar to the [0-9] syntax above. In this case, it matches all lower case letters. Now, outside this regex modifiers/flags may have been set to make this case-insensitive, in which case upper-case will be matched as well.

The '+' behind the [a-z] means to match "one or more". It's similar to using the expression {1,}, meaning it has a lower bound of matching at least one character, but no upper bound on length. This can be used by attackers to cause very long strings to match, as I describe in my other XKeyScore posts, potentially several kilobytes of data can be inserted here.


The remaining parts of this subexpression are pretty obvious. The colon matches itself, a colon character, and is not a control character at this point. The backslash is the escape character, escaping the forward slash character, matched twice. Thus, this will match at least the expected strings of "http://" or "https://". 

^            ^

Again, we have another subpattern that will capture it's contents.

Again, we have the [brackets] syntax. Now it's matching not only letters but also digits. Because of the syntax of onion addresses, only the digits 2-7 are used, not 0, 1, 8, and 9. That's because to make them easy to type, because the letter 'O' can be confused with the digit '0', the letter 'l' with the number '1', and so on. The {16} means to match exactly 16 times, no more, and no less.


This matches the string ".onion", which all onion addresses must have (again, the . character must be escaped). This string isn't inside a capture group, so it's only matched, not captured.

^^^      ^^^^^^

Like above, we have an optional non-capturing subpattern marked by "(?:   ){0,1}". If it exists, it's captured, but it's allowed not to exist.


Again, we see the joy of regex. The first colon is syntax, meaningfully saying "don't capture this subpattern". The second colon is just a normal pattern character, the colon that exists before the port number in "http://o8a7sgdfoaiusgdf.onion:443".


This is a capture subexpression for capturing one or more digits (\d means 'digit', + means one-or-more). Note that this is unbounded, so it can capture a lot of digits. Or, one could send a megabyte of digits at this point in order to overload the system.

A fun thing to note here how optional port numbers are handled in this regex completely different from our first regex, which used the string :?([0-9]{2,4}?[^0-9]). This is one of the magic things of regex that so many ways are possible. It also fingerprints the coders as different people. Or, alternatively, it fingerprints some source on the Internet (such as Snort rules) where they copied the regex.

The XKeyScore code that uses these regexes probably is using the PCRE library. There are many different types of regex libraries that are possible, but PCRE has pretty much become the standard library.

PCRE has a lot of options that you can set in your C/C++ code that are also settable within the regexes themselves, like case-insensitivity, line-mode, UTF8 parsing, and so on. We can only guess what those settings might be. If UTF8 is set, for example, then we can probably play more tricks with that extra character it captures for the bridge port.

These are the only two regexes I find here. Regex is dirty and nasty, so I hope you had fun deciphering them.

1 comment:

michiexile said...

I'm noticing that the first regex handling port numbers will do weird stuff with bad input: e.g. it will match 999.999.999.9999999 as IP 999.999.999.999 and port 9999 since the : was optional, whereas the second one actually does only match on port numbers if they show up with a :.