Friday, October 26, 2012

A question for crypto/math peoples

I have a problem that you math/crypto types can probably easily answer. I want to create a one-to-one mapping from an input n bits long to a pseudorandom output of n bits.

In other words, consider n being 8-bits. I want a function that for every input number from [0..255] produces a random output number between [0..255], with each output number produced only once. In other words, it's a one-to-one mapping of 8 input bits to 8 output bits.

The traditional linear congruent generator PRNG used for rand() seems like an excellent choice. For n=32, I could use a function like:

unsigned translate32(uint32_t x)
{
    return (x * 214013 + 2531011) & 0xFFFFFFFF;
}

These constants are chosen to produce all 4 billion possible outputs. meaning there is a one-to-one mapping between inputs and outputs.

However, I can't create an 8 bit version by doing:

#define translate8(x) translate32(x & 0xFF)& 0xFF

That's because while the constants work for 32 bits, they don't work for 8. Some outputs will be produced twice, and some outputs won't be produced at all. To make a translate8() function, I'll have to choose different constants.

How do I go about choosing the right constants? Do I simply write a program that tests a bunch of combinations until I get some that work the way I want? The thing is, I want this to work all the way out to 48-bit inputs. I doubt I could discover the right constants for 48 bits, it's not computationally feasible, unless there's some trick to it. (I really should study my Knuth better).

Or, maybe I can do this for small n and simply combine them. For a 48-bit number, simply combine a 32-bit version with a 16-bit version.


Why do I want to do this? The answer is port scanning. I want to scan a subnet, but instead of linearly scanning the addresses [192.168.1.1, 192.168.1.2, 192.168.1.3, ...] I want to scan in a random fashion of [192.168.1.148, 192.168.1.15, 192.168.1.231, ...]. I want to do the same for port numbers. Let's say that I'm scanning all ports on that subnet. This means the output is a 24-bit number: 8-bits for the subnet and 16-bits for the port number.

By the fact that I said "48-bits" should hint that what I really want to do is use this for scanning the entire Internet. I want anybody seeing incoming IP/port combinations to see them in a random order and random timing between them.

A pseudo-randomly congruent function means I can take linear scanning code that thinks it's trying addresses/ports in a sequential fashion and produce a random sequence out of it.




Update:

BAH! As many people point out "lrn2readWikipedia". The solution is at https://en.wikipedia.org/wiki/Linear_congruential_generator#Period_length. Somehow I'd read that, but totally missed it. My only excuse is that I'm a programming nerd and not a theory/math nerd, and thus miss obvious things like that.

The equation is:
unsigned translate(unsigned x, unsigned a, unsigned c, unsigned n)
{
    unsigned m = (1 << (n-1));
    return (x * a + c) % m;
}

Where 'x' is the original value, and the return result is my new pseudo-random translation, a and c and m are the constants in the linear-congruent-generator formula, but m is just derived from my n, the number of bits in the IPv4 CIDR suffix.

Translating this into English, these three rules for choosing the constant:


  • The constant c must be relatively prime with m, meaning it has no prime factors in common. Well, since m is always a power of 2 in my example, that's easy. It must means that c must be an odd number. More generally, just always choosing a prime number for c will work even for IPv4 address ranges that aren't CIDR blocks.


  • The value a-1 is the opposite of relatively-prime, meaning it's divisible by all prime factors of m. Again, that's easy, as m is just a power of 2 in my example. Therefore, the value a-1 must be an even number, or again, a must only be an odd number. For IPv4 address ranges that aren't CIDR blocks (where m isn't a power of 2), then we'll have to first enumerate the prime factors of  m, then divise an a built from the prime factors of m.


  • The value a-1 must also be divisible by 4. That's easy to do, too.

  • This means that my original statement is false about the translate8() function above: it actually does work, at least in terms of producing a 1-to-1 correspondence. It'll do that for all values of n between 1 and 32 bits. I find that counterintuitive, so but it's true. I've written code to test this assertion, and it works.

    But, the numbers aren't very random. Testing it out, I find that the first few numbers going from 0 to 10 translate to the sequence:

      0 -> 195
      1 -> 192
      2 -> 189
      3 -> 186
      4 -> 183
      5 -> 180
      6 -> 177
      7 -> 174
      8 -> 171
      9 -> 168

    Therefore, for good randomness, I need to choose better constants. Choosing a=993319 and c=193939 results in:

      0 -> 113
      1 -> 152
      2 -> 191
      3 -> 230
      4 ->  13
      5 ->  52
      6 ->  91
      7 -> 130
      8 -> 169
      9 -> 208

    Looking on the longer list, it oscillates from 3 digit numbers for a bit, then 2 digit numbers for a bit, and back to 3 digit numbers, so it's not completely random as I can see a pattern there. This is random enough for my purposes, but ideally, I'd like to choose numbers with a bit better randomness.

    So, what's are some good tests of randomness I can apply? Ideally, I'd like to choose a single set of 64-bit sized constants that are "sufficiently random" for any sized n from 1 to 48 bits. In particular, when scanning the entire Internet (all 4-billion targets), I'd like every Class C subnet to get its share of the traffic (1/16-millionth), rather than having some Class C subnets hit very hard with traffic for a while while others are inactive for a bit.










    8 comments:

    Andrew Yeomans said...

    Check out card-shuffling algorithms, such as Fisher-Yates shuffle http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle. In this case you have a deck of 256 cards (or port numbers) which you want to shuffle to a different order.

    martijn said...

    I'm not sure if Fisher-Yates shuffling would work if you have roughly 32 trillion numbers - that's how many 48-bit numbers there are.

    It works for 8 bit of course.

    In your example with 214013 and 2531011, isn't the only property of 214013 you need that it divides 2^32+1, in other words there is a number z such that z*214013 equals 1 mod 2^32?

    In that case *any* odd number would have worked, and for 2531011 you could have taken any number.

    It is all very non-random really, but I guess all you need is for it to be opaque enough so that it appears random to anyone seeing the numbers. By choosing the odd number fairly large this criterion is fulfilled.

    Jess said...

    Citing Knuth, wikipedia has that an LCG defined:

    x_n+1 = a * x_n + c (mod m)

    has a full period when c != 0 iff:

    c and m relatively prime
    a - 1 divisible by all prime factors of m
    a - 1 divisible by 4 if m divisible by 4

    It seems pretty easy to generate parameters with these properties. It would be harder to generate nice-looking parameters, but I don't think you need crypto-strength randomness for this application.

    martijn said...

    Sorry, when I say "divides 2^32+1" I mean it divides that number MODULO 2^32. In other words, it has a (multiplicative) inverse modulo 32.

    And ^ means to the power of.

    martijn said...

    Ah, forget all I said, I misread your definition of translate32 - I totally missed the LGN bit. Sorry.

    Jess has the right answer.

    martijn said...

    "So, what's are some good tests of randomness I can apply"

    (Preudo-)randomness is a statistical property stating things that on average one tenth of the numbers in any block ends on a 1 (in decimal notation), etc. etc. It does not mean that in any block of length ten, exactly one number ends on a 1.

    It's easy to confuse randomness with "evenly spread out in a non-obvious way". If, for instance, you'd ask people to order numbers "in a random way", they're more likely to do the latter.

    I'm not sure if for what you intend to do, it matters whether it's really random or just evenly spread out. The final paragraph of your update suggests the latter is what you're really interested in. I guess looking at the numbers it generates and making sure there isn't an obvious (and, most importantly, predictable) pattern.

    Btw, when you choose 993319 and 193939, aren't you computing mod 256? So is there a reason why you don't choose 39 and 147 (which are these respective numbers mod 256) in the first place?

    Anonymous said...

    http://en.wikipedia.org/wiki/Linear_feedback_shift_register

    Unknown said...

    The developer preferred that I did not release the name of the software Security Testing. However, he has reassured me that most of the items have since been fixed which is great. We've spoken, and i've decided to do another round of testing in the next release, which will involve looking at the vulnerabilities already identified, and a new round of testing, hopefully covering more advanced vulnerabilities.