Saturday, October 27, 2012

Randomizing port scans, part two

I want to randomize port scans for large-scale scanning of the Internet, so I apply a little theoretical math to the problem. The solution I want is something like a "linear congruential generator" like that used in 'rand()'. Using this, I can take an in-order range of IP addresses and produce a randomized order  for the output. While the order is randomized, there are no duplicate or missing IP addresses in the output.

The idea of using an LCG like 'rand()' is nothing new. Internet worms of the last decade frequently did this in order to randomly select targets. This is one of the famous things about the Witty worm, for example. Instead taking a 32-bit output from the LCG, which would've traversed every IP address, it combined two 16-bit outputs, which actually duplicate some IPs while missing some others. My guess is that they didn't understand this property of LCGs.

I'm hazy about LCGs too, so I wrote some code this morning to help me out. Given a range, an LCG needs two parameters (an 'increment' and 'multiplier') specific to that range so that it'll evenly cover it (without the gaps you saw in Witty). Therefore, for my port scanner, I need to write a function that takes the size of the range and generates these constants for me.

Note that by "range" I mean "all output". For example, I might want to scan 10.0.0.0/24 and 192.168.0.0/23. This is actually three Class C ranges, or 768 addresses. Therefore, my range is 768. I then convert numbers from [0..767] to a randomized order [0..767], then from the resulting index figure out the appropriate address in 10.0.0.x, 192.168.0.x, or 192.168.1.x.

I wrote a program "lgctest" to help me figure things out. It's at http://pastebin.com/U5j45dqU. It's probably not the most efficient way of doing things, but it works.

The following is an example of running this for a range of 100:

$ ./lcgtest 100
elapsed     = 0.000-seconds
factors     = 2 5 
m           = 100                      (0x64)
a           = 21                       (0x15)
c           = 29752039459              (0x6ed5c1823)
c%m         = 59                       (0x3b)
verify      = success
 59  98  17  16  95  54  93  12  11  90  49  88   7   6  85  44  83   2 
  1  80  39  78  97  96  75  34  73  92  91  70  29  68  87  86  65  24 
 63  82  81  60  19  58  77  76  55  14  53  72  71  50   9  48  67  66 
 45   4  43  62  61  40  99  38  57  56  35  94  33  52  51  30  89  28 
 47  46  25  84  23  42  41  20  79  18  37  36  15  74  13  32  31  10 
 69   8  27  26   5  64   3  22  21   0 

This output suggests that a rand() function should have the constants a=21 and c=59 (or c=29752039459, the two are equivalent for this sized range). It seems roughly random. Note the "verify" line in the output, I actually go through all 100 inputs and make sure there's a 1-to-1 correspondance with outputs.

If I don't like how this output looks, I can play with the output. I can scale 'a' to make it bigger, and I can choose a new 'c'. This might look like:

$ ./lgctest 100 -a 5 -c 324578
elapsed     = 0.000-seconds
factors     = 2 5 
m           = 100                      (0x64)
a           = 101                      (0x65)
c           = 324579                   (0x4f3e3)
c%m         = 79                       (0x4f)
verify      = success
 79  58  37  16  95  74  53  32  11  90  69  48  27   6  85  64  43  22 
  1  80  59  38  17  96  75  54  33  12  91  70  49  28   7  86  65  44 
 23   2  81  60  39  18  97  76  55  34  13  92  71  50  29   8  87  66 
 45  24   3  82  61  40  19  98  77  56  35  14  93  72  51  30   9  88 
 67  46  25   4  83  62  41  20  99  78  57  36  15  94  73  52  31  10 
 89  68  47  26   5  84  63  42  21   0 

This suggests a=100 and c =79 as the result.

I can do the same for large. Internet-scale ranges 48 bits in size (32-bits for all IPv4 address plus 16-bits for all TCP ports):

$ ./lcgtest 0xFFFF11223344
elapsed     = 0.111-seconds
factors     = 2 5 7 17 120779 
m           = 281470969197380          (0xffff11223344)
a           = 287454021                (0x11223345)
c           = 29752039459              (0x6ed5c1823)
c%m         = 29752039459              (0x6ed5c1823)
    29752039459 129477099060178 138115776158885 158255256164732 245075681732595 
 63101336822486  36056258290077  25571150768180  87087842370691 225796371129958 
 94883588791729 171894129574160 124229081887331   6275381503570 236992335315477 
230599320844148 100962070283779 260982239402670  57997427109973 128277349806064 
182393949816195 210798294566662 115893215610477 215841374552996  76470511275027 
183713610051950 144301551524457 140100706553148 126928985298519 111701253346686 
 68030510696885 143976996651744 121355664207475 270576512989058  79943559536981 
263463695922080  28021267093619 214812319847162 147754892116809  55705224523516 
173095668627947  36132003754514 214742973872345 123657560037808  25091829930571 
107257802788594 119823885532977 175234342767144 261815118019623 188902199494218 
271785087019749 136802464357228  31061667861631 178659776996566  22335078281081 
 88369367982352  95159876269879  63638911342906  19183467015689 247530054018580 
 48329457485007 233264902567858 119421643902793  91760833132132  60341311614927 
258658908477106  79848592801365 136562661568000 220374733910891  43981247648086 
 76866447960905 258330261720912  99923749379611   4606674052998 277040488536321 
 73889369740808 207063069403391   8310519823914 270793102152909 154411618681844 
163920610154647 175790443134582 162257446416457  89626474177888  90834012239551 
150515597826250 105252230322289 194041561786088 228144226407979 275169639739278 
281155901215805 273009672383288 102381780601523 171304091674862 246112192874857 
273325625747816  24005810071231 152379449320186 149253710437461 140260353097400 

Note the "elapsed". It took my MacBook Air 0.11 seconds of compute to do this calculation. My Atom netbook took even longer, almost 2 seconds. It takes that long to run the "Sieve of Eratosthenes" for the underlying prime factorization. Instead of calculating primes on the fly I could include a megabyte file with my code, but I don't want to do that, either. If I knew my maths better, I could probably find a more efficient algorithm.

These outputs look random to me, so it looks like I can just plug in my code into a port scanner and have easy randomization. Then I ran this, using '101' as the input instead of '100':

$ ./lgctest 101
elapsed     = 0.000-seconds
factors     = 101 
m           = 101                      (0x65)
a           = 102                      (0x66)
c           = 29752039458              (0x6ed5c1822)
c%m         = 10                       (0xa)
verify      = success
 10  20  30  40  50  60  70  80  90 100   9  19  29  39  49  59  69  79 
 89  99   8  18  28  38  48  58  68  78  88  98   7  17  27  37  47  57 
 67  77  87  97   6  16  26  36  46  56  66  76  86  96   5  15  25  35 
 45  55  65  75  85  95   4  14  24  34  44  54  64  74  84  94   3  13 
 23  33  43  53  63  73  83  93   2  12  22  32  42  52  62  72  82  92 
  1  11  21  31  41  51  61  71  81  91 

It works as I wanted in that there is a 1-to-1 relation between inputs and outputs, but the outputs aren't too random.

So I change the parameters a bit, and I get something visually looking more random:

$ ./lgctest 101 -a 5 -c 324578
elapsed     = 0.000-seconds
factors     = 101 
m           = 101                      (0x65)
a           = 506                      (0x1fa)
c           = 324578                   (0x4f3e2)
c%m         = 65                       (0x41)
verify      = success
 65  29  94  58  22  87  51  15  80  44   8  73  37   1  66  30  95  59 
 23  88  52  16  81  45   9  74  38   2  67  31  96  60  24  89  53  17 
 82  46  10  75  39   3  68  32  97  61  25  90  54  18  83  47  11  76 
 40   4  69  33  98  62  26  91  55  19  84  48  12  77  41   5  70  34 
 99  63  27  92  56  20  85  49  13  78  42   6  71  35 100  64  28  93 
 57  21  86  50  14  79  43   7  72  36 

So the new parameters make this look more randomy. Now I need to figure out how to do this programmatically, to test for randomness in the output, and then change the parameters until I get something acceptably random.

Once I figure this out (regenerations constants to achieve randomness), I'll update this post with the solution.

9 comments:

Unknown said...

You can just use XOR, if you're not terribly picky about how "random" your random is.

martijn said...

Your second output is very non-random.

Like you can substitute its equivalent mod a for c (i.e. you can use c%m instead of c), the same holds true for a. So instead of a=101, you get the same results if you use a=101%100=1. So your LGM function is x(n+1)=x(n)+79.

So the output you get are just the last two digits of the multiples of 79.

It's the same problem as you get in your fourth example, with (effectively) m=101, a=1 and c=10, except this is more obvious.

Robert Graham said...

Thanks Martijn! I didn't realize that a%m applied, though in hindsight, that's obvious.

Robert Graham said...

Ryan, XOR doesn't quite work because my ranges aren't powers of 2. In other words, if you XOR something with a number [0..99] you'll get something outside that range.

Though, most of the times the lower part of the range is indeed a power of 2. In other words, I might scan 3 Class C networks (a range of 768 addresses). I can therefore use the LCG to determine an index [0..767], and then XOR the lower 8 bits with 0xA3 (a constant pulled out of thing air).

Jerry said...

You're making this way too complicated, without defining a threat model. Suppose you want to scan a range of ports from m to n. If you're worried that it's too obvious if you increment by 1, then pick any
k < n - m + 1 and relatively prime to n - m + 1 use next(j) = m + (j + k mod n - m + 1). If that's not good enough ... why? Are you worried about an intrusion detector that will notice linear scans? Then why not an intrusion detector that will notice an LCG - the parameters of an LCG can be read out from two consecutive outputs.

If you're really concerned about this, there are published algorithms going back years for how to apply an encryption algorithm in order to get a mapping from a specific input range of numbers back to itself. See http://csrc.nist.gov/groups/ST/toolkit/BCM/documents/proposedmodes/ffsem/ffsem-spec.pdf for an example.

Anonymous said...

You don't search for an obsolut randmonesse. I used some time ago a little trick:

Make a table of all values, for example, 100 elements.
Choose one, place it on a new table as firstelement. Extract this one.
You have now a 99 elements table. Choose one, place it in second place, etc..
After 100 rounds, you have a shuffled table of your 100 elements.

martijn said...

Regarding that last comment: that works fine if you have a small set. (And even then it's rather inefficient.) It won't work if you have 32 trillion IP-port combinations...

Unknown said...

hi :)
thanks for sharing, you have a very nice and useful blog.
btw, I also have a blog and a web directory, would you like to exchange links? let me know on emily.kovacs14@gmail.com

Jess said...

I'm not sure if anyone besides Emily the SEO Spammer cares about this issue still, but have you considered Fibonacci hashing? Described here, it essentially is LCG, except the multiplier is easily derived from the size of your domain. It's certainly not random, but it picks out an always-evenly-distributed and progressively denser set. I think that's what you wanted?