I mentioned seeds in some
of the previous slides. So, let's talk about how
you get these seeds, where do they come from
and what do you do. Conceptually, when you see the pseudorandom
number generator, you're going to pick
a different seed every time. The problem is, these seeds are typically chosen within
a fairly narrow range and thus are predictable. Maybe not exactly but it's
going to be a small set. The best example of this
occurred in the 90's when SSL, the Secure Socket Layer Protocol
was just beginning. Nowadays of course
SSL is defunct. Use TLS but back then SSL
was seen as quite secure and it used pseudorandom numbers because for a variety of reasons. It would generate a seed in order to initialize
the number generator. Well, a couple of grad students
at Berkeley asked, well how the heck did
they get the seed and they did some poking around. They found out it use the
combination of the time of day, the process identification and the parent process
identification. Now, the identification numbers for processes on this
particular system were always between about maybe 100 or a 150 to 31 thousand to
around 32 thousand. That's actually fairly small yet 32 thousand possibilities. Furthermore, the parent process
ID is almost always less than the child
process ID so that reduce the search space even more and of course with the time
of day you simply look at when the message was
sent and that gives you the time of
a fairly narrow range. The bottom line is this range is easy to search and
the grad students did it and they were able to break all of the Netscape's SSL. Netscape subsequently
hardened the way the seed was selected. But this is a good
example of one of the dangers of a
pseudo-random number generator. I don't have to figure out from the list of previous elements what
the next element will be. I just have to find the seed and once I get the seed,
I've got everything. So, the next slide talks about some common seeding errors. You have to be
careful here because many random number
generators for example, by the way when I say
random number generators, I'm including pseudo-random number generators
and for this context, it would all be pseudo-random
number generators. Anyway when you
look at the output of pseudo-random
number generator, it may say we produce
32 bits of output. Great. What's the seed? If the seed is selected for
example from 256 numbers, you have two to
the eighth bits of randomness 256 bits
of randomness. You do not have two to
the 32nd bits of randomness. So, it doesn't matter how
big the internal state is. The question is how
do you initialize it and the number of bits
in what you initialize, is the number of bits
of uncertainty in the random number generator. Therefore the maximum randomness you're going to
get in the output. Using the current time is a bad idea because if I figure
out when that was sent, I can just try things
until I get your numbers. One thing that's sometimes is done is to take
the current time and then hash it using
a one-way hash function which we'll talk about in a bit. Then use that to drive
the random number generator. Basically the attitude
is so what. If I know you'd based it
on time and I know you're hashing function I can
reproduce exactly what you did. The other thing is
to be careful of is when you're using time, look at how often
the clock ticks. You will often see
things saying, "Well, we require time to
the nearest millisecond or the nearest nanosecond
or something like that." If it's a quarter oscillator
on your system, probably the
oscillator increments the hardware counter for
time every 60th of a second. So, even if you're recording
time in milliseconds, you're only going
to get instead of a thousand intervals per second, you're only going to have
60 because the clock only ticks once
every 60th of a second. So, beware of that. Another error that I've
seen which is highly amusing to me at least
probably to you too is. You use the time as the seed and then in the header
of the message you put the time
the message was sent. Now, the two probably won't match exactly
but they're going to be awful close and you
can use that to leverage, to figure out what time
that figure out your seed. So, in general that's
an example where most people wouldn't
think that they're disclosing the seed
but they actually are. Now there are a couple
of common fallacies with pseudo-random
number generation. One of them is, all right I'll use a small c
but I'm really going to use
a complex algorithms so nobody will be able
to figure it out. Well there are
two problems with that. The number one thing
in cryptography when you're doing cryptonalysis or rather of creating a cipher, is you assume the adversary knows everything about
that cipher except the key. The second fallacy here, is that the manipulation adds
uncertainty. It doesn't. If you take eight bits, I don't care how complex
your manipulation is. If you don't bring in
anything more at the end the uncertainty is simply
going to be those eight bits. Also some manipulations look good but they turn out
to be pretty poor. Donald Knuth is
a classic example of this in his The Art of
Computer Programming which I highly recommend to anyone doing any serious programming. He talks about
random number generation extensively in volume
two if I remember correctly and then in the introduction he talks about
how he tried to write his own and if you look at the algorithm which he shows
it is hideously complex. It also turns out to
be very bad because the number start repeating
themselves very quickly. So, just because it's complex doesn't mean it's
going to be any good. Now, an alternative that some
people have used is silicon a database or a book or
something like that. The idea is for example
if I'm generating a key or a seed by
looking in a book, well, you're going to
have to know which book and so on and so
forth and the like. The problem is once they
figure out the book, then the number of
seeds is limited. So I can just go
through and if I can find that the right one, right position in the book I will immediately find everything you've done including the random numbers
you've generated. So, even if it's from
a very large database. If I can figure out
which element that a database you used and I presumably
have access to the database, then I can regenerate
your random numbers and I can read your messages or do whatever it is you're doing
with those random numbers. So that's the bad. How
do you do the good? First one is physical randomness. If you can use this to generate random numbers so
much the better, if you can't you can only do a small number of them you
can use this for the seed. Now, if you're working with
a physical resource though, be sure that it's random. Sometimes unexpected events
can change that. As an example, I mentioned the ambient noise in a room is sometimes very
good for randomness. Well, someone was using that and people noticed the pattern
which should never happen. He couldn't figure
out why until he investigated the room
and it turned out that outside the room
a storey down was a generator used in a project
on the floor below it. It turned out at certain times
of day the generator was turned on and when it was turned on it made a sound
like, "ka chunk." a very regular noise. So, when the microphone that was picking up the
ambient noise was used, if it happened to be used
during those times it will pick up the
generator noise and that introduced enough non-randomness
to create a problem. Another thing you can do
if you don't want to use ambient noise or to avoid
generators or things like that, you can use things
about the user. Now, when users type, they'll break pauses between the keys that are
also differences in how long they hold the key down especially if repeat is
off on the keyboard. So, you can use those events. What's the time
between keystrokes, how long the keystroke
is down and so forth. If the user is a hunt
and peck typist, then that will probably
be pretty random. If the user is a trained
typist it may not be. So, you'll have to worry
about that a little bit. The other thing you can do is
use system characteristics. A lot of data on
the system changes very rapidly and in fairly unpredictable ways
most of the time. For example if you
take the output of the process listing of top or PS, add in all of the goodies like, you know, memory usage, CPU usage, state of the process, full name of the process, full environment lists
that sort of thing, then you can get data
that's often very hard to guess and since people can't recreate the state of
the system presumably, they won't be able
to recreate this. So, once you get all
that information you cryptographic hash to hash it down to the number of bits
you want and then use that. Physical sources are best
if you can find them but if not this is
a pretty good substitute. Now, physical sources
where do you get them? Well, I mentioned microphones
but most systems nowadays have two sources that are
actually built in the devices. The /dev/random and /dev/urandom. /dev/random both of these return random bytes and what happens is when the kernel is
running it generates random bits and puts
them into a pool. When you read /dev/random
it starts reading when the bytes from the pool. When it runs out of bytes
and bits in the pool, in other words it's used
all the ones that are there, it blocks until the kernel adds more and then
it will unblock. /urandom is similar
except that it never blocks if it runs out
of bits in the pool, it starts using a
pseudo-random number generator. For that reason,
/dev/random is the one you want if you're doing
anything cryptographic. You don't want /dev/urandom
because if it runs out of bits in the pool, it starts using a
pseudorandom number generator which means in essence
from that point on, the bits you're getting
are all pseudo-random and hence can probably be guessed.