I mentioned seeds in some of the previous slides. So, let's talk about how you get these seeds, where do they come from and what do you do. Conceptually, when you see the pseudorandom number generator, you're going to pick a different seed every time. The problem is, these seeds are typically chosen within a fairly narrow range and thus are predictable. Maybe not exactly but it's going to be a small set. The best example of this occurred in the 90's when SSL, the Secure Socket Layer Protocol was just beginning. Nowadays of course SSL is defunct. Use TLS but back then SSL was seen as quite secure and it used pseudorandom numbers because for a variety of reasons. It would generate a seed in order to initialize the number generator. Well, a couple of grad students at Berkeley asked, well how the heck did they get the seed and they did some poking around. They found out it use the combination of the time of day, the process identification and the parent process identification. Now, the identification numbers for processes on this particular system were always between about maybe 100 or a 150 to 31 thousand to around 32 thousand. That's actually fairly small yet 32 thousand possibilities. Furthermore, the parent process ID is almost always less than the child process ID so that reduce the search space even more and of course with the time of day you simply look at when the message was sent and that gives you the time of a fairly narrow range. The bottom line is this range is easy to search and the grad students did it and they were able to break all of the Netscape's SSL. Netscape subsequently hardened the way the seed was selected. But this is a good example of one of the dangers of a pseudo-random number generator. I don't have to figure out from the list of previous elements what the next element will be. I just have to find the seed and once I get the seed, I've got everything. So, the next slide talks about some common seeding errors. You have to be careful here because many random number generators for example, by the way when I say random number generators, I'm including pseudo-random number generators and for this context, it would all be pseudo-random number generators. Anyway when you look at the output of pseudo-random number generator, it may say we produce 32 bits of output. Great. What's the seed? If the seed is selected for example from 256 numbers, you have two to the eighth bits of randomness 256 bits of randomness. You do not have two to the 32nd bits of randomness. So, it doesn't matter how big the internal state is. The question is how do you initialize it and the number of bits in what you initialize, is the number of bits of uncertainty in the random number generator. Therefore the maximum randomness you're going to get in the output. Using the current time is a bad idea because if I figure out when that was sent, I can just try things until I get your numbers. One thing that's sometimes is done is to take the current time and then hash it using a one-way hash function which we'll talk about in a bit. Then use that to drive the random number generator. Basically the attitude is so what. If I know you'd based it on time and I know you're hashing function I can reproduce exactly what you did. The other thing is to be careful of is when you're using time, look at how often the clock ticks. You will often see things saying, "Well, we require time to the nearest millisecond or the nearest nanosecond or something like that." If it's a quarter oscillator on your system, probably the oscillator increments the hardware counter for time every 60th of a second. So, even if you're recording time in milliseconds, you're only going to get instead of a thousand intervals per second, you're only going to have 60 because the clock only ticks once every 60th of a second. So, beware of that. Another error that I've seen which is highly amusing to me at least probably to you too is. You use the time as the seed and then in the header of the message you put the time the message was sent. Now, the two probably won't match exactly but they're going to be awful close and you can use that to leverage, to figure out what time that figure out your seed. So, in general that's an example where most people wouldn't think that they're disclosing the seed but they actually are. Now there are a couple of common fallacies with pseudo-random number generation. One of them is, all right I'll use a small c but I'm really going to use a complex algorithms so nobody will be able to figure it out. Well there are two problems with that. The number one thing in cryptography when you're doing cryptonalysis or rather of creating a cipher, is you assume the adversary knows everything about that cipher except the key. The second fallacy here, is that the manipulation adds uncertainty. It doesn't. If you take eight bits, I don't care how complex your manipulation is. If you don't bring in anything more at the end the uncertainty is simply going to be those eight bits. Also some manipulations look good but they turn out to be pretty poor. Donald Knuth is a classic example of this in his The Art of Computer Programming which I highly recommend to anyone doing any serious programming. He talks about random number generation extensively in volume two if I remember correctly and then in the introduction he talks about how he tried to write his own and if you look at the algorithm which he shows it is hideously complex. It also turns out to be very bad because the number start repeating themselves very quickly. So, just because it's complex doesn't mean it's going to be any good. Now, an alternative that some people have used is silicon a database or a book or something like that. The idea is for example if I'm generating a key or a seed by looking in a book, well, you're going to have to know which book and so on and so forth and the like. The problem is once they figure out the book, then the number of seeds is limited. So I can just go through and if I can find that the right one, right position in the book I will immediately find everything you've done including the random numbers you've generated. So, even if it's from a very large database. If I can figure out which element that a database you used and I presumably have access to the database, then I can regenerate your random numbers and I can read your messages or do whatever it is you're doing with those random numbers. So that's the bad. How do you do the good? First one is physical randomness. If you can use this to generate random numbers so much the better, if you can't you can only do a small number of them you can use this for the seed. Now, if you're working with a physical resource though, be sure that it's random. Sometimes unexpected events can change that. As an example, I mentioned the ambient noise in a room is sometimes very good for randomness. Well, someone was using that and people noticed the pattern which should never happen. He couldn't figure out why until he investigated the room and it turned out that outside the room a storey down was a generator used in a project on the floor below it. It turned out at certain times of day the generator was turned on and when it was turned on it made a sound like, "ka chunk." a very regular noise. So, when the microphone that was picking up the ambient noise was used, if it happened to be used during those times it will pick up the generator noise and that introduced enough non-randomness to create a problem. Another thing you can do if you don't want to use ambient noise or to avoid generators or things like that, you can use things about the user. Now, when users type, they'll break pauses between the keys that are also differences in how long they hold the key down especially if repeat is off on the keyboard. So, you can use those events. What's the time between keystrokes, how long the keystroke is down and so forth. If the user is a hunt and peck typist, then that will probably be pretty random. If the user is a trained typist it may not be. So, you'll have to worry about that a little bit. The other thing you can do is use system characteristics. A lot of data on the system changes very rapidly and in fairly unpredictable ways most of the time. For example if you take the output of the process listing of top or PS, add in all of the goodies like, you know, memory usage, CPU usage, state of the process, full name of the process, full environment lists that sort of thing, then you can get data that's often very hard to guess and since people can't recreate the state of the system presumably, they won't be able to recreate this. So, once you get all that information you cryptographic hash to hash it down to the number of bits you want and then use that. Physical sources are best if you can find them but if not this is a pretty good substitute. Now, physical sources where do you get them? Well, I mentioned microphones but most systems nowadays have two sources that are actually built in the devices. The /dev/random and /dev/urandom. /dev/random both of these return random bytes and what happens is when the kernel is running it generates random bits and puts them into a pool. When you read /dev/random it starts reading when the bytes from the pool. When it runs out of bytes and bits in the pool, in other words it's used all the ones that are there, it blocks until the kernel adds more and then it will unblock. /urandom is similar except that it never blocks if it runs out of bits in the pool, it starts using a pseudo-random number generator. For that reason, /dev/random is the one you want if you're doing anything cryptographic. You don't want /dev/urandom because if it runs out of bits in the pool, it starts using a pseudorandom number generator which means in essence from that point on, the bits you're getting are all pseudo-random and hence can probably be guessed.