Howard and those that write the software



muttley
01-31-2003, 12:05 AM
Hello Howard and those that write the software.

I am writing in regards to some questions that have been posted in some of the 3 forums that Team Anandtech has.

The reason that I am bringing this up is for quality assurance.

Some of my younger, more inquisitive teammates suggested that after they erased their directory and reinstalled the program, they seemed to get a few more low values for the protein at the beginning.

Now, the reason I mention "younger" is that I am in my 40s and started in electronics at age 13, in the vacuum tube era.

My question and observation concern how the software generates its random numbers: what is your 'random generator' tied to?

Your results may depend on a 'random generator' that is tied to the hardware in the computer, and the random generator in the computer is not truly random.
As an example, I worked at one time in Nevada on slot machines. These slot machines had to have a separate hardware random generator. The Nevada Gaming Commission regulated, inspected, and approved all things. (Tough bunch to please.) (FYI, for those reading for other reasons: reels on slot machines reuse/re-choose the same picture, which makes it possible to produce odds of 16+ million to one; higher odds are also possible. The reels were popular because, during a transition between age groups, players weren't ready for video screens. They are not random spins now, but computer-controlled motors and optic sensors.)

I also at one time had a program written in BASIC that filled in the screen with smiley faces (the old reverse-color smiley). The results played out the same on every different computer, and as I recall the last spot to be filled in was the fourth spot from the top right.

You have accumulated so much data that I would hate for data collection to go on and then find out that an imperfect random generator had skewed the results.

best regards,
muttley

bwkaz
01-31-2003, 09:00 AM
Originally posted by muttley
> I also at one time had a program written in BASIC that filled in the screen with smiley faces (the old reverse-color smiley). The results played out the same on every different computer, and as I recall the last spot to be filled in was the fourth spot from the top right.

Let me guess -- your program was randomly choosing spots to print a smilie character?

Note that the reason it worked the same on every computer was that you weren't using a true random number generator. Rather, it was a pseudorandom number generator, probably using a linear congruential recurrence (num_n = a*num_(n-1) + b (mod m), where a, b, and m are parameters that govern how evenly-distributed the random numbers are). This kind of recurrence needs a seed value for num_0.

If the seed value gets set to the same thing on two separate runs of a program, then the same sequence of random numbers will always result.

This is very likely what your program did. I remember back when I was writing QBasic stuff, that if you did a Randomize Timer (to set the seed to the current value of the system timer), then you'd get different sequences. If you don't call Randomize at all, then you'll always get the same sequence.
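A minimal Python sketch of the recurrence described above, to show why an unchanged seed replays the same sequence. The constants a, b, and m are illustrative only (not the ones QBasic used), and the function names are hypothetical.

```python
def lcg(seed, a=1103515245, b=12345, m=2**31):
    """Yield num_n = (a * num_(n-1) + b) mod m, starting from num_0 = seed."""
    num = seed
    while True:
        num = (a * num + b) % m
        yield num


def first_values(seed, count=5):
    """Return the first `count` numbers produced from the given seed."""
    gen = lcg(seed)
    return [next(gen) for _ in range(count)]


run_1 = first_values(seed=42)
run_2 = first_values(seed=42)   # same seed, like never calling Randomize
run_3 = first_values(seed=43)   # different seed, like Randomize Timer

print(run_1 == run_2)  # True: identical sequence on every run
print(run_1 == run_3)  # False: a new seed gives a new sequence
```

A program that picks screen positions from such a sequence will fill them in the same order on every machine as long as the seed never changes.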

DF, on the other hand, uses the C library's rand() function, which acts much the same way. It requires a seed as well. What they use for the seed, though, is a combination of current system time when the program is started, and the program's process ID. That way, even if you start 2 instances at the same time, they'll have different PIDs, and different random seeds.
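As a rough illustration of seeding from the start time and the process ID, here is a Python sketch. How the two values are actually combined in the client is not stated in this thread, so `make_seed` and its XOR-and-shift mixing are assumptions made for the example.

```python
import os
import time


def make_seed(now_seconds, pid):
    """Hypothetical mixing step: fold a whole-second timestamp and a PID
    into a single 32-bit seed. PID bits above the low 16 are lost here."""
    return (now_seconds ^ (pid << 16)) & 0xFFFFFFFF


# Seed for this process, using the real clock and the real PID.
print(make_seed(int(time.time()), os.getpid()))

# Two instances started in the same second but with different PIDs.
same_second = 1044000000
print(make_seed(same_second, 4000) != make_seed(same_second, 4001))  # True
```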

Brian the Fist
01-31-2003, 09:37 AM
Actually, we do NOT use 'rand'. As I believe I have mentioned earlier, our generator is based on Knuth's "Algorithm A":
Knuth, D. E. (1981). The Art of Computer Programming, Volume 2, page 27.
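For readers without the book to hand: assuming the citation refers to the additive number generator in that volume, the idea is a recurrence of the form X_n = (X_(n-24) + X_(n-55)) mod m. The Python sketch below follows that assumption; the lags, the modulus, and the `fill_state` seeding helper are illustrative and may differ from what the client actually does.

```python
from collections import deque

LONG_LAG = 55    # assumed long lag
SHORT_LAG = 24   # assumed short lag


def fill_state(seed, modulus):
    """Hypothetical helper: expand one integer seed into 55 starting values."""
    values = []
    x = seed % modulus
    for _ in range(LONG_LAG):
        x = (1103515245 * x + 12345) % modulus
        values.append(x)
    values[0] |= 1  # the starting values must not all be even
    return values


def additive_generator(seed, modulus=2**32):
    """Yield X_n = (X_(n-24) + X_(n-55)) mod modulus."""
    state = deque(fill_state(seed, modulus), maxlen=LONG_LAG)
    while True:
        new_value = (state[-SHORT_LAG] + state[-LONG_LAG]) % modulus
        state.append(new_value)  # the oldest value drops off the left end
        yield new_value


def take(seed, count):
    """Return the first `count` outputs for a given seed."""
    gen = additive_generator(seed)
    return [next(gen) for _ in range(count)]


print(take(1, 3) == take(1, 3))  # True: same seed, same sequence
print(take(1, 3) != take(2, 3))  # True: different seed, different sequence
```

As with any pseudorandom generator, the sequence is fully determined by the seed, which is why the choice of seed matters.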

There is no reason why re-installing should get you 'lower' values. Everything is completely random and the seed is obtained from your clock and process ID, the 2 most random things readily available on a machine.

AMD_is_logical
01-31-2003, 10:30 AM
Originally posted by Brian the Fist
> Everything is completely random and the seed is obtained from your clock and process ID, the 2 most random things readily available on a machine.

What resolution do you use for the time (with the Intel-compiled Linux version of the client)? Is it every second, every hundredth of a second, or what?

I have a small cluster, and if the nodes start up together, then they will have the same PID for the client. They will also have the same time except for clock drift. (I haven't been setting the time on the nodes, but they seem to come preset to about the right time for some time zone.) Perhaps I should set the clocks to various random times.

The biggest differences between the nodes are the network MAC, and the network address that gets assigned to each one.

Brian the Fist
01-31-2003, 12:53 PM
Yes, MAC address or IP is another possibility. Anyhow, if you have a cluster, it couldn't hurt to put a small one- to two-second delay between the starts on each node, as there is indeed a small chance some might get the same random seed in that case. Time is to the nearest second only. However, having a cluster myself here, we have not really experienced that problem ourselves.
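To make the cluster case concrete, here is a small Python sketch. The mixing functions and the example node IDs are hypothetical; the client's real seeding code is not shown in this thread. It illustrates how two nodes that boot in the same second with the same PID would collide, and how folding in a per-node value such as the MAC address would separate them.

```python
import uuid


def seed_time_pid(now_seconds, pid):
    """Hypothetical seed from a whole-second clock and a PID."""
    return (now_seconds ^ (pid << 16)) & 0xFFFFFFFF


def seed_time_pid_node(now_seconds, pid, node_id):
    """Same seed, with the low 32 bits of a per-node ID mixed in."""
    return (seed_time_pid(now_seconds, pid) ^ node_id) & 0xFFFFFFFF


boot_second = 1044000000   # both nodes read the same second
client_pid = 312           # both nodes give the client the same PID
node_a = 0x00A0C9112233    # made-up MAC addresses for two nodes
node_b = 0x00A0C9112234

seed_a = seed_time_pid(boot_second, client_pid)
seed_b = seed_time_pid(boot_second, client_pid)
print(seed_a == seed_b)  # True: identical seeds, so identical work

mixed_a = seed_time_pid_node(boot_second, client_pid, node_a)
mixed_b = seed_time_pid_node(boot_second, client_pid, node_b)
print(mixed_a != mixed_b)  # True: the node ID breaks the tie

# On a real machine the node ID could come from uuid.getnode(), which
# returns the hardware address as an integer (or a random 48-bit value
# if no address can be found).
print(hex(uuid.getnode()))
```

Staggering the start times by a second or two achieves the same effect without any code change, since the clock is read to the nearest second.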

m0ti
02-01-2003, 07:02 AM
Actually, Dyyrayth is helping me with analyzing the data.

We're looking at how long (i.e. how many folds) it takes to improve upon RMS values, and how large each improvement is.

The results will most likely vindicate the RNG used. They should also make it possible to predict the number of folds needed to beat a certain RMS more accurately.

In any case, the results should be interesting.

Brian the Fist
02-01-2003, 10:15 AM
We have already done an extensive research paper on the subject (the Proteins 2002 article in the reference list in the Science area of the web site), which goes into more detail on this than you possibly could. If you are interested, grab a copy of the article and take a look (I can send it as a PDF, I think, if it's not already on the web site somewhere).
And it has nothing to do with the RNG, the distribution of RMSD fits an 'extreme value distribution' (look for that on google for info)
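A quick way to get a feel for this is to simulate it. The Python sketch below assumes, purely for illustration, that each fold's RMSD is an independent draw from a normal distribution with mean 12.0 and standard deviation 1.5; those numbers are made up and are not taken from the article. It records every time the best-so-far value improves.

```python
import random


def improvements(rmsd_values):
    """Return (fold_number, new_best) each time the running best improves."""
    records = []
    best = float("inf")
    for fold_number, value in enumerate(rmsd_values, start=1):
        if value < best:
            best = value
            records.append((fold_number, best))
    return records


rng = random.Random(2003)  # fixed seed so the run is repeatable
samples = [rng.gauss(12.0, 1.5) for _ in range(200_000)]  # hypothetical RMSDs

for fold_number, best in improvements(samples):
    print(f"fold {fold_number:>7}: best RMSD so far {best:.3f}")
```

For independent draws from any continuous distribution, the expected number of new bests in n folds is about ln(n) + 0.58, so improvements become rare very quickly even with a perfectly good random generator.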

m0ti
02-01-2003, 01:14 PM
I was thinking along the lines of pulling the results from the stats for making predictions.

In any case, I'll have a look.

> And it has nothing to do with the RNG, the distribution of RMSD fits an 'extreme value distribution' (look for that on google for info)

Yes, I remember you mentioning that. We had some concerns about RMS's behavior, which I'm treating as only dependent on the protein, and the RNG.

For example, on the last protein, from the stats Dyyrayth has provided me with for my own results: over the first 1.8M folds (approximately) I made on the protein, my best RMS improved from around 10.4 to 8.29. After that point it took me another 4M folds to improve to 8.26, and then another 3M folds to improve on that, still 8.26 after rounding (improvement was 0.002575).

This might be expected if the peak in the distribution was very very steep; there's no way to know that without getting more results. In any case, my best RMS was beaten by quite a lot of people so perhaps there is something strange going on... (pure speculation)

Mind you, this obviously has no significance until I assemble results for a large number of users and can look for statistically valid behaviour.