View Single Post
  #63 (permalink)  
Old 01-11-2012, 10:37 AM
orz
Guest
 
Posts: n/a
Default Re: Fast and small Multiplicative Random number generator

On Jan 10, 4:38 pm, sfuerst <svfue...@nospicedham.gmail.com> wrote:
> On Jan 6, 3:31 pm, orz <cdh...@nospicedham.gmail.com> wrote:
>
>
>
>
>
>
>
>
>
> > On Jan 6, 11:31 am, sfuerst <svfue...@nospicedham.gmail.com> wrote:

>
> > > > The interstate correlation seems bad at all addresses, including
> > > > sustained interstate correlation. With a statesize of only 2**128
> > > > that will likely cause problems in a number of parallel usage
> > > > scenarios. Though in some scenarios that can be made up for by the
> > > > use of the address. Which is a practice that I disapprove of - it
> > > > makes results non-reproducible, breaks some uses (including some ofmy
> > > > RNG tests...), and makes testing harder (I could, for instance, have
> > > > missed the problems that occur at some addresses). In short, it makes
> > > > everything messier by violating the abstraction layer I expect RNGsto
> > > > obey. The benefits it provides, while potentially comforting for some
> > > > multithreaded users, are of very little utility when paired with high
> > > > quality RNGs that are seeded correctly - I attempt to provide similar
> > > > but more flexible benefits to my (non-existant) users through a
> > > > combination of good RNG general properties (for that kind of purpose
> > > > statespace size and interseed correlations in particular) and
> > > > intelligent seeding interfaces.

>
> > > The advantage of using the address is that no communication is
> > > required at all between threads to guarantee that they get different
> > > sequences. Sequence-dividing schemes run into problems when you have
> > > threads that fork off other threads, and the resulting number of users
> > > is initially unknown. (Partitioning the seed-space becomes
> > > impossible.) The other issue is multi-machine codes using something
> > > like MPI. There, you have a number for each "rank", and need it to
> > > make the rng give a different sequence. (That is the commented out
> > > rsi as second parameter option.)

>
> > > Of course, since it seems that the quality of the resulting stream is
> > > strongly affected by the initial xor value, something needs to be
> > > fixed first.

>
> > I'm aware of the benefits. Fundamentally, the issue of trying to get
> > each thread an uncorrelated sequence is identical to the issue of
> > trying to get each run an uncorrelated sequence. You can solve it in
> > an automated fashion for the specific case of per-thread RNGs in
> > shared address space TLS by violating the abstraction layer, but that
> > is no help at all on sequential runs, distributed runs, parallel runs,
> > sequential threads, etc. Users who need decent random number
> > generators may be doing monte carlo simulations or whatever that want
> > a lack of correlation over any of those domains. Users who really
> > know what they're doing will be hurt more than helped by your method
> > (they might appreciate simplified thread seeding, but they don't want
> > to have to remember the details of how the abstraction layer is
> > violated), and those who don't will only be helped in a limited set of
> > scenarios.

>
> I don't think things are that bad. Often the simple case will "just
> work". Leave the complex interface for those who really need it, and
> have sensible defaults that work for most users.
>
> > You effectively get an extra 4-8 bytes of static state, without having
> > to store it in memory or even read it from memory, and those extra
> > bits of state are guaranteed to never share the same value as any
> > other RNG that is instantiated at the same time in the same address
> > space.

>
> I know. It feels like cheating... :-) It is the only way I know to
> get a random number generator to have a larger state-space than that
> bounded by the size of its seed. Breaking fundamental limits is
> always entertaining.
>

Actually, on second thought I want to qualify that slightly - you have
no control over the value of that state, so you have to
pessimistically assume it will be near-zero entropy, effectively far
less than than a pointer size worth of entropy. Really, the only
thing you can count on is uniqueness (for the life the RNG instance).
Which is, admittedly, pretty sweet. But I shouldn't have described it
in terms of the pointer size, size the pointer size is only a measure
of the maximum entropy involved, and has very little to do, in
practice, with the actual entropy involved.
>
>
>
>
>
>
>
> > My approach shares some similarities but is pretty different:
> > 1. I have a large enough statespace so that initializing them to a
> > random state is probabilistically good enough to eliminate
> > correlations over a reasonable domain, regardless of the nature of the
> > domain. That generally means at least 3 variables worth of state for
> > 16 to 64 bit RNGs, preferably 4 or 5. And definitely more for 8 bit
> > RNGs.
> > 2a. I supply an automatic seeding mechanism to users. They don't have
> > to use it, but I encourage its use for cases where neither crypto
> > security nor reproducible behavior is required. They can still
> > reproduce the RNG output by copying or serializing the RNG state, just
> > not by saving the seed since there is no visible seed in the
> > autoseeding case. The automatic seeding mechanism initializes an
> > internal RNG with a function of the program start time, the current
> > time, the current value of the stack pointer (which is unique per-
> > thread for the life of the thread), the address of the RNG it was
> > called on (the same basic thing you were using!, but without the
> > problems), the number of RNGs initialized within the current thread
> > (if TLS variables are supported on the platform), a few other things,
> > and on windows some platform specific stuff (I ought to add platform
> > specific stuff for unix sometime...). The internal RNG then
> > initializes the users RNG to a random state, in a fashion that the
> > users RNG algorithm can customize if it needs to (to preclude some
> > states, for instance). Some of those are precomputed at application
> > launch time, others are done at autoseeding time. The internal RNG
> > algorithm is chosen to be semi-fast (on 64 bit anyway... slower on
> > other CPUs), very high quality, guaranteed to have no short cycles,
> > have high quality seeding, be quick to seed, and be capable of seeding
> > from oddly structured data.
> > 2b. I also supply some more complex tools for high quality seeding for
> > cryptographic users and power users, who hopefully know what they're
> > doing. And I also supply a very simple seed-from-a-64-bit-integer
> > interface, which sometimes maps to an RNGs reference seeding algorithm
> > (if it has one that uses integers in the range supplied) and other
> > times maps to something similar to the autoseeder except
> > reproducible. All RNGs get all of those seeding mechanisms on an
> > identical interface. Additionally many RNGs also offer other custom
> > seeding mechanisms, and a few have default states that they go to
> > deterministically if not seeded. Those without default states require
> > a seed in their constructor to prevent accidents unless the user
> > explicitly requests otherwise (by passing a magic token to the
> > constructor) or is using the "raw" version of the RNG without the
> > normal interface. If an RNG that does not have a default state is
> > used without being seeded then its behavior is not defined.

>
> This is the "standard" technique for parallel generators... so there
> is nothing much wrong with it. That doesn't mean that other (perhaps
> simpler) methods don't exist. Sometimes the new methods do end up
> being better... That's how progress happens. :-)
>

Is it? I've seen several of the individual elements of my scheme in
other packages (TestU01, GSL, Boost, etc), but never more than one or
two of the elements in a single package, and a few elements I've never
seen elsewhere. I've seen your xor-with-address scheme several times
on the web, always combined with xorshift+multiply pairs, but never in
a library that I can recall.

I think that interface-wise, for the simple case, my scheme is as
simple as anything that allows multiple RNG states (ie simpler than
the libraries aimed at research, but not as simple as libc rand/
srand). Implementation-wise, my scheme is not simple... there ends up
being a call to the system RNG at program start, and some additional
hashing on a per-autoseeding basis and a tiny bit on a per-word-of-RNG-
state-autoseeded basis. For most uses it's overkill, but since many
users have a hard time figuring out just how broad a scope they need
lack of correlation over, and since the resources used are cheap, I
shoot for a little overkill.
>
>
>
>
>
>
>
> > No interthread or interprocess or intermachine communication is
> > required. It mostly works equally well for sequential RNG
> > instantiations in the same thread, in different threads, on sequential
> > runs, on simultaneous runs, on distributed machines, etc. The weakest
> > case at the moment is if a thread is created, then destroyed, then
> > very quickly recreated with the same stack address and TLS address.
> > Simultaneous launches on non-windows platforms are also weak. Every
> > other case that I can think of is strong, though not up to
> > cryptographic standards. The platform-specific stuff is done just
> > once per application launch. The requirements are:
> > 1. Either the library initialization function or the first automatic
> > seeding is performed before additional threads are launched. That
> > will tend to naturally be true for programs that don't go around doing
> > crazy things with the launch and linking processes.
> > 2. the degree of lack of correlation desired across the domain not
> > exceed the square root of the statespace size. That is almost
> > identical to a fundamental requirement for high quality RNGs that
> > would apply regardless, so it's no big deal.

>
> You can fix the rapid recreate problem by using a guid. Provided that
> the guid updates rapidly enough, there is no way other users can have
> the same initial seed state. The trick is to make the creation stall
> long enough to fulfill this speed limitation. Since thread creation
> is (meant to be) slow and rare, this isn't much of a constraint in
> practice.


If fixed the non-windows issues since my last post. I'm still not
sure what to do about the thread recreation issue. One solution would
be to use some sort of locking mechanism or atomic math, maybe
EnterCriticalSection or InterlockedIncrement on Win32/MSVC. But I'd
like to stay very portable... I currently use no libraries aside from
the C++ standard library (and, on windows, kernel32.lib), and nothing
compiler or OS or ISA specific without ifdefs and fallback paths on
unrecognized systems (admittedly the fallbacks are not very good for
OSes that are neither windows nor *nix...). Another solution would
be a malloc(1) on a per-thread basis, with no corresponding free. I'm
not familiar with "guid", but it appears to refer to a variety of
different algorithms for generating 128 bit numbers intended to be
unique across process boundaries and often network boundaries.
Apparently I would call UuidCreateSequential on win32, I'm not sure
what I would do on *nix? But that isn't really looking any better to
me in terms of portability or speed than the other options. Though it
could replace the RtlGenRandom and/or /dev/random usage on program
start.

> > The typical multithreaded user writes something like:

>
> > typedef PractRand::RNGs::Polymorphic::jsf64 RNG_TYPE;
> > __thread RNG_TYPE rng(PractRand::SEED_AUTO);

>
> > ...and is done, with the same basic benefits as your approach. The
> > more sophisticated user can do more sophisticated things though, no
> > one will be fooled in to thinking that a nondeterministic case should
> > behave deterministically, and serializing the RNG state for later
> > reproduction will work properly (which will let some of my more exotic
> > RNG tests behave correctly on it).

>
> I'll admit that the interface (if you can call it that) for the form
> of rng I've proposed isn't well-suited for C++. However, since I
> don't like C++ very much, I consider that a feature rather than a
> bug. :-P
>

I don't think it has issues with C++ really. Well, I suppose arguably
with object orientation in general since its state can't (or
shouldn't) really be treated like a normal object. But C++ would
handle it fine a simple function instead of a class. What I would say
is that it lacks reproducibility (via either manual seeding or state
serialization). RNG users sometimes want reproducibility for
debugging purposes, network games and apps, hashing purposes, probably
a few other uses I can't think of atm.
>
>
>
>
>
> > > Notice how it isn't using the adc instruction at all. It also seems
> > > to be rather poor code. :-( The above benchmarks as taking 5.8s to do
> > > a billion random numbers. The asm version I gave takes 4.2s Note
> > > that the fastest of the other rng's you gave takes 4.6s on this
> > > machine. I don't begrudge the compile much though... even simple
> > > rearrangements of the asm instructions cause the time taken to vary by
> > > up to 30%. It is very hard to come up with the right ordering to
> > > overlap memory accesses with computation.

>
> > 4.6 seconds, ow! I know performance at the high end gets weird, but
> > still I had hoped that at least one of the RNGs I posted would do
> > better. Several of them reliably take less then 2 seconds for that on
> > both my Core i5 2500 and my old Core2 3 GHrz using both 64 bit gcc and
> > less than twice that using 32 bit MSVC. All w/o asm.

>
> My computer is old and slow. Performance is relative, remember.


Yeah. Sigh. Even just looking at it as relative performance... Is
your optimizer allowed to inline your stuff and/or remap register
names, or does it have to stick to calling conventions and/or the code
as you wrote it? On my benchmarks, if I force it to not inline the
calls then they hit a performance maximum around 300 million calls per
second (though it varies quite a bit, from a low around 100 to a high
around 500, it's usually between 250 and 325), regardless of how fast
the RNG itself is. Even return 0 will get capped at that rate. If I
let it inline, even then it will still hit a cap around 1.1 billion
calls per second even for very simple RNGs like "int random() {int tmp
= a; a += b; return tmp;}". A simple "return 0;" is faster still, but
by so much that I suspect it optimizes away almost the entire calling
loop in that instance. And a few of the RNGs I've posted in this
thread get 0.7 to 1.0 billion calls per second here - just under what
looks like the maximum possible performance on this CPU / compiler
pair. MSVC results are very similar, though it will inline even
across compilation unit boundaries, unlike gcc (4.5.4) - I only see
the lower cap on MSVC if the call goes through a vtable.
In my attempts to achieve more meaningful performance measures I've
tried to look for aggregate test results across multiple compilers,
multiple CPU generations, multiple structures of calling loops, etc.
I had thought it helped a little, but it's hard to say really, and
often that gets tiresome and I'll skip most of it for extended periods
of time.

> > > Once I find a version with slightly better properties, I'll think
> > > about finding something the C compiler can convert into good asm.
> > > Previously, the trick was to use a union with an intrinsic 128 bit
> > > unsigned integer type.

>
> > > > ---
> > > > BTW, side note, is that the correct usage of movabs? In your code it
> > > > is used to move a constant in to a register, but my quick googling of
> > > > it suggested that it was only for reading from memory at absolute
> > > > addresses? Really I don't know much about x86-64 asm so I could
> > > > easily have just misunderstood what I read.

>
> > > "movabs "is the instruction gcc uses to move a 64 bit immediate into a
> > > 64 bit register, as you can see from the above. Don't ask me why they
> > > call it that rather than just plain old "mov". :-/

>
> > > Steven

>
> > The following variant of that:
> > Word tmp = high;
> > tmp ^= tmp >> (OUTPUT_BITS / 2);
> > tmp *= K;
> > tmp ^= rotate(tmp ^ low, OUTPUT_BITS/3);
> > Word old = low;
> > low += K;
> > high += old + ((low < K) ? 1 : 0);
> > return tmp;

>
> > ...does much better on statistical tests when used on 32 bit integers
> > instead of 64 bit integers. For the full 64 bit versions there is no
> > way to distinguish quality as both that and the original pass all
> > tests I've tried (when the xored value wasn't low-entropy), but
> > probably the 32 bit version is representative. It still has the
> > interstate correlation issues, but those are very hard to get rid of
> > on fast random access RNGs, and no worse than the ones in the CLCG I'm
> > looking to replace. Adding a second multiply and another xorshift
> > gets the interstate correlation issue under control without hurting
> > performance intolerably - users who want performance faster than that
> > can damn well use a non-random-access RNG. I might also try combining
> > successive elements like my hashed LFSR did to see if I can eliminate
> > the need for a 2nd multiply that way. I'm thinking of adding a third
> > word of state doing something dumb to get the statespace a little
> > bigger, but at this point really it's looking good enough to be worth
> > replacing the CLCG.

>
> I tried something much like the above, and it wasn't very good. It
> either had poor randomness, or failed to be fast enough to bother
> with.
>

That change had no performance impact on my computer (it's becoming
increasing apparent that we see very different performance impact from
changes), and did well on statistical tests when used on 32 bit
integers where the original (when changed to use 32 bit integers) did
poorly (IIRC failing after 5 minutes or so). A few other tweaks got
it to the point of passing an 8 hour test on 32 bit, I didn't try any
longer tests.
>
> > Hey, do you have any problems with people using any of this stuff,
> > modifying it, etc, potentially in closed source or commercial apps? I
> > try to keep all of my recommended RNGs public domain or very close to
> > public domain.

>
> No problems. It is just math, after all.
>
> Steven


Excellent, I'll probably be using something in the general ballpark of
this stuff for random access RNGs for the time being, until I can
figure out a way to get good random access without multiplication.



On Jan 10, 5:19*pm, sfuerst <svfue...@nospicedham.gmail.com> wrote:
> On Jan 5, 7:42*pm, sfuerst <svfue...@nospicedham.gmail.com> wrote:
>
>
>
>
>
>
>
>
>
> > This is the rng I'm investigating now. *(Still running BigCrush tests
> > on various 32bit functions of the 64 bit output.) * It doesn't fit all
> > of your criteria, but it fits most, and is very very fast.

>
> > It is a hash of the output of a 128 bit LCG. *The LCG uses a
> > multiplier of 2^64 + 1, so is easy to step. *Fast forwarding, and
> > reversing are trivial. *The hash is the most complex thing I've
> > managed to stuff into the cycles spent accessing L1 in the LCG update.

>
> > My constraints are a little different than yours, so more is effort
> > has been "spent" making sure parallel users can be assured of
> > different sequences. *(Hence the imul as the fourth instruction.)

>
> > .align 16
> > .globl rng_new
> > .type *rng_new,@function
> > rng_new:
> > * * * * movabs *$0x6595a395a1ec531b, %rcx

>
> > * * * * mov * *8(%rdi), %rax

>
> > * * * * # Or you could have xor %rsi, %rax if process number isthe second
> > parm
> > * * * * xor * *%rdi, %rax
> > * * * * imul * %rcx, %rax

>
> > * * * * mov * *(%rdi), %rdx * # The LCG update
> > * * * * add * *%rcx, (%rdi)
> > * * * * adc * *%rdx, 8(%rdi)

>
> > * * * * xor * *%rax, %rdx
> > * * * * ror * *$29, %rdx * *# Can use a rotate here dueto the extra entropy
> > from %rdx
> > * * * * xor * *%rdx, %rax
> > * * * * mov * *%rax, %rdx
> > * * * * shl * *$31, %rdx
> > * * * * xor * *%rdx, %rax
> > * * * * mov * *%rax, %rdx
> > * * * * shr * *$23, %rdx
> > * * * * xor * *%rdx, %rax
> > * * * * retq
> > .size * rng_new, .-rng_new

>
> > I've also looked at 192 and 256 bit variants, but getting a good fast
> > hash, or good fast LCG with that many bits is seemingly futile. *I'm
> > willing to be proved wrong though. :-)

>
> > Steven

>
> Okay... after quite a bit of effort, I've found something much, much
> better.
>
> This is three different rng's in one:
>
> .align 16
> .globl rng_hash_128
> .type *rng_hash_128,@function
> rng_hash_128:
> * * * * mov * *8(%rdi), %rax
>
> * * * * movabs $0x6595a395a1ec531b, %rcx
>
> # * Option 1)
> # * Non-threaded, fastest. *No xor instruction used.
>
> # * Option 2)
> # * Threaded, use the address of the seed as a nonce.
> # * * *xor * *%rdi, %rax
>
> # * Option 3)
> # * Threaded, pass a nonce as a second parameter.
> # * * * xor * *%rsi, %rax
> # *or have it stored in a larger seed:
> # * * *xor * *16(%rdi), %rax
>
> * * * * mov * *(%rdi), %rsi
> * * * * mul * *%rcx
>
> * * * * add * *%rcx, (%rdi)
> * * * * adc * *%rsi, 8(%rdi)
>
> * * * * xor * *%rsi, %rax
> * * * * xor * *%rdx, %rax
> * * * * mul * *%rcx
> * * * * add * *%rsi, %rax
> * * * * add * *%rdx, %rax
>
> * * * * retq
> .size * rng_hash_128, .-rng_hash_128
>
> For non-threaded use, pick the one without the extra xor instruction.
> This is fastest, taking 3.77s to output 10^9 64bit random numbers on
> this machine.


I get about 2.5 GB/s when I stick that in as inline asm on gcc. About
300 million calls per second, about 10 cycles per call. I used this
code:
Uint64 rv;
asm volatile (
"mov 8(%%rdi), %%rax \n"
"movabs $0x6595a395a1ec531b, %%rcx \n"
"mov (%%rdi), %%rsi \n"
"mul %%rcx \n"
"add %%rcx, (%%rdi) \n"
"adc %%rsi, 8(%%rdi) \n"
"xor %%rsi, %%rax \n"
"xor %%rdx, %%rax \n"
"mul %%rcx \n"
"add %%rsi, %%rax \n"
"add %%rdx, %%rax \n"
:"=a"(rv)//outputs
:"D"(this)//inputs
:"memory", "%rcx", "%rsi", "%rdx"//clobbers
);
return rv;
If I left off the "volatile" keyword then the optimizer seemed to go
berserk on non-vtable calls to it for some reason, though vtable calls
remained the same. Probably I screwed up the constraints somehow.

I also tried implementing it in C, but neither gcc nor MSVC will let
me use a 128 bit integer. I did implement it in C by scaling it down
to 32 bit instead of 64 bit and compiling on a 32 bit compiler. It
wasn't particularly fast that way relative to other 32 bit RNGs.

Count it as one more bit of speed-for-you != speed-for-me, not even
relative speed. Either that or my inline asm sucked, and my compiler
sucked too.

> If you have multiple threads, then pick option two. *The rng will use
> the different addresses of the seeds to make sure that each thread
> gets a different output sequence. *This takes 4.05s to output 10^9
> 64bit random numbers on this machine.
>
> If you are using multiple computers, or require multithreaded
> repeatability, then pick option three. *Pass a variable describing the
> thread/machine-number to the rng. *(Another option is to store this
> extra state into a slightly expanded seed, and use "xor * *16(%rdi),
> %rax" instead.)
>
> In effect, options two and three choose a 2^128 subsequence from a
> 2^192 element sequence.
>
> This rng passes TestU01 BigCrush with the top 32 bits, bottom 32 bits,
> middle 32 bits, and alternating bottom and top 32 bit bitstream, with
> a total of zero failures.


My testing is reporting good quality for it as well. I'm testing both
that algorithm in asm, plus that algorithm in C scaled down to operate
on 32 bit integers instead of 64.

> It works by hashing the output of a 128 bit LCG, with a simple
> multiplier of 2^64 + 1 and addend of a "random" 64 bit odd constant.
> Feel free to try some other constant. *As long as it has roughly half
> the bits set, and is odd, it should work well. * Since the base is a
> LCG, fast-forwarding or reversing the rng is trivial.
>
> The hash is based on merging the upper and lower 64 bit halves of two
> 64x64->128 bit multiplies. *This loses entropy... however since we
> also merge in another part of the seed, the extra entropy from that
> compensates. *The result is an irreversible hash that seems to have
> very good statistical properties.
>
> Why is this rng so fast?
>
> Most normal random number generators use a mapping x_new = f(x), where
> the function f defines how the seed state x is altered.
>
> This rng instead of outputting (part of) the seed, outputs a hash g(x)
> of it instead. *Thus f(x), and g(x) need to be calculated in every
> call. *This looks obviously slower than the other method. *However,
> two things save us. *The first is that modern processors are out-of-
> order, and can execute many instructions in parallel. *Thus the
> calculation of f() and g() can overlap. *The second is that the
> constraints of f() and g() are different.
>
> f() needs to map to a large cycle of states, and have good statistical
> randomness. *g() just needs to have good statistical randomness.
>
> So by picking a very fast f(), with poor randomness, and a good-enough
> g() we can obtain a very fast rng that still passes stringent
> statistical tests. *The problem is finding a good fast hash.
> Fortunately, the extended-precision multiply used above works for g().
> (A fast yet poor quality LCG easily works for f() )
>
> Note: when re-implementing the above, be careful with the instruction
> ordering. *Even small rearrangements can alter the speed by tens of
> percent.
>
> Steven


On the complex-output-functions = better-parallelism argument I'm
pretty meh. Yeah, the output function can be done in parallel with
the state transition function. Yeah, the output function can be done
in parallel with the next output function if the state transition is
fast enough and the calling code is nice enough and the scheduler is
psychic enough. But there are severe limits to how much can be done
at once, both due to limits of execution resources (large integer
multiplication in particular takes a huge transistor & power budget)
and due to scheduling difficulty. My understanding is that most CPUs
can't maintain a IPC more than 4 on any remotely real world code no
matter how parallel it is, far lower if lots of complex operations
like large integer multiplication are involved. Notice how recent CPU
generations haven't really been getting significantly more
superscalar? These days heat is the fundamental limiting factor, and
executing all those operations will produce the same amount of heat
regardless of how how smart the scheduler is... in fact a smarter
scheduler will just add overhead to the thermal budget, so the next
few CPU generations are not likely to emphasize OOO any more than the
last few have (in the desktop world anyway...).

Also, most fast RNGs with simple output functions have highly parallel
state transition functions, typically 2 to 4 simple operations. Your
code there is a critical path of 2 simple operations on the state
transition function, and 6 operations on the output function, two of
which are high latency instructions. mul 64x64->128 is reported to be
5 cycle latency on recent AMD, Intel's most recent optimization guide
seems to claim a latency of 14-18 cycles on that but probably I'm
misreading that somehow as it seems improbably high... a quick
googling suggests it might be closer to 3 cycles. Since you have two
of them in your critical path I end up with a best-case performance on
the order of 6-10 cycles). If the output never actually gets used,
that's a different story... the simple transition function might
dominate then.
Reply With Quote