Go Back   Rhinocerus > Newsgroup > Newsgroup comp.lang.* 2 > Newsgroup comp.lang.asm.x86

Reply
 
Thread Tools Display Modes
  #61 (permalink)  
Old 01-11-2012, 12:19 AM
sfuerst
Guest
 
Posts: n/a
Default Re: Fast and small Multiplicative Random number generator

On Jan 5, 7:42*pm, sfuerst <svfue...@nospicedham.gmail.com> wrote:

> This is the rng I'm investigating now. *(Still running BigCrush tests
> on various 32bit functions of the 64 bit output.) * It doesn't fit all
> of your criteria, but it fits most, and is very very fast.
>
> It is a hash of the output of a 128 bit LCG. *The LCG uses a
> multiplier of 2^64 + 1, so is easy to step. *Fast forwarding, and
> reversing are trivial. *The hash is the most complex thing I've
> managed to stuff into the cycles spent accessing L1 in the LCG update.
>
> My constraints are a little different than yours, so more is effort
> has been "spent" making sure parallel users can be assured of
> different sequences. *(Hence the imul as the fourth instruction.)
>
> .align 16
> .globl rng_new
> .type *rng_new,@function
> rng_new:
> * * * * movabs *$0x6595a395a1ec531b, %rcx
>
> * * * * mov * *8(%rdi), %rax
>
> * * * * # Or you could have xor %rsi, %rax if process number is the second
> parm
> * * * * xor * *%rdi, %rax
> * * * * imul * %rcx, %rax
>
> * * * * mov * *(%rdi), %rdx * # The LCG update
> * * * * add * *%rcx, (%rdi)
> * * * * adc * *%rdx, 8(%rdi)
>
> * * * * xor * *%rax, %rdx
> * * * * ror * *$29, %rdx * *# Can use a rotate here due to the extra entropy
> from %rdx
> * * * * xor * *%rdx, %rax
> * * * * mov * *%rax, %rdx
> * * * * shl * *$31, %rdx
> * * * * xor * *%rdx, %rax
> * * * * mov * *%rax, %rdx
> * * * * shr * *$23, %rdx
> * * * * xor * *%rdx, %rax
> * * * * retq
> .size * rng_new, .-rng_new
>
> I've also looked at 192 and 256 bit variants, but getting a good fast
> hash, or good fast LCG with that many bits is seemingly futile. *I'm
> willing to be proved wrong though. :-)
>
> Steven


Okay... after quite a bit of effort, I've found something much, much
better.

This is three different rng's in one:

..align 16
..globl rng_hash_128
..type rng_hash_128,@function
rng_hash_128:
mov 8(%rdi), %rax

movabs $0x6595a395a1ec531b, %rcx

# Option 1)
# Non-threaded, fastest. No xor instruction used.

# Option 2)
# Threaded, use the address of the seed as a nonce.
# xor %rdi, %rax

# Option 3)
# Threaded, pass a nonce as a second parameter.
# xor %rsi, %rax
# or have it stored in a larger seed:
# xor 16(%rdi), %rax

mov (%rdi), %rsi
mul %rcx

add %rcx, (%rdi)
adc %rsi, 8(%rdi)

xor %rsi, %rax
xor %rdx, %rax
mul %rcx
add %rsi, %rax
add %rdx, %rax

retq
..size rng_hash_128, .-rng_hash_128

For non-threaded use, pick the one without the extra xor instruction.
This is fastest, taking 3.77s to output 10^9 64bit random numbers on
this machine.

If you have multiple threads, then pick option two. The rng will use
the different addresses of the seeds to make sure that each thread
gets a different output sequence. This takes 4.05s to output 10^9
64bit random numbers on this machine.

If you are using multiple computers, or require multithreaded
repeatability, then pick option three. Pass a variable describing the
thread/machine-number to the rng. (Another option is to store this
extra state into a slightly expanded seed, and use "xor 16(%rdi),
%rax" instead.)

In effect, options two and three choose a 2^128 subsequence from a
2^192 element sequence.

This rng passes TestU01 BigCrush with the top 32 bits, bottom 32 bits,
middle 32 bits, and alternating bottom and top 32 bit bitstream, with
a total of zero failures.

It works by hashing the output of a 128 bit LCG, with a simple
multiplier of 2^64 + 1 and addend of a "random" 64 bit odd constant.
Feel free to try some other constant. As long as it has roughly half
the bits set, and is odd, it should work well. Since the base is a
LCG, fast-forwarding or reversing the rng is trivial.

The hash is based on merging the upper and lower 64 bit halves of two
64x64->128 bit multiplies. This loses entropy... however since we
also merge in another part of the seed, the extra entropy from that
compensates. The result is an irreversible hash that seems to have
very good statistical properties.


Why is this rng so fast?

Most normal random number generators use a mapping x_new = f(x), where
the function f defines how the seed state x is altered.

This rng instead of outputting (part of) the seed, outputs a hash g(x)
of it instead. Thus f(x), and g(x) need to be calculated in every
call. This looks obviously slower than the other method. However,
two things save us. The first is that modern processors are out-of-
order, and can execute many instructions in parallel. Thus the
calculation of f() and g() can overlap. The second is that the
constraints of f() and g() are different.

f() needs to map to a large cycle of states, and have good statistical
randomness. g() just needs to have good statistical randomness.

So by picking a very fast f(), with poor randomness, and a good-enough
g() we can obtain a very fast rng that still passes stringent
statistical tests. The problem is finding a good fast hash.
Fortunately, the extended-precision multiply used above works for g().
(A fast yet poor quality LCG easily works for f() )

Note: when re-implementing the above, be careful with the instruction
ordering. Even small rearrangements can alter the speed by tens of
percent.

Steven
Reply With Quote
Alt Today
Advertising
 
and become member of Rhinocerus
Standard Sponsored Links

  #62 (permalink)  
Old 01-11-2012, 12:21 AM
sfuerst
Guest
 
Posts: n/a
Default Re: Fast and small Multiplicative Random number generator

On Jan 8, 2:51*pm, Frank Kotler <fbkot...@myfairpoint.net> wrote:
> orz wrote:
> > Hey, do you have any problems with people using any of this stuff,
> > modifying it, etc, potentially in closed source or commercial apps? *I
> > try to keep all of my recommended RNGs public domain or very close to
> > public domain.

>
> Okay...
>
> I guess you guys have quit using rdtsc...


rdtsc doesn't have enough entropy to be usable in a rapidly called
rng. Instead, what you should do is gather the few bits of entropy
that rdtsc has, and reseed the entire state of the rng once you have
enough.

Unfortunately, rdtsc is a rather slow instruction, so calling it every
rng invocation is painful.

> something's been nagging at my
> mind... on my machine (P4), rdtsc always returns an even number...


That seems really annoying. The most obvious assumption you could
make is that the lowest bit of the cycle count is the most random.

Steven
Reply With Quote
  #63 (permalink)  
Old 01-11-2012, 10:37 AM
orz
Guest
 
Posts: n/a
Default Re: Fast and small Multiplicative Random number generator

On Jan 10, 4:38 pm, sfuerst <svfue...@nospicedham.gmail.com> wrote:
> On Jan 6, 3:31 pm, orz <cdh...@nospicedham.gmail.com> wrote:
>
>
>
>
>
>
>
>
>
> > On Jan 6, 11:31 am, sfuerst <svfue...@nospicedham.gmail.com> wrote:

>
> > > > The interstate correlation seems bad at all addresses, including
> > > > sustained interstate correlation. With a statesize of only 2**128
> > > > that will likely cause problems in a number of parallel usage
> > > > scenarios. Though in some scenarios that can be made up for by the
> > > > use of the address. Which is a practice that I disapprove of - it
> > > > makes results non-reproducible, breaks some uses (including some ofmy
> > > > RNG tests...), and makes testing harder (I could, for instance, have
> > > > missed the problems that occur at some addresses). In short, it makes
> > > > everything messier by violating the abstraction layer I expect RNGsto
> > > > obey. The benefits it provides, while potentially comforting for some
> > > > multithreaded users, are of very little utility when paired with high
> > > > quality RNGs that are seeded correctly - I attempt to provide similar
> > > > but more flexible benefits to my (non-existant) users through a
> > > > combination of good RNG general properties (for that kind of purpose
> > > > statespace size and interseed correlations in particular) and
> > > > intelligent seeding interfaces.

>
> > > The advantage of using the address is that no communication is
> > > required at all between threads to guarantee that they get different
> > > sequences. Sequence-dividing schemes run into problems when you have
> > > threads that fork off other threads, and the resulting number of users
> > > is initially unknown. (Partitioning the seed-space becomes
> > > impossible.) The other issue is multi-machine codes using something
> > > like MPI. There, you have a number for each "rank", and need it to
> > > make the rng give a different sequence. (That is the commented out
> > > rsi as second parameter option.)

>
> > > Of course, since it seems that the quality of the resulting stream is
> > > strongly affected by the initial xor value, something needs to be
> > > fixed first.

>
> > I'm aware of the benefits. Fundamentally, the issue of trying to get
> > each thread an uncorrelated sequence is identical to the issue of
> > trying to get each run an uncorrelated sequence. You can solve it in
> > an automated fashion for the specific case of per-thread RNGs in
> > shared address space TLS by violating the abstraction layer, but that
> > is no help at all on sequential runs, distributed runs, parallel runs,
> > sequential threads, etc. Users who need decent random number
> > generators may be doing monte carlo simulations or whatever that want
> > a lack of correlation over any of those domains. Users who really
> > know what they're doing will be hurt more than helped by your method
> > (they might appreciate simplified thread seeding, but they don't want
> > to have to remember the details of how the abstraction layer is
> > violated), and those who don't will only be helped in a limited set of
> > scenarios.

>
> I don't think things are that bad. Often the simple case will "just
> work". Leave the complex interface for those who really need it, and
> have sensible defaults that work for most users.
>
> > You effectively get an extra 4-8 bytes of static state, without having
> > to store it in memory or even read it from memory, and those extra
> > bits of state are guaranteed to never share the same value as any
> > other RNG that is instantiated at the same time in the same address
> > space.

>
> I know. It feels like cheating... :-) It is the only way I know to
> get a random number generator to have a larger state-space than that
> bounded by the size of its seed. Breaking fundamental limits is
> always entertaining.
>

Actually, on second thought I want to qualify that slightly - you have
no control over the value of that state, so you have to
pessimistically assume it will be near-zero entropy, effectively far
less than than a pointer size worth of entropy. Really, the only
thing you can count on is uniqueness (for the life the RNG instance).
Which is, admittedly, pretty sweet. But I shouldn't have described it
in terms of the pointer size, size the pointer size is only a measure
of the maximum entropy involved, and has very little to do, in
practice, with the actual entropy involved.
>
>
>
>
>
>
>
> > My approach shares some similarities but is pretty different:
> > 1. I have a large enough statespace so that initializing them to a
> > random state is probabilistically good enough to eliminate
> > correlations over a reasonable domain, regardless of the nature of the
> > domain. That generally means at least 3 variables worth of state for
> > 16 to 64 bit RNGs, preferably 4 or 5. And definitely more for 8 bit
> > RNGs.
> > 2a. I supply an automatic seeding mechanism to users. They don't have
> > to use it, but I encourage its use for cases where neither crypto
> > security nor reproducible behavior is required. They can still
> > reproduce the RNG output by copying or serializing the RNG state, just
> > not by saving the seed since there is no visible seed in the
> > autoseeding case. The automatic seeding mechanism initializes an
> > internal RNG with a function of the program start time, the current
> > time, the current value of the stack pointer (which is unique per-
> > thread for the life of the thread), the address of the RNG it was
> > called on (the same basic thing you were using!, but without the
> > problems), the number of RNGs initialized within the current thread
> > (if TLS variables are supported on the platform), a few other things,
> > and on windows some platform specific stuff (I ought to add platform
> > specific stuff for unix sometime...). The internal RNG then
> > initializes the users RNG to a random state, in a fashion that the
> > users RNG algorithm can customize if it needs to (to preclude some
> > states, for instance). Some of those are precomputed at application
> > launch time, others are done at autoseeding time. The internal RNG
> > algorithm is chosen to be semi-fast (on 64 bit anyway... slower on
> > other CPUs), very high quality, guaranteed to have no short cycles,
> > have high quality seeding, be quick to seed, and be capable of seeding
> > from oddly structured data.
> > 2b. I also supply some more complex tools for high quality seeding for
> > cryptographic users and power users, who hopefully know what they're
> > doing. And I also supply a very simple seed-from-a-64-bit-integer
> > interface, which sometimes maps to an RNGs reference seeding algorithm
> > (if it has one that uses integers in the range supplied) and other
> > times maps to something similar to the autoseeder except
> > reproducible. All RNGs get all of those seeding mechanisms on an
> > identical interface. Additionally many RNGs also offer other custom
> > seeding mechanisms, and a few have default states that they go to
> > deterministically if not seeded. Those without default states require
> > a seed in their constructor to prevent accidents unless the user
> > explicitly requests otherwise (by passing a magic token to the
> > constructor) or is using the "raw" version of the RNG without the
> > normal interface. If an RNG that does not have a default state is
> > used without being seeded then its behavior is not defined.

>
> This is the "standard" technique for parallel generators... so there
> is nothing much wrong with it. That doesn't mean that other (perhaps
> simpler) methods don't exist. Sometimes the new methods do end up
> being better... That's how progress happens. :-)
>

Is it? I've seen several of the individual elements of my scheme in
other packages (TestU01, GSL, Boost, etc), but never more than one or
two of the elements in a single package, and a few elements I've never
seen elsewhere. I've seen your xor-with-address scheme several times
on the web, always combined with xorshift+multiply pairs, but never in
a library that I can recall.

I think that interface-wise, for the simple case, my scheme is as
simple as anything that allows multiple RNG states (ie simpler than
the libraries aimed at research, but not as simple as libc rand/
srand). Implementation-wise, my scheme is not simple... there ends up
being a call to the system RNG at program start, and some additional
hashing on a per-autoseeding basis and a tiny bit on a per-word-of-RNG-
state-autoseeded basis. For most uses it's overkill, but since many
users have a hard time figuring out just how broad a scope they need
lack of correlation over, and since the resources used are cheap, I
shoot for a little overkill.
>
>
>
>
>
>
>
> > No interthread or interprocess or intermachine communication is
> > required. It mostly works equally well for sequential RNG
> > instantiations in the same thread, in different threads, on sequential
> > runs, on simultaneous runs, on distributed machines, etc. The weakest
> > case at the moment is if a thread is created, then destroyed, then
> > very quickly recreated with the same stack address and TLS address.
> > Simultaneous launches on non-windows platforms are also weak. Every
> > other case that I can think of is strong, though not up to
> > cryptographic standards. The platform-specific stuff is done just
> > once per application launch. The requirements are:
> > 1. Either the library initialization function or the first automatic
> > seeding is performed before additional threads are launched. That
> > will tend to naturally be true for programs that don't go around doing
> > crazy things with the launch and linking processes.
> > 2. the degree of lack of correlation desired across the domain not
> > exceed the square root of the statespace size. That is almost
> > identical to a fundamental requirement for high quality RNGs that
> > would apply regardless, so it's no big deal.

>
> You can fix the rapid recreate problem by using a guid. Provided that
> the guid updates rapidly enough, there is no way other users can have
> the same initial seed state. The trick is to make the creation stall
> long enough to fulfill this speed limitation. Since thread creation
> is (meant to be) slow and rare, this isn't much of a constraint in
> practice.


If fixed the non-windows issues since my last post. I'm still not
sure what to do about the thread recreation issue. One solution would
be to use some sort of locking mechanism or atomic math, maybe
EnterCriticalSection or InterlockedIncrement on Win32/MSVC. But I'd
like to stay very portable... I currently use no libraries aside from
the C++ standard library (and, on windows, kernel32.lib), and nothing
compiler or OS or ISA specific without ifdefs and fallback paths on
unrecognized systems (admittedly the fallbacks are not very good for
OSes that are neither windows nor *nix...). Another solution would
be a malloc(1) on a per-thread basis, with no corresponding free. I'm
not familiar with "guid", but it appears to refer to a variety of
different algorithms for generating 128 bit numbers intended to be
unique across process boundaries and often network boundaries.
Apparently I would call UuidCreateSequential on win32, I'm not sure
what I would do on *nix? But that isn't really looking any better to
me in terms of portability or speed than the other options. Though it
could replace the RtlGenRandom and/or /dev/random usage on program
start.

> > The typical multithreaded user writes something like:

>
> > typedef PractRand::RNGs::Polymorphic::jsf64 RNG_TYPE;
> > __thread RNG_TYPE rng(PractRand::SEED_AUTO);

>
> > ...and is done, with the same basic benefits as your approach. The
> > more sophisticated user can do more sophisticated things though, no
> > one will be fooled in to thinking that a nondeterministic case should
> > behave deterministically, and serializing the RNG state for later
> > reproduction will work properly (which will let some of my more exotic
> > RNG tests behave correctly on it).

>
> I'll admit that the interface (if you can call it that) for the form
> of rng I've proposed isn't well-suited for C++. However, since I
> don't like C++ very much, I consider that a feature rather than a
> bug. :-P
>

I don't think it has issues with C++ really. Well, I suppose arguably
with object orientation in general since its state can't (or
shouldn't) really be treated like a normal object. But C++ would
handle it fine a simple function instead of a class. What I would say
is that it lacks reproducibility (via either manual seeding or state
serialization). RNG users sometimes want reproducibility for
debugging purposes, network games and apps, hashing purposes, probably
a few other uses I can't think of atm.
>
>
>
>
>
> > > Notice how it isn't using the adc instruction at all. It also seems
> > > to be rather poor code. :-( The above benchmarks as taking 5.8s to do
> > > a billion random numbers. The asm version I gave takes 4.2s Note
> > > that the fastest of the other rng's you gave takes 4.6s on this
> > > machine. I don't begrudge the compile much though... even simple
> > > rearrangements of the asm instructions cause the time taken to vary by
> > > up to 30%. It is very hard to come up with the right ordering to
> > > overlap memory accesses with computation.

>
> > 4.6 seconds, ow! I know performance at the high end gets weird, but
> > still I had hoped that at least one of the RNGs I posted would do
> > better. Several of them reliably take less then 2 seconds for that on
> > both my Core i5 2500 and my old Core2 3 GHrz using both 64 bit gcc and
> > less than twice that using 32 bit MSVC. All w/o asm.

>
> My computer is old and slow. Performance is relative, remember.


Yeah. Sigh. Even just looking at it as relative performance... Is
your optimizer allowed to inline your stuff and/or remap register
names, or does it have to stick to calling conventions and/or the code
as you wrote it? On my benchmarks, if I force it to not inline the
calls then they hit a performance maximum around 300 million calls per
second (though it varies quite a bit, from a low around 100 to a high
around 500, it's usually between 250 and 325), regardless of how fast
the RNG itself is. Even return 0 will get capped at that rate. If I
let it inline, even then it will still hit a cap around 1.1 billion
calls per second even for very simple RNGs like "int random() {int tmp
= a; a += b; return tmp;}". A simple "return 0;" is faster still, but
by so much that I suspect it optimizes away almost the entire calling
loop in that instance. And a few of the RNGs I've posted in this
thread get 0.7 to 1.0 billion calls per second here - just under what
looks like the maximum possible performance on this CPU / compiler
pair. MSVC results are very similar, though it will inline even
across compilation unit boundaries, unlike gcc (4.5.4) - I only see
the lower cap on MSVC if the call goes through a vtable.
In my attempts to achieve more meaningful performance measures I've
tried to look for aggregate test results across multiple compilers,
multiple CPU generations, multiple structures of calling loops, etc.
I had thought it helped a little, but it's hard to say really, and
often that gets tiresome and I'll skip most of it for extended periods
of time.

> > > Once I find a version with slightly better properties, I'll think
> > > about finding something the C compiler can convert into good asm.
> > > Previously, the trick was to use a union with an intrinsic 128 bit
> > > unsigned integer type.

>
> > > > ---
> > > > BTW, side note, is that the correct usage of movabs? In your code it
> > > > is used to move a constant in to a register, but my quick googling of
> > > > it suggested that it was only for reading from memory at absolute
> > > > addresses? Really I don't know much about x86-64 asm so I could
> > > > easily have just misunderstood what I read.

>
> > > "movabs "is the instruction gcc uses to move a 64 bit immediate into a
> > > 64 bit register, as you can see from the above. Don't ask me why they
> > > call it that rather than just plain old "mov". :-/

>
> > > Steven

>
> > The following variant of that:
> > Word tmp = high;
> > tmp ^= tmp >> (OUTPUT_BITS / 2);
> > tmp *= K;
> > tmp ^= rotate(tmp ^ low, OUTPUT_BITS/3);
> > Word old = low;
> > low += K;
> > high += old + ((low < K) ? 1 : 0);
> > return tmp;

>
> > ...does much better on statistical tests when used on 32 bit integers
> > instead of 64 bit integers. For the full 64 bit versions there is no
> > way to distinguish quality as both that and the original pass all
> > tests I've tried (when the xored value wasn't low-entropy), but
> > probably the 32 bit version is representative. It still has the
> > interstate correlation issues, but those are very hard to get rid of
> > on fast random access RNGs, and no worse than the ones in the CLCG I'm
> > looking to replace. Adding a second multiply and another xorshift
> > gets the interstate correlation issue under control without hurting
> > performance intolerably - users who want performance faster than that
> > can damn well use a non-random-access RNG. I might also try combining
> > successive elements like my hashed LFSR did to see if I can eliminate
> > the need for a 2nd multiply that way. I'm thinking of adding a third
> > word of state doing something dumb to get the statespace a little
> > bigger, but at this point really it's looking good enough to be worth
> > replacing the CLCG.

>
> I tried something much like the above, and it wasn't very good. It
> either had poor randomness, or failed to be fast enough to bother
> with.
>

That change had no performance impact on my computer (it's becoming
increasing apparent that we see very different performance impact from
changes), and did well on statistical tests when used on 32 bit
integers where the original (when changed to use 32 bit integers) did
poorly (IIRC failing after 5 minutes or so). A few other tweaks got
it to the point of passing an 8 hour test on 32 bit, I didn't try any
longer tests.
>
> > Hey, do you have any problems with people using any of this stuff,
> > modifying it, etc, potentially in closed source or commercial apps? I
> > try to keep all of my recommended RNGs public domain or very close to
> > public domain.

>
> No problems. It is just math, after all.
>
> Steven


Excellent, I'll probably be using something in the general ballpark of
this stuff for random access RNGs for the time being, until I can
figure out a way to get good random access without multiplication.



On Jan 10, 5:19*pm, sfuerst <svfue...@nospicedham.gmail.com> wrote:
> On Jan 5, 7:42*pm, sfuerst <svfue...@nospicedham.gmail.com> wrote:
>
>
>
>
>
>
>
>
>
> > This is the rng I'm investigating now. *(Still running BigCrush tests
> > on various 32bit functions of the 64 bit output.) * It doesn't fit all
> > of your criteria, but it fits most, and is very very fast.

>
> > It is a hash of the output of a 128 bit LCG. *The LCG uses a
> > multiplier of 2^64 + 1, so is easy to step. *Fast forwarding, and
> > reversing are trivial. *The hash is the most complex thing I've
> > managed to stuff into the cycles spent accessing L1 in the LCG update.

>
> > My constraints are a little different than yours, so more is effort
> > has been "spent" making sure parallel users can be assured of
> > different sequences. *(Hence the imul as the fourth instruction.)

>
> > .align 16
> > .globl rng_new
> > .type *rng_new,@function
> > rng_new:
> > * * * * movabs *$0x6595a395a1ec531b, %rcx

>
> > * * * * mov * *8(%rdi), %rax

>
> > * * * * # Or you could have xor %rsi, %rax if process number isthe second
> > parm
> > * * * * xor * *%rdi, %rax
> > * * * * imul * %rcx, %rax

>
> > * * * * mov * *(%rdi), %rdx * # The LCG update
> > * * * * add * *%rcx, (%rdi)
> > * * * * adc * *%rdx, 8(%rdi)

>
> > * * * * xor * *%rax, %rdx
> > * * * * ror * *$29, %rdx * *# Can use a rotate here dueto the extra entropy
> > from %rdx
> > * * * * xor * *%rdx, %rax
> > * * * * mov * *%rax, %rdx
> > * * * * shl * *$31, %rdx
> > * * * * xor * *%rdx, %rax
> > * * * * mov * *%rax, %rdx
> > * * * * shr * *$23, %rdx
> > * * * * xor * *%rdx, %rax
> > * * * * retq
> > .size * rng_new, .-rng_new

>
> > I've also looked at 192 and 256 bit variants, but getting a good fast
> > hash, or good fast LCG with that many bits is seemingly futile. *I'm
> > willing to be proved wrong though. :-)

>
> > Steven

>
> Okay... after quite a bit of effort, I've found something much, much
> better.
>
> This is three different rng's in one:
>
> .align 16
> .globl rng_hash_128
> .type *rng_hash_128,@function
> rng_hash_128:
> * * * * mov * *8(%rdi), %rax
>
> * * * * movabs $0x6595a395a1ec531b, %rcx
>
> # * Option 1)
> # * Non-threaded, fastest. *No xor instruction used.
>
> # * Option 2)
> # * Threaded, use the address of the seed as a nonce.
> # * * *xor * *%rdi, %rax
>
> # * Option 3)
> # * Threaded, pass a nonce as a second parameter.
> # * * * xor * *%rsi, %rax
> # *or have it stored in a larger seed:
> # * * *xor * *16(%rdi), %rax
>
> * * * * mov * *(%rdi), %rsi
> * * * * mul * *%rcx
>
> * * * * add * *%rcx, (%rdi)
> * * * * adc * *%rsi, 8(%rdi)
>
> * * * * xor * *%rsi, %rax
> * * * * xor * *%rdx, %rax
> * * * * mul * *%rcx
> * * * * add * *%rsi, %rax
> * * * * add * *%rdx, %rax
>
> * * * * retq
> .size * rng_hash_128, .-rng_hash_128
>
> For non-threaded use, pick the one without the extra xor instruction.
> This is fastest, taking 3.77s to output 10^9 64bit random numbers on
> this machine.


I get about 2.5 GB/s when I stick that in as inline asm on gcc. About
300 million calls per second, about 10 cycles per call. I used this
code:
Uint64 rv;
asm volatile (
"mov 8(%%rdi), %%rax \n"
"movabs $0x6595a395a1ec531b, %%rcx \n"
"mov (%%rdi), %%rsi \n"
"mul %%rcx \n"
"add %%rcx, (%%rdi) \n"
"adc %%rsi, 8(%%rdi) \n"
"xor %%rsi, %%rax \n"
"xor %%rdx, %%rax \n"
"mul %%rcx \n"
"add %%rsi, %%rax \n"
"add %%rdx, %%rax \n"
:"=a"(rv)//outputs
:"D"(this)//inputs
:"memory", "%rcx", "%rsi", "%rdx"//clobbers
);
return rv;
If I left off the "volatile" keyword then the optimizer seemed to go
berserk on non-vtable calls to it for some reason, though vtable calls
remained the same. Probably I screwed up the constraints somehow.

I also tried implementing it in C, but neither gcc nor MSVC will let
me use a 128 bit integer. I did implement it in C by scaling it down
to 32 bit instead of 64 bit and compiling on a 32 bit compiler. It
wasn't particularly fast that way relative to other 32 bit RNGs.

Count it as one more bit of speed-for-you != speed-for-me, not even
relative speed. Either that or my inline asm sucked, and my compiler
sucked too.

> If you have multiple threads, then pick option two. *The rng will use
> the different addresses of the seeds to make sure that each thread
> gets a different output sequence. *This takes 4.05s to output 10^9
> 64bit random numbers on this machine.
>
> If you are using multiple computers, or require multithreaded
> repeatability, then pick option three. *Pass a variable describing the
> thread/machine-number to the rng. *(Another option is to store this
> extra state into a slightly expanded seed, and use "xor * *16(%rdi),
> %rax" instead.)
>
> In effect, options two and three choose a 2^128 subsequence from a
> 2^192 element sequence.
>
> This rng passes TestU01 BigCrush with the top 32 bits, bottom 32 bits,
> middle 32 bits, and alternating bottom and top 32 bit bitstream, with
> a total of zero failures.


My testing is reporting good quality for it as well. I'm testing both
that algorithm in asm, plus that algorithm in C scaled down to operate
on 32 bit integers instead of 64.

> It works by hashing the output of a 128 bit LCG, with a simple
> multiplier of 2^64 + 1 and addend of a "random" 64 bit odd constant.
> Feel free to try some other constant. *As long as it has roughly half
> the bits set, and is odd, it should work well. * Since the base is a
> LCG, fast-forwarding or reversing the rng is trivial.
>
> The hash is based on merging the upper and lower 64 bit halves of two
> 64x64->128 bit multiplies. *This loses entropy... however since we
> also merge in another part of the seed, the extra entropy from that
> compensates. *The result is an irreversible hash that seems to have
> very good statistical properties.
>
> Why is this rng so fast?
>
> Most normal random number generators use a mapping x_new = f(x), where
> the function f defines how the seed state x is altered.
>
> This rng instead of outputting (part of) the seed, outputs a hash g(x)
> of it instead. *Thus f(x), and g(x) need to be calculated in every
> call. *This looks obviously slower than the other method. *However,
> two things save us. *The first is that modern processors are out-of-
> order, and can execute many instructions in parallel. *Thus the
> calculation of f() and g() can overlap. *The second is that the
> constraints of f() and g() are different.
>
> f() needs to map to a large cycle of states, and have good statistical
> randomness. *g() just needs to have good statistical randomness.
>
> So by picking a very fast f(), with poor randomness, and a good-enough
> g() we can obtain a very fast rng that still passes stringent
> statistical tests. *The problem is finding a good fast hash.
> Fortunately, the extended-precision multiply used above works for g().
> (A fast yet poor quality LCG easily works for f() )
>
> Note: when re-implementing the above, be careful with the instruction
> ordering. *Even small rearrangements can alter the speed by tens of
> percent.
>
> Steven


On the complex-output-functions = better-parallelism argument I'm
pretty meh. Yeah, the output function can be done in parallel with
the state transition function. Yeah, the output function can be done
in parallel with the next output function if the state transition is
fast enough and the calling code is nice enough and the scheduler is
psychic enough. But there are severe limits to how much can be done
at once, both due to limits of execution resources (large integer
multiplication in particular takes a huge transistor & power budget)
and due to scheduling difficulty. My understanding is that most CPUs
can't maintain a IPC more than 4 on any remotely real world code no
matter how parallel it is, far lower if lots of complex operations
like large integer multiplication are involved. Notice how recent CPU
generations haven't really been getting significantly more
superscalar? These days heat is the fundamental limiting factor, and
executing all those operations will produce the same amount of heat
regardless of how how smart the scheduler is... in fact a smarter
scheduler will just add overhead to the thermal budget, so the next
few CPU generations are not likely to emphasize OOO any more than the
last few have (in the desktop world anyway...).

Also, most fast RNGs with simple output functions have highly parallel
state transition functions, typically 2 to 4 simple operations. Your
code there is a critical path of 2 simple operations on the state
transition function, and 6 operations on the output function, two of
which are high latency instructions. mul 64x64->128 is reported to be
5 cycle latency on recent AMD, Intel's most recent optimization guide
seems to claim a latency of 14-18 cycles on that but probably I'm
misreading that somehow as it seems improbably high... a quick
googling suggests it might be closer to 3 cycles. Since you have two
of them in your critical path I end up with a best-case performance on
the order of 6-10 cycles). If the output never actually gets used,
that's a different story... the simple transition function might
dominate then.
Reply With Quote
  #64 (permalink)  
Old 01-11-2012, 10:39 AM
orz
Guest
 
Posts: n/a
Default Re: Fast and small Multiplicative Random number generator

On Jan 10, 5:19*pm, sfuerst <svfue...@nospicedham.gmail.com> wrote:
> Note: when re-implementing the above, be careful with the instruction
> ordering. *Even small rearrangements can alter the speed by tens of
> percent.


forgot to ask: what CPU are you using?
Reply With Quote
  #65 (permalink)  
Old 01-11-2012, 11:59 PM
orz
Guest
 
Posts: n/a
Default Re: Fast and small Multiplicative Random number generator

On Jan 11, 3:37*am, orz <cdh...@nospicedham.gmail.com> wrote:
> Also, most fast RNGs with simple output functions have highly parallel
> state transition functions, typically 2 to 4 simple operations. *Your
> code there is a critical path of 2 simple operations on the state
> transition function, and 6 operations on the output function, two of
> which are high latency instructions. *mul 64x64->128 is reported to be
> 5 cycle latency on recent AMD, Intel's most recent optimization guide
> seems to claim a latency of 14-18 cycles on that but probably I'm
> misreading that somehow as it seems improbably high... a quick
> googling suggests it might be closer to 3 cycles. *Since you have two
> of them in your critical path I end up with a best-case performance on
> the order of 6-10 cycles). *If the output never actually gets used,
> that's a different story... the simple transition function might
> dominate then.


I was a bit tired when I wrote that, lots of typos and mistakes though
I stand by the general concept. The talk of latencies should have
mentioned throughputs and ratios of throughputs to latencies. Intel
claims that mul 64x64->128 should have a latency of 4 to 10 (usually
10) cycles on core 7 or 7 cycles on core 2, and a throughput of 1
cycle on core 7 or 3 cycles on core 2. The ratio of those latencies
to throughputs ranges from 2.3 to 10.0. Which implies that executing
sequential output functions in parallel with each other is a
significant possibility for that RNG. BTW, add-with-carry is listed
as having significantly worse latency and throughput than regular
add... add at 1 latency and 0.33 throughput on all core7s and core2s,
adc at 2 latency and 1 to 1.5 throughput, cmp+cmov+add at 4 latency,
1.16 to 1.66 throughput (adding the throughputs of the individual
instructions, though technically that's not correct).

If we assume that the scheduling is perfect and there's no extra
complications from the calling code or anything, then what matters is
the larger of the overall functions throughput and the state
transition functions critical path latency. For that RNG, that looks
like 3 cycles latency for the state transition function on most of
Intel's desktop CPUs, and an overall throughput in the neighborhood of
~5 cycles on core7 or ~8 cycles on core2. Compare that to, say, the
RNG on Robert Jenkins small fast RNG design page, with a critical path
on the state transition function of 1 rotate and 2 additions, and a
grand total of 4 additions, 1 xor, and 2-3 rotates - a critical path
latency of 3 cycles and an overall throughput of a little less than 3
cycles.

Of course, that's all based upon massively simplified models of the
CPUs performance - reality rarely conforms to it. But since our main
alternative seems to be benchmarks that yield very different results
on different computers...
Reply With Quote
  #66 (permalink)  
Old 01-12-2012, 01:28 AM
sfuerst
Guest
 
Posts: n/a
Default Re: Fast and small Multiplicative Random number generator

On Jan 11, 3:39*am, orz <cdh...@nospicedham.gmail.com> wrote:
> On Jan 10, 5:19*pm, sfuerst <svfue...@nospicedham.gmail.com> wrote:
>
> > Note: when re-implementing the above, be careful with the instruction
> > ordering. *Even small rearrangements can alter the speed by tens of
> > percent.

>
> forgot to ask: what CPU are you using?


An Opteron 280. It obviously has quite a different behaviour than the
Intel machine you are testing on.

Steven
Reply With Quote
  #67 (permalink)  
Old 01-12-2012, 01:46 AM
hopcode
Guest
 
Posts: n/a
Default Re: Fast and small Multiplicative Random number generator

something gone wrong by posting with google...
ok, i post it again

On 12.01.2012 01:59, orz wrote:
> Of course, that's all based upon massively simplified models of the
> CPUs performance - reality rarely conforms to it. But since our main
> alternative seems to be benchmarks that yield very different results
> on different computers...


i think struggling on single CPU instructions is not
so worth the while nowadays. SIMD SSE may give
some advantages, when using only the CPU.
btw, an exaustive paper benchmarking the Park-Miller using CUDA.

http://www.cs.ucl.ac.uk/staff/W.Lang...2009_CIGPU.pdf

and B.Langdon's publications, interesting imho.

https://iris.ucl.ac.uk/research/publ...WBLAN93&page=1

Cheers,
Reply With Quote
  #68 (permalink)  
Old 01-12-2012, 01:56 AM
sfuerst
Guest
 
Posts: n/a
Default Re: Fast and small Multiplicative Random number generator

On Jan 11, 3:37*am, orz <cdh...@nospicedham.gmail.com> wrote:

> > > You effectively get an extra 4-8 bytes of static state, without having
> > > to store it in memory or even read it from memory, and those extra
> > > bits of state are guaranteed to never share the same value as any
> > > other RNG that is instantiated at the same time in the same address
> > > space.

>
> > I know. *It feels like cheating... *:-) *It is the only way I know to
> > get a random number generator to have a larger state-space than that
> > bounded by the size of its seed. *Breaking fundamental limits is
> > always entertaining.

>
> Actually, on second thought I want to qualify that slightly - you have
> no control over the value of that state, so you have to
> pessimistically assume it will be near-zero entropy, effectively far
> less than than a pointer size worth of entropy. *Really, the only
> thing you can count on is uniqueness (for the life the RNG instance).
> Which is, admittedly, pretty sweet. *But I shouldn't have described it
> in terms of the pointer size, size the pointer size is only a measure
> of the maximum entropy involved, and has very little to do, in
> practice, with the actual entropy involved.


Yep. All the testing I have done has assumed that number was zero.
If it is anything else, that's entropic gravy. Unfortunately, I
started discussing the other rather poor rng before testing had
completed, which confused things. The preliminary results did look
good at the time... (which is a rather bad excuse, but anyway...)

> > > My approach shares some similarities but is pretty different:
> > > 1. I have a large enough statespace so that initializing them to a
> > > random state is probabilistically good enough to eliminate
> > > correlations over a reasonable domain, regardless of the nature of the
> > > domain. *That generally means at least 3 variables worth of state for
> > > 16 to 64 bit RNGs, preferably 4 or 5. *And definitely more for 8 bit
> > > RNGs.
> > > 2a. I supply an automatic seeding mechanism to users. *They don't have
> > > to use it, but I encourage its use for cases where neither crypto
> > > security nor reproducible behavior is required. *They can still
> > > reproduce the RNG output by copying or serializing the RNG state, just
> > > not by saving the seed since there is no visible seed in the
> > > autoseeding case. *The automatic seeding mechanism initializes an
> > > internal RNG with a function of the program start time, the current
> > > time, the current value of the stack pointer (which is unique per-
> > > thread for the life of the thread), the address of the RNG it was
> > > called on (the same basic thing you were using!, but without the
> > > problems), the number of RNGs initialized within the current thread
> > > (if TLS variables are supported on the platform), a few other things,
> > > and on windows some platform specific stuff (I ought to add platform
> > > specific stuff for unix sometime...). *The internal RNG then
> > > initializes the users RNG to a random state, in a fashion that the
> > > users RNG algorithm can customize if it needs to (to preclude some
> > > states, for instance). *Some of those are precomputed at application
> > > launch time, others are done at autoseeding time. *The internal RNG
> > > algorithm is chosen to be semi-fast (on 64 bit anyway... slower on
> > > other CPUs), very high quality, guaranteed to have no short cycles,
> > > have high quality seeding, be quick to seed, and be capable of seeding
> > > from oddly structured data.
> > > 2b. I also supply some more complex tools for high quality seeding for
> > > cryptographic users and power users, who hopefully know what they're
> > > doing. *And I also supply a very simple seed-from-a-64-bit-integer
> > > interface, which sometimes maps to an RNGs reference seeding algorithm
> > > (if it has one that uses integers in the range supplied) and other
> > > times maps to something similar to the autoseeder except
> > > reproducible. *All RNGs get all of those seeding mechanisms on an
> > > identical interface. *Additionally many RNGs also offer other custom
> > > seeding mechanisms, and a few have default states that they go to
> > > deterministically if not seeded. *Those without default states require
> > > a seed in their constructor to prevent accidents unless the user
> > > explicitly requests otherwise (by passing a magic token to the
> > > constructor) or is using the "raw" version of the RNG without the
> > > normal interface. *If an RNG that does not have a default state is
> > > used without being seeded then its behavior is not defined.

>
> > This is the "standard" technique for parallel generators... so there
> > is nothing much wrong with it. *That doesn't mean that other (perhaps
> > simpler) methods don't exist. *Sometimes the new methods do end up
> > being better... *That's how progress happens. :-)

>
> Is it? *I've seen several of the individual elements of my scheme in
> other packages (TestU01, GSL, Boost, etc), but never more than one or
> two of the elements in a single package, and a few elements I've never
> seen elsewhere. *I've seen your xor-with-address scheme several times
> on the web, always combined with xorshift+multiply pairs, but never in
> a library that I can recall.


The "standard" probably is the SPRNG library, which seems to use the
state-space splitting technique. The other thing that pops up when
searching for parallel rng is my code, which uses the xor followed by
xorshift + multiply.

> > > No interthread or interprocess or intermachine communication is
> > > required. *It mostly works equally well for sequential RNG
> > > instantiations in the same thread, in different threads, on sequential
> > > runs, on simultaneous runs, on distributed machines, etc. *The weakest
> > > case at the moment is if a thread is created, then destroyed, then
> > > very quickly recreated with the same stack address and TLS address.
> > > Simultaneous launches on non-windows platforms are also weak. *Every
> > > other case that I can think of is strong, though not up to
> > > cryptographic standards. *The platform-specific stuff is done just
> > > once per application launch. *The requirements are:
> > > 1. Either the library initialization function or the first automatic
> > > seeding is performed before additional threads are launched. *That
> > > will tend to naturally be true for programs that don't go around doing
> > > crazy things with the launch and linking processes.
> > > 2. the degree of lack of correlation desired across the domain not
> > > exceed the square root of the statespace size. *That is almost
> > > identical to a fundamental requirement for high quality RNGs that
> > > would apply regardless, so it's no big deal.

>
> > You can fix the rapid recreate problem by using a guid. *Provided that
> > the guid updates rapidly enough, there is no way other users can have
> > the same initial seed state. *The trick is to make the creation stall
> > long enough to fulfill this speed limitation. *Since thread creation
> > is (meant to be) slow and rare, this isn't much of a constraint in
> > practice.

>
> If fixed the non-windows issues since my last post. *I'm still not
> sure what to do about the thread recreation issue. *One solution would
> be to use some sort of locking mechanism or atomic math, maybe
> EnterCriticalSection or InterlockedIncrement on Win32/MSVC. *But I'd
> like to stay very portable... I currently use no libraries aside from
> the C++ standard library (and, on windows, kernel32.lib), and nothing
> compiler or OS or ISA specific without ifdefs and fallback paths on
> unrecognized systems (admittedly the fallbacks are not very good for
> OSes that are neither windows nor *nix...). * Another solution would
> be a malloc(1) on a per-thread basis, with no corresponding free. *I'm
> not familiar with "guid", but it appears to refer to a variety of
> different algorithms for generating 128 bit numbers intended to be
> unique across process boundaries and often network boundaries.
> Apparently I would call UuidCreateSequential on win32, I'm not sure
> what I would do on *nix? *But that isn't really looking any better to
> me in terms of portability or speed than the other options. *Though it
> could replace the RtlGenRandom and/or /dev/random usage on program
> start.


A "guid" is a globally unique ID. Basically, it is some algorithm or
another that allows you to create a number (128 bits or higher,
usually) that is guaranteed to be unique. They use things like cycle
count and network MAC address to do it. If you use the 128 bits to
initialize part of the seed, that leaves many other bits for state-
space for the rng to work.

i.e. With the 128 bits, store into the upper 64 bit quantity in the
LCG, and into the 64 bits of "nonce", in rng_hash_128. Set the lower
64 bits of the LCG to be zero. This initialization gives you 2^64
random numbers before the rng can get into a state that another
instance could possibly be in.

>
> > > > Notice how it isn't using the adc instruction at all. *It also seems
> > > > to be rather poor code. :-( *The above benchmarks as taking 5.8s to do
> > > > a billion random numbers. *The asm version I gave takes 4.2s *Note
> > > > that the fastest of the other rng's you gave takes 4.6s on this
> > > > machine. *I don't begrudge the compile much though... even simple
> > > > rearrangements of the asm instructions cause the time taken to varyby
> > > > up to 30%. *It is very hard to come up with the right ordering to
> > > > overlap memory accesses with computation.

>
> > > 4.6 seconds, ow! *I know performance at the high end gets weird, but
> > > still I had hoped that at least one of the RNGs I posted would do
> > > better. *Several of them reliably take less then 2 seconds for thaton
> > > both my Core i5 2500 and my old Core2 3 GHrz using both 64 bit gcc and
> > > less than twice that using 32 bit MSVC. *All w/o asm.

>
> > My computer is old and slow. *Performance is relative, remember.

>
> Yeah. *Sigh. *Even just looking at it as relative performance... *Is
> your optimizer allowed to inline your stuff and/or remap register
> names, or does it have to stick to calling conventions and/or the code
> as you wrote it?


I deliberately made it so it couldn't. The asm is in a separate file
in my tests. The C equivalent is to use the noinline and noclone gcc
attributes. If you allow inlining, then benchmarking becomes
meaningless. The optimizer can do all sorts of things, including
eliding parts of the rng entirely. :-/

Since "typical" use will be via a function call into a library, this
isn't unrealistic.

> *On my benchmarks, if I force it to not inline the
> calls then they hit a performance maximum around 300 million calls per
> second (though it varies quite a bit, from a low around 100 to a high
> around 500, it's usually between 250 and 325), regardless of how fast
> the RNG itself is.


I don't see this. The difference between the slow and fast rng is
quite pronounced here. There also is very little variability in the
timings from one run to another. What does vary greatly is the effect
of instruction ordering. If I rearrange a couple of instructions, the
results can change by 20% or more. This makes optimizing a chore, as
many combinations need to be tested in order to find the best.

> *Even return 0 will get capped at that rate. *If I
> let it inline, even then it will still hit a cap around 1.1 billion
> calls per second even for very simple RNGs like "int random() {int tmp
> = a; a += b; return tmp;}". *A simple "return 0;" is faster still, but
> by so much that I suspect it optimizes away almost the entire calling
> loop in that instance.


That's the problem. You need to avoid letting the optimizer see the
rng code when it makes the code for the benchmark loop. Otherwise the
results are contaminated.

> > > The following variant of that:
> > > * * * * Word tmp = high;
> > > * * * * tmp ^= tmp >> (OUTPUT_BITS / 2);
> > > * * * * tmp *= K;
> > > * * * * tmp ^= rotate(tmp ^ low, OUTPUT_BITS/3);
> > > * * * * Word old = low;
> > > * * * * low += K;
> > > * * * * high += old + ((low < K) ? 1 : 0);
> > > * * * * return tmp;

>
> > > ...does much better on statistical tests when used on 32 bit integers
> > > instead of 64 bit integers. *For the full 64 bit versions there is no
> > > way to distinguish quality as both that and the original pass all
> > > tests I've tried (when the xored value wasn't low-entropy), but
> > > probably the 32 bit version is representative. *It still has the
> > > interstate correlation issues, but those are very hard to get rid of
> > > on fast random access RNGs, and no worse than the ones in the CLCG I'm
> > > looking to replace. *Adding a second multiply and another xorshift
> > > gets the interstate correlation issue under control without hurting
> > > performance intolerably - users who want performance faster than that
> > > can damn well use a non-random-access RNG. *I might also try combining
> > > successive elements like my hashed LFSR did to see if I can eliminate
> > > the need for a 2nd multiply that way. *I'm thinking of adding a third
> > > word of state doing something dumb to get the statespace a little
> > > bigger, but at this point really it's looking good enough to be worth
> > > replacing the CLCG.

>
> > I tried something much like the above, and it wasn't very good. *It
> > either had poor randomness, or failed to be fast enough to bother
> > with.

>
> That change had no performance impact on my computer (it's becoming
> increasing apparent that we see very different performance impact from
> changes), and did well on statistical tests when used on 32 bit
> integers where the original (when changed to use 32 bit integers) did
> poorly (IIRC failing after 5 minutes or so). *A few other tweaks got
> it to the point of passing an 8 hour test on 32 bit, I didn't try any
> longer tests.


As I've said... even small changes to the orderings of instructions
can give large relative changes in timing here. It is really
annoying.


>
> > Okay... after quite a bit of effort, I've found something much, much
> > better.

>
> > This is three different rng's in one:

>
> > .align 16
> > .globl rng_hash_128
> > .type *rng_hash_128,@function
> > rng_hash_128:
> > * * * * mov * *8(%rdi), %rax

>
> > * * * * movabs $0x6595a395a1ec531b, %rcx

>
> > # * Option 1)
> > # * Non-threaded, fastest. *No xor instruction used.

>
> > # * Option 2)
> > # * Threaded, use the address of the seed as a nonce.
> > # * * *xor * *%rdi, %rax

>
> > # * Option 3)
> > # * Threaded, pass a nonce as a second parameter.
> > # * * * xor * *%rsi, %rax
> > # *or have it stored in a larger seed:
> > # * * *xor * *16(%rdi), %rax

>
> > * * * * mov * *(%rdi), %rsi
> > * * * * mul * *%rcx

>
> > * * * * add * *%rcx, (%rdi)
> > * * * * adc * *%rsi, 8(%rdi)

>
> > * * * * xor * *%rsi, %rax
> > * * * * xor * *%rdx, %rax
> > * * * * mul * *%rcx
> > * * * * add * *%rsi, %rax
> > * * * * add * *%rdx, %rax

>
> > * * * * retq
> > .size * rng_hash_128, .-rng_hash_128

>
> > For non-threaded use, pick the one without the extra xor instruction.
> > This is fastest, taking 3.77s to output 10^9 64bit random numbers on
> > this machine.

>
> I get about 2.5 GB/s when I stick that in as inline asm on gcc. *About
> 300 million calls per second, about 10 cycles per call. *I used this
> code:
> Uint64 rv;
> asm volatile (
> * * * * "mov * *8(%%rdi), %%rax \n"
> * * * * "movabs $0x6595a395a1ec531b, %%rcx \n"
> * * * * "mov * *(%%rdi), %%rsi \n"
> * * * * "mul * *%%rcx \n"
> * * * * "add * *%%rcx, (%%rdi) \n"
> * * * * "adc * *%%rsi, 8(%%rdi) \n"
> * * * * "xor * *%%rsi, %%rax \n"
> * * * * "xor * *%%rdx, %%rax \n"
> * * * * "mul * *%%rcx \n"
> * * * * "add * *%%rsi, %%rax \n"
> * * * * "add * *%%rdx, %%rax \n"
> * * * * :"=a"(rv)//outputs
> * * * * :"D"(this)//inputs
> * * * * :"memory", "%rcx", "%rsi", "%rdx"//clobbers
> );
> return rv;
> If I left off the "volatile" keyword then the optimizer seemed to go
> berserk on non-vtable calls to it for some reason, though vtable calls
> remained the same. *Probably I screwed up the constraints somehow.


Why on earth is a vtable required? Can't you just use a normal
function call? That's what I do. The asm is compiled in a
separate .s file, and then linked in later.

> I also tried implementing it in C, but neither gcc nor MSVC will let
> me use a 128 bit integer. *I did implement it in C by scaling it down
> to 32 bit instead of 64 bit and compiling on a 32 bit compiler. *It
> wasn't particularly fast that way relative to other 32 bit RNGs.


Are you in 32 bit mode? You can use 128 bit integers in 64 bit mode
with gcc. If you are using 32 bits, you are out of luck. MSVC has
intrinsics which allow the use of 64x64->128 multiplies in 64 bit
mode.

>
> > So by picking a very fast f(), with poor randomness, and a good-enough
> > g() we can obtain a very fast rng that still passes stringent
> > statistical tests. *The problem is finding a good fast hash.
> > Fortunately, the extended-precision multiply used above works for g().
> > (A fast yet poor quality LCG easily works for f() )

>
> > Note: when re-implementing the above, be careful with the instruction
> > ordering. *Even small rearrangements can alter the speed by tens of
> > percent.

>
> > Steven

>
> On the complex-output-functions = better-parallelism argument I'm
> pretty meh. *Yeah, the output function can be done in parallel with
> the state transition function. *Yeah, the output function can be done
> in parallel with the next output function if the state transition is
> fast enough and the calling code is nice enough and the scheduler is
> psychic enough. *But there are severe limits to how much can be done
> at once, both due to limits of execution resources (large integer
> multiplication in particular takes a huge transistor & power budget)
> and due to scheduling difficulty. *My understanding is that most CPUs
> can't maintain a IPC more than 4 on any remotely real world code no
> matter how parallel it is, far lower if lots of complex operations
> like large integer multiplication are involved. *Notice how recent CPU
> generations haven't really been getting significantly more
> superscalar? *These days heat is the fundamental limiting factor, and
> executing all those operations will produce the same amount of heat
> regardless of how how smart the scheduler is... in fact a smarter
> scheduler will just add overhead to the thermal budget, so the next
> few CPU generations are not likely to emphasize OOO any more than the
> last few have (in the desktop world anyway...).
>
> Also, most fast RNGs with simple output functions have highly parallel
> state transition functions, typically 2 to 4 simple operations. *Your
> code there is a critical path of 2 simple operations on the state
> transition function, and 6 operations on the output function, two of
> which are high latency instructions. *mul 64x64->128 is reported to be
> 5 cycle latency on recent AMD, Intel's most recent optimization guide
> seems to claim a latency of 14-18 cycles on that but probably I'm
> misreading that somehow as it seems improbably high... a quick
> googling suggests it might be closer to 3 cycles. *Since you have two
> of them in your critical path I end up with a best-case performance on
> the order of 6-10 cycles). *If the output never actually gets used,
> that's a different story... the simple transition function might
> dominate then.


Be careful... I'm stating what I'm measuring, not what is
theoretically happening. For example, if I remove the final multiply
+ xor sequence, the time taken doesn't change. It appears that the
time taken to access L1 in the add, adc chain is much longer than
naive cycle-counting would predict.

Steven
Reply With Quote
  #69 (permalink)  
Old 01-12-2012, 02:22 AM
sfuerst
Guest
 
Posts: n/a
Default Re: Fast and small Multiplicative Random number generator

On Jan 11, 4:59*pm, orz <cdh...@nospicedham.gmail.com> wrote:
> On Jan 11, 3:37*am, orz <cdh...@nospicedham.gmail.com> wrote:
>
> > Also, most fast RNGs with simple output functions have highly parallel
> > state transition functions, typically 2 to 4 simple operations. *Your
> > code there is a critical path of 2 simple operations on the state
> > transition function, and 6 operations on the output function, two of
> > which are high latency instructions. *mul 64x64->128 is reported to be
> > 5 cycle latency on recent AMD, Intel's most recent optimization guide
> > seems to claim a latency of 14-18 cycles on that but probably I'm
> > misreading that somehow as it seems improbably high... a quick
> > googling suggests it might be closer to 3 cycles. *Since you have two
> > of them in your critical path I end up with a best-case performance on
> > the order of 6-10 cycles). *If the output never actually gets used,
> > that's a different story... the simple transition function might
> > dominate then.

>
> I was a bit tired when I wrote that, lots of typos and mistakes though
> I stand by the general concept. *The talk of latencies should have
> mentioned throughputs and ratios of throughputs to latencies. *Intel
> claims that mul 64x64->128 should have a latency of 4 to 10 (usually
> 10) cycles on core 7 or 7 cycles on core 2, and a throughput of 1
> cycle on core 7 or 3 cycles on core 2. *The ratio of those latencies
> to throughputs ranges from 2.3 to 10.0. *Which implies that executing
> sequential output functions in parallel with each other is a
> significant possibility for that RNG. *BTW, add-with-carry is listed
> as having significantly worse latency and throughput than regular
> add... add at 1 latency and 0.33 throughput on all core7s and core2s,
> adc at 2 latency and 1 to 1.5 throughput, cmp+cmov+add at 4 latency,
> 1.16 to 1.66 throughput (adding the throughputs of the individual
> instructions, though technically that's not correct).
>
> If we assume that the scheduling is perfect and there's no extra
> complications from the calling code or anything, then what matters is
> the larger of the overall functions throughput and the state
> transition functions critical path latency. *For that RNG, that looks
> like 3 cycles latency for the state transition function on most of
> Intel's desktop CPUs, and an overall throughput in the neighborhood of
> ~5 cycles on core7 or ~8 cycles on core2. *Compare that to, say, the
> RNG on Robert Jenkins small fast RNG design page, with a critical path
> on the state transition function of 1 rotate and 2 additions, and a
> grand total of 4 additions, 1 xor, and 2-3 rotates - a critical path
> latency of 3 cycles and an overall throughput of a little less than 3
> cycles.
>
> Of course, that's all based upon massively simplified models of the
> CPUs performance - reality rarely conforms to it. *But since our main
> alternative seems to be benchmarks that yield very different results
> on different computers...


Measurement is easier:

This machine is 2.4GHz Opteron. It takes 3.773s to complete 10^9
calls to the single-threaded version of the rng.

2.4 * 3.773 = 9.0552

Thus it takes 9 cycles per loop iteration. If you remove the loop
overhead it'll be faster...

Don't forget the L1 access latency in your cycle count prediction.
Registers are much faster than L1 on modern machines. The L1 latency
on this machine is 3 cycles. On I7, I'm fairly sure it is 4 cycles.

Steven
Reply With Quote
  #70 (permalink)  
Old 01-12-2012, 03:58 AM
orz
Guest
 
Posts: n/a
Default Re: Fast and small Multiplicative Random number generator

On Jan 11, 6:56*pm, sfuerst <svfue...@nospicedham.gmail.com> wrote:
> > Yeah. *Sigh. *Even just looking at it as relative performance... *Is
> > your optimizer allowed to inline your stuff and/or remap register
> > names, or does it have to stick to calling conventions and/or the code
> > as you wrote it?

>
> I deliberately made it so it couldn't. *The asm is in a separate file
> in my tests. *The C equivalent is to use the noinline and noclone gcc
> attributes. *If you allow inlining, then benchmarking becomes
> meaningless. *The optimizer can do all sorts of things, including
> eliding parts of the rng entirely. :-/
>
> Since "typical" use will be via a function call into a library, this
> isn't unrealistic.
>
> > *On my benchmarks, if I force it to not inline the
> > calls then they hit a performance maximum around 300 million calls per
> > second (though it varies quite a bit, from a low around 100 to a high
> > around 500, it's usually between 250 and 325), regardless of how fast
> > the RNG itself is.

>
> I don't see this. *The difference between the slow and fast rng is
> quite pronounced here. *There also is very little variability in the
> timings from one run to another. *What does vary greatly is the effect
> of instruction ordering. *If I rearrange a couple of instructions, the
> results can change by 20% or more. *This makes optimizing a chore, as
> many combinations need to be tested in order to find the best.
>
> > *Even return 0 will get capped at that rate. *If I
> > let it inline, even then it will still hit a cap around 1.1 billion
> > calls per second even for very simple RNGs like "int random() {int tmp
> > = a; a += b; return tmp;}". *A simple "return 0;" is faster still, but
> > by so much that I suspect it optimizes away almost the entire calling
> > loop in that instance.

>
> That's the problem. *You need to avoid letting the optimizer see the
> rng code when it makes the code for the benchmark loop. *Otherwise the
> results are contaminated.


Another area where we've come from similar goals to rather different
approaches.

1. My view is that inlining is in fact representative. MSVC will
inline static library calls (if the information needed is present...
which bloats the size of the .lib files by quite a bit), having them
as separate files is insufficient. I believe Intel C can do that as
well. gcc will not so far as I know, at the moment anyway, but that
could change after a few more gcc releases, and even now it will still
inline functions defined in the headers of libraries. I usually don't
put the algorithms in the headers, but when the algorithm is
implemented in a buffered fashion I do put the buffer reading portion
in to the header with only the buffer refill logic hidden in the .cpp
file. And for the simplest of algorithms, those that have very small
code size, I've been considering moving their implementation in to the
header.

2. To prevent the calls from being optimized away, I actually perform
math and IO dependent upon the result of the calls. Admittedly very
very simple math to prevent it from reducing the performance of the
loop, and IO that is absurdly unlikely to happen to prevent it from
cluttering up the output. Unfortunately, if the RNG is really REALLY
simple (ie "return 0;" or somesuch - anything with a state transition
function that doesn't do any transitioning) then the compiler can
actually figure out the results of the math without calling the RNG
repeatedly because all calls have the same result anyway. This is
intended to be representative of an RNG user that demands maximum
performance from an RNG - they need to do something with the result or
they wouldn't be calling the RNG in the first place, but if they want
the fastest possible RNG then they must be using the results for
something with low performance impact. Unfortunately, gcc seems to
have trouble with my loop when combined with inline asm, apparently
doing the math incorrectly... probably I defined the constraints on
the inline asm incorrectly somehow.

3. Like many RNG packages aimed at researchers, I support run-time RNG
selection via polymorphic interfaces. Unlike them I also offer the
same RNG algorithms by a non-polymorphic but otherwise identical
interface for maximum performance and minimum size. I benchmark all
RNG algorithms through both interfaces.

If I limit myself to non-inline RNG calls then my best performance on
this CPU drops down to about 3 seconds per billion calls. And your
latest RNG comes in first out of the set of RNGs I currently have
active in my benchmark program, which includes many of my fastest RNGs
(though I've been optimizing more for inline performance than non-
line, since the differences are pretty small in non-inline). Which
ones are fastest in non-inline contexts is pretty weird in my testing,
even weirder than inline RNG performance I think, and has very small
performance differences.

So basically, it looks like much of our performance difference is
simply that you are measuring and optimizing for non-inline
performance (representative for library calls on gcc), while I am
mostly optimizing for inline performance (representative on some
compilers, or if the core RNG algorithm is defined in a header) though
I measure both.

> > I also tried implementing it in C, but neither gcc nor MSVC will let
> > me use a 128 bit integer. *I did implement it in C by scaling it down
> > to 32 bit instead of 64 bit and compiling on a 32 bit compiler. *It
> > wasn't particularly fast that way relative to other 32 bit RNGs.

>
> Are you in 32 bit mode? *You can use 128 bit integers in 64 bit mode
> with gcc. *If you are using 32 bits, you are out of luck. *MSVC has
> intrinsics which allow the use of 64x64->128 multiplies in 64 bit
> mode.


I tried both 64 and 32 bit gcc. And I tried consulting the docs. If
I'm reading the docs correctly, the issue is that gcc does not support
128 bit integers in C++, only in C. WTF gcc? I suppose I could try
creating a separate .c file just for sticking in RNGs with 128 bit
integers and adding some glue logic to allow it to interface with my C+
+ testing framework but... really, that doesn't sound that
entertaining. Probably if I just wait a while someone will add 128
bit integer support to gcc.

>
>
> > > So by picking a very fast f(), with poor randomness, and a good-enough
> > > g() we can obtain a very fast rng that still passes stringent
> > > statistical tests. *The problem is finding a good fast hash.
> > > Fortunately, the extended-precision multiply used above works for g()..
> > > (A fast yet poor quality LCG easily works for f() )

>
> > > Note: when re-implementing the above, be careful with the instruction
> > > ordering. *Even small rearrangements can alter the speed by tens of
> > > percent.

>
> > > Steven

>
> > On the complex-output-functions = better-parallelism argument I'm
> > pretty meh. *Yeah, the output function can be done in parallel with
> > the state transition function. *Yeah, the output function can be done
> > in parallel with the next output function if the state transition is
> > fast enough and the calling code is nice enough and the scheduler is
> > psychic enough. *But there are severe limits to how much can be done
> > at once, both due to limits of execution resources (large integer
> > multiplication in particular takes a huge transistor & power budget)
> > and due to scheduling difficulty. *My understanding is that most CPUs
> > can't maintain a IPC more than 4 on any remotely real world code no
> > matter how parallel it is, far lower if lots of complex operations
> > like large integer multiplication are involved. *Notice how recent CPU
> > generations haven't really been getting significantly more
> > superscalar? *These days heat is the fundamental limiting factor, and
> > executing all those operations will produce the same amount of heat
> > regardless of how how smart the scheduler is... in fact a smarter
> > scheduler will just add overhead to the thermal budget, so the next
> > few CPU generations are not likely to emphasize OOO any more than the
> > last few have (in the desktop world anyway...).

>
> > Also, most fast RNGs with simple output functions have highly parallel
> > state transition functions, typically 2 to 4 simple operations. *Your
> > code there is a critical path of 2 simple operations on the state
> > transition function, and 6 operations on the output function, two of
> > which are high latency instructions. *mul 64x64->128 is reported to be
> > 5 cycle latency on recent AMD, Intel's most recent optimization guide
> > seems to claim a latency of 14-18 cycles on that but probably I'm
> > misreading that somehow as it seems improbably high... a quick
> > googling suggests it might be closer to 3 cycles. *Since you have two
> > of them in your critical path I end up with a best-case performance on
> > the order of 6-10 cycles). *If the output never actually gets used,
> > that's a different story... the simple transition function might
> > dominate then.

>
> Be careful... I'm stating what I'm measuring, not what is
> theoretically happening. *For example, if I remove the final multiply
> + xor sequence, the time taken doesn't change. *It appears that the
> time taken to access L1 in the add, adc chain is much longer than
> naive cycle-counting would predict.


Yeah, reality trumps theory. But, real world RNG using code might use
the output in some way that made the next call to the RNG dependent
upon the previous ones results, from the schedulers perspective
anyway. Or do something else to make latency more important relative
to throughput. And, are you actually using the output for anything?
The CPUs scheduler acts a little bit like the compilers optimizer... I
don't *think* it can completely discard scheduled instructions that
become meaningless due to later instructions, but maybe it's smarter
than I think? And, as mentioned above, I consider inline performance
at least as representative for small code size RNGs.
Reply With Quote
  #71 (permalink)  
Old 01-12-2012, 05:25 PM
Pedro Pereira
Guest
 
Posts: n/a
Default Re: Fast and small Multiplicative Random number generator

I've modified the xorshift with a period of 2^128-1,
adding two LFSR's with period 2^32-1 and 2^31-1 and
combined them with a J-K style flip-flop.

It passed the crush/bigcrush tests and is not too slow.

Feedback is welcome!

Pedro Pereira


static uint32_t state_0 = 0xEA2CBD78;
static uint32_t state_1 = 0x07BF10D5;
static uint32_t state_2 = 0x5FA0B00A;
static uint32_t state_3 = 0x06D98C3F;
static uint32_t lfsr_0 = 0x29EB68FE;
static uint32_t lfsr_1 = 0x4CD98AA6;
static uint32_t lfsr_01 = 0x22C3407B;
static uint32_t cycle = 0x5D8291F3;

static inline uint32_t lfsr(uint32_t n, uint32_t m)
{
return (n >> 1) ^ (-(n & 1) & m);
}

uint32_t get_pseudo_random(void)
{
uint32_t save;

save = state_0 ^ (state_0 << 13);
state_0 = state_1;
state_1 = state_2;
state_2 = state_3;
state_3 = state_3 ^ (state_3 >> 17) ^ (save ^ (save >> 7));

lfsr_0 = lfsr(lfsr_0, 0xE28F75ED);
lfsr_1 = lfsr(lfsr_1, 0x400005B6);
lfsr_01 = (lfsr_0 & ~lfsr_01) | (lfsr_1 & lfsr_01);

cycle += 0x5D31995A;
return (state_3 + lfsr_01) ^ cycle;
}
Reply With Quote
  #72 (permalink)  
Old 01-12-2012, 07:04 PM
orz
Guest
 
Posts: n/a
Default Re: Fast and small Multiplicative Random number generator

On Jan 12, 10:25*am, Pedro Pereira <ppere...@nospicedham.grupopie.com>
wrote:
> I've modified the xorshift with a period of 2^128-1,
> adding two LFSR's with period 2^32-1 and 2^31-1 and
> combined them with a J-K style flip-flop.
>
> It passed the crush/bigcrush tests and is not too slow.
>
> Feedback is welcome!
>
> Pedro Pereira
>
> static uint32_t state_0 = 0xEA2CBD78;
> static uint32_t state_1 = 0x07BF10D5;
> static uint32_t state_2 = 0x5FA0B00A;
> static uint32_t state_3 = 0x06D98C3F;
> static uint32_t lfsr_0 = 0x29EB68FE;
> static uint32_t lfsr_1 = 0x4CD98AA6;
> static uint32_t lfsr_01 = 0x22C3407B;
> static uint32_t cycle = 0x5D8291F3;
>
> static inline uint32_t lfsr(uint32_t n, uint32_t m)
> {
> * return (n >> 1) ^ (-(n & 1) & m);
>
> }
>
> uint32_t get_pseudo_random(void)
> {
> * uint32_t save;
>
> * save = state_0 ^ (state_0 << 13);
> * state_0 = state_1;
> * state_1 = state_2;
> * state_2 = state_3;
> * state_3 = state_3 ^ (state_3 >> 17) ^ (save ^ (save >> 7));
>
> * lfsr_0 = lfsr(lfsr_0, 0xE28F75ED);
> * lfsr_1 = lfsr(lfsr_1, 0x400005B6);
> * lfsr_01 = (lfsr_0 & ~lfsr_01) | (lfsr_1 & lfsr_01);
>
> * cycle += 0x5D31995A;
> * return (state_3 + lfsr_01) ^ cycle;
>
>
>
>
>
>
>
> }


The first LFSR is pretty good for an LFSR. Very good for an LFSR of
its state size. Not great in any absolute sense though - it's still
an LFSR.

The two small LFSRs are extremely low quality. The way they are
combined is rather unusual, but doesn't help on my tests.

The overall function is pretty slow. I could run a standard
cryptographic RNG instead and have faster output.


Broken down as return values vs data and/or CPU time to fail my
testing regime
state_3: 32 GB (8 minutes)
lfsr_0: 4 KB (< 1 second)
lfsr_1: 1 KB (the smallest unit my tests will even think of operating
on)
lfsr_01: 8 KB (< 1 second)
cycle: 1 KB (the smallest unit my tests will even think of operating
on)
lfsr_0 ^ cycle: 16 MB (< 1 second)
lfsr_1 ^ cycle: 2 MB (< 1 second)
lfsr_01 ^ cycle: 8 MB (< 1 second)
state_3 ^ cycle: passed 128 GB so far (half an hour)
state_3 + lfsr0: not yet tested
state_3 + lfsr1: not yet tested
state_3 + lfsr01: not yet tested
(state_3 + lfsr01) ^ cycle: passed 256 GB so far (an hour)

I've got 2 cores dedicated to testing variants atm, I'll post again
when they've filled out the rest of that table.
Reply With Quote
  #73 (permalink)  
Old 01-14-2012, 09:33 AM
orz
Guest
 
Posts: n/a
Default Re: Fast and small Multiplicative Random number generator

On Jan 12, 12:04*pm, orz <cdh...@nospicedham.gmail.com> wrote:
> On Jan 12, 10:25*am, Pedro Pereira <ppere...@nospicedham.grupopie.com>
> wrote:
>
>
>
>
>
>
>
>
>
> > I've modified the xorshift with a period of 2^128-1,
> > adding two LFSR's with period 2^32-1 and 2^31-1 and
> > combined them with a J-K style flip-flop.

>
> > It passed the crush/bigcrush tests and is not too slow.

>
> > Feedback is welcome!

>
> > Pedro Pereira

>
> > static uint32_t state_0 = 0xEA2CBD78;
> > static uint32_t state_1 = 0x07BF10D5;
> > static uint32_t state_2 = 0x5FA0B00A;
> > static uint32_t state_3 = 0x06D98C3F;
> > static uint32_t lfsr_0 = 0x29EB68FE;
> > static uint32_t lfsr_1 = 0x4CD98AA6;
> > static uint32_t lfsr_01 = 0x22C3407B;
> > static uint32_t cycle = 0x5D8291F3;

>
> > static inline uint32_t lfsr(uint32_t n, uint32_t m)
> > {
> > * return (n >> 1) ^ (-(n & 1) & m);

>
> > }

>
> > uint32_t get_pseudo_random(void)
> > {
> > * uint32_t save;

>
> > * save = state_0 ^ (state_0 << 13);
> > * state_0 = state_1;
> > * state_1 = state_2;
> > * state_2 = state_3;
> > * state_3 = state_3 ^ (state_3 >> 17) ^ (save ^ (save >> 7));

>
> > * lfsr_0 = lfsr(lfsr_0, 0xE28F75ED);
> > * lfsr_1 = lfsr(lfsr_1, 0x400005B6);
> > * lfsr_01 = (lfsr_0 & ~lfsr_01) | (lfsr_1 & lfsr_01);

>
> > * cycle += 0x5D31995A;
> > * return (state_3 + lfsr_01) ^ cycle;

>
> > }

>
> The first LFSR is pretty good for an LFSR. *Very good for an LFSR of
> its state size. *Not great in any absolute sense though - it's still
> an LFSR.
>
> The two small LFSRs are extremely low quality. *The way they are
> combined is rather unusual, but doesn't help on my tests.
>
> The overall function is pretty slow. *I could run a standard
> cryptographic RNG instead and have faster output.
>
> Broken down as return values vs data and/or CPU time to fail my
> testing regime
> state_3: 32 GB (8 minutes)
> lfsr_0: 4 KB (< 1 second)
> lfsr_1: 1 KB (the smallest unit my tests will even think of operating
> on)
> lfsr_01: 8 KB (< 1 second)
> cycle: 1 KB (the smallest unit my tests will even think of operating
> on)
> lfsr_0 ^ cycle: 16 MB (< 1 second)
> lfsr_1 ^ cycle: 2 MB (< 1 second)
> lfsr_01 ^ cycle: 8 MB (< 1 second)
> state_3 ^ cycle: passed 128 GB so far (half an hour)
> state_3 + lfsr0: not yet tested
> state_3 + lfsr1: not yet tested
> state_3 + lfsr01: not yet tested
> (state_3 + lfsr01) ^ cycle: passed 256 GB so far (an hour)
>
> I've got 2 cores dedicated to testing variants atm, I'll post again
> when they've filled out the rest of that table.


It's looking like state_3 combined with anything else passes all my
tests.

I am not fan of combining multiple RNGs in that way. I would rather
have every bit of state either effect or be effected by every other
bit of state, or both. When the RNG can be divided in to independent
component RNGs that tends to mean both decreased overall quality and
increased inter-state/seed correlation relative to the quality.

I want to back away from my earlier statement "I could run a standard
cryptographic RNG instead and have faster output". Really, the common
cryptographic RNGs are pretty slow in software - to find one faster
than this I have to basically cheat, picking ones that use larger
integer sizes and are not well accepted as secure. And while your RNG
is kinda slow in my book, it's still faster than mt19937 on gcc
(though slower on MSVC), so there's plenty of people who wouldn't
consider it slow - I have pretty high standards in that regard. Also,
leaving out lfsr_1 and lfsr_01 has, I think, fairly little cost in
output quality or statespace, and yields a significant improvement in
speed. Though it would substantially worsen interstate correlation
issues I think.

BTW, how did you choose the constants for the state_* LFSR? I'm
familiar with methods of finding maximal period constants and have a
few rules of thumb / common sense to avoid bad constants, but after
that I mostly stick to either empirical testing or avalanche-style
testing to distinguish good constants from bad constants. If I had a
better feel for where the constants came from I might scale it down to
working on 16 or 8 bit integers for testing purposes, as it's a lot
easier to find flaws when corner cases are more frequent and the state
size is closer to the log of the number of bins in my chi squared
tests.
Reply With Quote
  #74 (permalink)  
Old 01-23-2012, 06:23 PM
Pedro Pereira
Guest
 
Posts: n/a
Default Re: Fast and small Multiplicative Random number generator

orz wrote:

[...]

> It's looking like state_3 combined with anything else passes all my
> tests.
>
> I am not fan of combining multiple RNGs in that way. I would rather
> have every bit of state either effect or be effected by every other
> bit of state, or both. When the RNG can be divided in to independent
> component RNGs that tends to mean both decreased overall quality and
> increased inter-state/seed correlation relative to the quality.
>
> I want to back away from my earlier statement "I could run a standard
> cryptographic RNG instead and have faster output". Really, the common
> cryptographic RNGs are pretty slow in software - to find one faster
> than this I have to basically cheat, picking ones that use larger
> integer sizes and are not well accepted as secure. And while your RNG
> is kinda slow in my book, it's still faster than mt19937 on gcc
> (though slower on MSVC), so there's plenty of people who wouldn't
> consider it slow - I have pretty high standards in that regard. Also,
> leaving out lfsr_1 and lfsr_01 has, I think, fairly little cost in
> output quality or statespace, and yields a significant improvement in
> speed. Though it would substantially worsen interstate correlation
> issues I think.


Using a block cipher was the advantage of being easy to create
a random access pseudo random number generator (just use it in counter mode).
I tried to pick one, reduce the number of rounds and overall operations,
but I wasn't able to make it fast enough.

>
> BTW, how did you choose the constants for the state_* LFSR? I'm
> familiar with methods of finding maximal period constants and have a
> few rules of thumb / common sense to avoid bad constants, but after
> that I mostly stick to either empirical testing or avalanche-style
> testing to distinguish good constants from bad constants. If I had a
> better feel for where the constants came from I might scale it down to
> working on 16 or 8 bit integers for testing purposes, as it's a lot
> easier to find flaws when corner cases are more frequent and the state
> size is closer to the log of the number of bins in my chi squared
> tests.


I'm using the same method.

Pedro Pereira

Reply With Quote
  #75 (permalink)  
Old 04-06-2012, 12:24 AM
hopcode
Guest
 
Posts: n/a
Default Re: Fast and small Multiplicative Random number generator

Hi,
playing with a custom MWC algo

http://en.wikipedia.org/wiki/Multiply-with-carry

i have found something better (most times) than that
from Steven, using a reduced vector
of 32x8 bytes. the main difference
with that of Steven is that here you can
optimize or customize it without variating
to much from the main performances it offers.
it is called /pi_rand/ because i use a sample
from pi as vector

usage:

mov rcx,362436
xor edx,edx
call .pi_randA


..pi_randA:
;--- RSI point to a vector minimum EDX*8
;--- in ECX c
;--- in EDX i
;--- out EAX,ECX,EDX
;--- preserve ECX,EDX
;--------------------------

;--- i = (i + 1) & 4095;
;--- customize: RSI points to a vector
;--- large at least EDX*8
inc edx
and edx,01Fh ;--- 4095 original

mov rax,[rsi+rdx] ;--- t = a * Q[i] + c;
imul rax,18782

ror rax,11 ;<--- customize here or comment
add rax,rcx

add rcx,rax ;--- c = (t >> 32);
shr rcx,32

add eax,ecx ;--- x = t + c;

cmp eax,ecx ;--- if (x < c) {
sbb r8,r8
neg r8

adc ecx,0 ;--- c++;
add eax,r8d ;--- x++;

;--- (Q[i] = r - x);
sub eax,-2 ;--- sub x,r
neg eax ;--- neg x

mov [rsi+rdx],eax ;--- return (Q[i] = r - x);
ret

i cannot attach a plot because gnuplot runs out of memory
after 8 MiB binary data (sic!), when i need at least 16.

gnuplot> load "E:/gnuplot/demo/mydemoB.dem"
"E:/gnuplot/demo/mydemoB.dem", line 6: out of memory f
or expanding curve points

and,as far as it seems, the supporters too refuse categorically
to face that serious problem again; they discussed about it
in the remote 2009 A.D.

http://groups.google.com/group/comp....c0388d96bddbb5

and their bugged philosopy seems to be:
"gnuplot is not a data analysis tool, it is a plotting program"
oh,...
....really ? :-)

i wonder how could they start/continue coding a plotting
program without considering data,and how huge the bulk of datas may be.

Cheers,

..:mrk[hopcode]
.:x64lab:.
group http://groups.google.com/group/x64lab
site http://sites.google.com/site/x64lab

Reply With Quote
 
Reply

Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are Off
Pingbacks are Off
Refbacks are Off




All times are GMT. The time now is 02:25 AM.


Copyright ©2009

LinkBacks Enabled by vBSEO 3.3.0 RC2 © 2009, Crawlability, Inc.