|
|||
|
On Jan 5, 7:42*pm, sfuerst <svfue...@nospicedham.gmail.com> wrote:
> This is the rng I'm investigating now. *(Still running BigCrush tests > on various 32bit functions of the 64 bit output.) * It doesn't fit all > of your criteria, but it fits most, and is very very fast. > > It is a hash of the output of a 128 bit LCG. *The LCG uses a > multiplier of 2^64 + 1, so is easy to step. *Fast forwarding, and > reversing are trivial. *The hash is the most complex thing I've > managed to stuff into the cycles spent accessing L1 in the LCG update. > > My constraints are a little different than yours, so more is effort > has been "spent" making sure parallel users can be assured of > different sequences. *(Hence the imul as the fourth instruction.) > > .align 16 > .globl rng_new > .type *rng_new,@function > rng_new: > * * * * movabs *$0x6595a395a1ec531b, %rcx > > * * * * mov * *8(%rdi), %rax > > * * * * # Or you could have xor %rsi, %rax if process number is the second > parm > * * * * xor * *%rdi, %rax > * * * * imul * %rcx, %rax > > * * * * mov * *(%rdi), %rdx * # The LCG update > * * * * add * *%rcx, (%rdi) > * * * * adc * *%rdx, 8(%rdi) > > * * * * xor * *%rax, %rdx > * * * * ror * *$29, %rdx * *# Can use a rotate here due to the extra entropy > from %rdx > * * * * xor * *%rdx, %rax > * * * * mov * *%rax, %rdx > * * * * shl * *$31, %rdx > * * * * xor * *%rdx, %rax > * * * * mov * *%rax, %rdx > * * * * shr * *$23, %rdx > * * * * xor * *%rdx, %rax > * * * * retq > .size * rng_new, .-rng_new > > I've also looked at 192 and 256 bit variants, but getting a good fast > hash, or good fast LCG with that many bits is seemingly futile. *I'm > willing to be proved wrong though. :-) > > Steven Okay... after quite a bit of effort, I've found something much, much better. This is three different rng's in one: ..align 16 ..globl rng_hash_128 ..type rng_hash_128,@function rng_hash_128: mov 8(%rdi), %rax movabs $0x6595a395a1ec531b, %rcx # Option 1) # Non-threaded, fastest. No xor instruction used. # Option 2) # Threaded, use the address of the seed as a nonce. # xor %rdi, %rax # Option 3) # Threaded, pass a nonce as a second parameter. # xor %rsi, %rax # or have it stored in a larger seed: # xor 16(%rdi), %rax mov (%rdi), %rsi mul %rcx add %rcx, (%rdi) adc %rsi, 8(%rdi) xor %rsi, %rax xor %rdx, %rax mul %rcx add %rsi, %rax add %rdx, %rax retq ..size rng_hash_128, .-rng_hash_128 For non-threaded use, pick the one without the extra xor instruction. This is fastest, taking 3.77s to output 10^9 64bit random numbers on this machine. If you have multiple threads, then pick option two. The rng will use the different addresses of the seeds to make sure that each thread gets a different output sequence. This takes 4.05s to output 10^9 64bit random numbers on this machine. If you are using multiple computers, or require multithreaded repeatability, then pick option three. Pass a variable describing the thread/machine-number to the rng. (Another option is to store this extra state into a slightly expanded seed, and use "xor 16(%rdi), %rax" instead.) In effect, options two and three choose a 2^128 subsequence from a 2^192 element sequence. This rng passes TestU01 BigCrush with the top 32 bits, bottom 32 bits, middle 32 bits, and alternating bottom and top 32 bit bitstream, with a total of zero failures. It works by hashing the output of a 128 bit LCG, with a simple multiplier of 2^64 + 1 and addend of a "random" 64 bit odd constant. Feel free to try some other constant. As long as it has roughly half the bits set, and is odd, it should work well. Since the base is a LCG, fast-forwarding or reversing the rng is trivial. The hash is based on merging the upper and lower 64 bit halves of two 64x64->128 bit multiplies. This loses entropy... however since we also merge in another part of the seed, the extra entropy from that compensates. The result is an irreversible hash that seems to have very good statistical properties. Why is this rng so fast? Most normal random number generators use a mapping x_new = f(x), where the function f defines how the seed state x is altered. This rng instead of outputting (part of) the seed, outputs a hash g(x) of it instead. Thus f(x), and g(x) need to be calculated in every call. This looks obviously slower than the other method. However, two things save us. The first is that modern processors are out-of- order, and can execute many instructions in parallel. Thus the calculation of f() and g() can overlap. The second is that the constraints of f() and g() are different. f() needs to map to a large cycle of states, and have good statistical randomness. g() just needs to have good statistical randomness. So by picking a very fast f(), with poor randomness, and a good-enough g() we can obtain a very fast rng that still passes stringent statistical tests. The problem is finding a good fast hash. Fortunately, the extended-precision multiply used above works for g(). (A fast yet poor quality LCG easily works for f() ) Note: when re-implementing the above, be careful with the instruction ordering. Even small rearrangements can alter the speed by tens of percent. Steven |
|
|
||||
|
||||
|
|
|
|||
|
On Jan 8, 2:51*pm, Frank Kotler <fbkot...@myfairpoint.net> wrote:
> orz wrote: > > Hey, do you have any problems with people using any of this stuff, > > modifying it, etc, potentially in closed source or commercial apps? *I > > try to keep all of my recommended RNGs public domain or very close to > > public domain. > > Okay... > > I guess you guys have quit using rdtsc... rdtsc doesn't have enough entropy to be usable in a rapidly called rng. Instead, what you should do is gather the few bits of entropy that rdtsc has, and reseed the entire state of the rng once you have enough. Unfortunately, rdtsc is a rather slow instruction, so calling it every rng invocation is painful. > something's been nagging at my > mind... on my machine (P4), rdtsc always returns an even number... That seems really annoying. The most obvious assumption you could make is that the lowest bit of the cycle count is the most random. Steven |
|
|||
|
On Jan 10, 4:38 pm, sfuerst <svfue...@nospicedham.gmail.com> wrote:
> On Jan 6, 3:31 pm, orz <cdh...@nospicedham.gmail.com> wrote: > > > > > > > > > > > On Jan 6, 11:31 am, sfuerst <svfue...@nospicedham.gmail.com> wrote: > > > > > The interstate correlation seems bad at all addresses, including > > > > sustained interstate correlation. With a statesize of only 2**128 > > > > that will likely cause problems in a number of parallel usage > > > > scenarios. Though in some scenarios that can be made up for by the > > > > use of the address. Which is a practice that I disapprove of - it > > > > makes results non-reproducible, breaks some uses (including some ofmy > > > > RNG tests...), and makes testing harder (I could, for instance, have > > > > missed the problems that occur at some addresses). In short, it makes > > > > everything messier by violating the abstraction layer I expect RNGsto > > > > obey. The benefits it provides, while potentially comforting for some > > > > multithreaded users, are of very little utility when paired with high > > > > quality RNGs that are seeded correctly - I attempt to provide similar > > > > but more flexible benefits to my (non-existant) users through a > > > > combination of good RNG general properties (for that kind of purpose > > > > statespace size and interseed correlations in particular) and > > > > intelligent seeding interfaces. > > > > The advantage of using the address is that no communication is > > > required at all between threads to guarantee that they get different > > > sequences. Sequence-dividing schemes run into problems when you have > > > threads that fork off other threads, and the resulting number of users > > > is initially unknown. (Partitioning the seed-space becomes > > > impossible.) The other issue is multi-machine codes using something > > > like MPI. There, you have a number for each "rank", and need it to > > > make the rng give a different sequence. (That is the commented out > > > rsi as second parameter option.) > > > > Of course, since it seems that the quality of the resulting stream is > > > strongly affected by the initial xor value, something needs to be > > > fixed first. > > > I'm aware of the benefits. Fundamentally, the issue of trying to get > > each thread an uncorrelated sequence is identical to the issue of > > trying to get each run an uncorrelated sequence. You can solve it in > > an automated fashion for the specific case of per-thread RNGs in > > shared address space TLS by violating the abstraction layer, but that > > is no help at all on sequential runs, distributed runs, parallel runs, > > sequential threads, etc. Users who need decent random number > > generators may be doing monte carlo simulations or whatever that want > > a lack of correlation over any of those domains. Users who really > > know what they're doing will be hurt more than helped by your method > > (they might appreciate simplified thread seeding, but they don't want > > to have to remember the details of how the abstraction layer is > > violated), and those who don't will only be helped in a limited set of > > scenarios. > > I don't think things are that bad. Often the simple case will "just > work". Leave the complex interface for those who really need it, and > have sensible defaults that work for most users. > > > You effectively get an extra 4-8 bytes of static state, without having > > to store it in memory or even read it from memory, and those extra > > bits of state are guaranteed to never share the same value as any > > other RNG that is instantiated at the same time in the same address > > space. > > I know. It feels like cheating... :-) It is the only way I know to > get a random number generator to have a larger state-space than that > bounded by the size of its seed. Breaking fundamental limits is > always entertaining. > Actually, on second thought I want to qualify that slightly - you have no control over the value of that state, so you have to pessimistically assume it will be near-zero entropy, effectively far less than than a pointer size worth of entropy. Really, the only thing you can count on is uniqueness (for the life the RNG instance). Which is, admittedly, pretty sweet. But I shouldn't have described it in terms of the pointer size, size the pointer size is only a measure of the maximum entropy involved, and has very little to do, in practice, with the actual entropy involved. > > > > > > > > > My approach shares some similarities but is pretty different: > > 1. I have a large enough statespace so that initializing them to a > > random state is probabilistically good enough to eliminate > > correlations over a reasonable domain, regardless of the nature of the > > domain. That generally means at least 3 variables worth of state for > > 16 to 64 bit RNGs, preferably 4 or 5. And definitely more for 8 bit > > RNGs. > > 2a. I supply an automatic seeding mechanism to users. They don't have > > to use it, but I encourage its use for cases where neither crypto > > security nor reproducible behavior is required. They can still > > reproduce the RNG output by copying or serializing the RNG state, just > > not by saving the seed since there is no visible seed in the > > autoseeding case. The automatic seeding mechanism initializes an > > internal RNG with a function of the program start time, the current > > time, the current value of the stack pointer (which is unique per- > > thread for the life of the thread), the address of the RNG it was > > called on (the same basic thing you were using!, but without the > > problems), the number of RNGs initialized within the current thread > > (if TLS variables are supported on the platform), a few other things, > > and on windows some platform specific stuff (I ought to add platform > > specific stuff for unix sometime...). The internal RNG then > > initializes the users RNG to a random state, in a fashion that the > > users RNG algorithm can customize if it needs to (to preclude some > > states, for instance). Some of those are precomputed at application > > launch time, others are done at autoseeding time. The internal RNG > > algorithm is chosen to be semi-fast (on 64 bit anyway... slower on > > other CPUs), very high quality, guaranteed to have no short cycles, > > have high quality seeding, be quick to seed, and be capable of seeding > > from oddly structured data. > > 2b. I also supply some more complex tools for high quality seeding for > > cryptographic users and power users, who hopefully know what they're > > doing. And I also supply a very simple seed-from-a-64-bit-integer > > interface, which sometimes maps to an RNGs reference seeding algorithm > > (if it has one that uses integers in the range supplied) and other > > times maps to something similar to the autoseeder except > > reproducible. All RNGs get all of those seeding mechanisms on an > > identical interface. Additionally many RNGs also offer other custom > > seeding mechanisms, and a few have default states that they go to > > deterministically if not seeded. Those without default states require > > a seed in their constructor to prevent accidents unless the user > > explicitly requests otherwise (by passing a magic token to the > > constructor) or is using the "raw" version of the RNG without the > > normal interface. If an RNG that does not have a default state is > > used without being seeded then its behavior is not defined. > > This is the "standard" technique for parallel generators... so there > is nothing much wrong with it. That doesn't mean that other (perhaps > simpler) methods don't exist. Sometimes the new methods do end up > being better... That's how progress happens. :-) > Is it? I've seen several of the individual elements of my scheme in other packages (TestU01, GSL, Boost, etc), but never more than one or two of the elements in a single package, and a few elements I've never seen elsewhere. I've seen your xor-with-address scheme several times on the web, always combined with xorshift+multiply pairs, but never in a library that I can recall. I think that interface-wise, for the simple case, my scheme is as simple as anything that allows multiple RNG states (ie simpler than the libraries aimed at research, but not as simple as libc rand/ srand). Implementation-wise, my scheme is not simple... there ends up being a call to the system RNG at program start, and some additional hashing on a per-autoseeding basis and a tiny bit on a per-word-of-RNG- state-autoseeded basis. For most uses it's overkill, but since many users have a hard time figuring out just how broad a scope they need lack of correlation over, and since the resources used are cheap, I shoot for a little overkill. > > > > > > > > > No interthread or interprocess or intermachine communication is > > required. It mostly works equally well for sequential RNG > > instantiations in the same thread, in different threads, on sequential > > runs, on simultaneous runs, on distributed machines, etc. The weakest > > case at the moment is if a thread is created, then destroyed, then > > very quickly recreated with the same stack address and TLS address. > > Simultaneous launches on non-windows platforms are also weak. Every > > other case that I can think of is strong, though not up to > > cryptographic standards. The platform-specific stuff is done just > > once per application launch. The requirements are: > > 1. Either the library initialization function or the first automatic > > seeding is performed before additional threads are launched. That > > will tend to naturally be true for programs that don't go around doing > > crazy things with the launch and linking processes. > > 2. the degree of lack of correlation desired across the domain not > > exceed the square root of the statespace size. That is almost > > identical to a fundamental requirement for high quality RNGs that > > would apply regardless, so it's no big deal. > > You can fix the rapid recreate problem by using a guid. Provided that > the guid updates rapidly enough, there is no way other users can have > the same initial seed state. The trick is to make the creation stall > long enough to fulfill this speed limitation. Since thread creation > is (meant to be) slow and rare, this isn't much of a constraint in > practice. If fixed the non-windows issues since my last post. I'm still not sure what to do about the thread recreation issue. One solution would be to use some sort of locking mechanism or atomic math, maybe EnterCriticalSection or InterlockedIncrement on Win32/MSVC. But I'd like to stay very portable... I currently use no libraries aside from the C++ standard library (and, on windows, kernel32.lib), and nothing compiler or OS or ISA specific without ifdefs and fallback paths on unrecognized systems (admittedly the fallbacks are not very good for OSes that are neither windows nor *nix...). Another solution would be a malloc(1) on a per-thread basis, with no corresponding free. I'm not familiar with "guid", but it appears to refer to a variety of different algorithms for generating 128 bit numbers intended to be unique across process boundaries and often network boundaries. Apparently I would call UuidCreateSequential on win32, I'm not sure what I would do on *nix? But that isn't really looking any better to me in terms of portability or speed than the other options. Though it could replace the RtlGenRandom and/or /dev/random usage on program start. > > The typical multithreaded user writes something like: > > > typedef PractRand::RNGs::Polymorphic::jsf64 RNG_TYPE; > > __thread RNG_TYPE rng(PractRand::SEED_AUTO); > > > ...and is done, with the same basic benefits as your approach. The > > more sophisticated user can do more sophisticated things though, no > > one will be fooled in to thinking that a nondeterministic case should > > behave deterministically, and serializing the RNG state for later > > reproduction will work properly (which will let some of my more exotic > > RNG tests behave correctly on it). > > I'll admit that the interface (if you can call it that) for the form > of rng I've proposed isn't well-suited for C++. However, since I > don't like C++ very much, I consider that a feature rather than a > bug. :-P > I don't think it has issues with C++ really. Well, I suppose arguably with object orientation in general since its state can't (or shouldn't) really be treated like a normal object. But C++ would handle it fine a simple function instead of a class. What I would say is that it lacks reproducibility (via either manual seeding or state serialization). RNG users sometimes want reproducibility for debugging purposes, network games and apps, hashing purposes, probably a few other uses I can't think of atm. > > > > > > > > Notice how it isn't using the adc instruction at all. It also seems > > > to be rather poor code. :-( The above benchmarks as taking 5.8s to do > > > a billion random numbers. The asm version I gave takes 4.2s Note > > > that the fastest of the other rng's you gave takes 4.6s on this > > > machine. I don't begrudge the compile much though... even simple > > > rearrangements of the asm instructions cause the time taken to vary by > > > up to 30%. It is very hard to come up with the right ordering to > > > overlap memory accesses with computation. > > > 4.6 seconds, ow! I know performance at the high end gets weird, but > > still I had hoped that at least one of the RNGs I posted would do > > better. Several of them reliably take less then 2 seconds for that on > > both my Core i5 2500 and my old Core2 3 GHrz using both 64 bit gcc and > > less than twice that using 32 bit MSVC. All w/o asm. > > My computer is old and slow. Performance is relative, remember. Yeah. Sigh. Even just looking at it as relative performance... Is your optimizer allowed to inline your stuff and/or remap register names, or does it have to stick to calling conventions and/or the code as you wrote it? On my benchmarks, if I force it to not inline the calls then they hit a performance maximum around 300 million calls per second (though it varies quite a bit, from a low around 100 to a high around 500, it's usually between 250 and 325), regardless of how fast the RNG itself is. Even return 0 will get capped at that rate. If I let it inline, even then it will still hit a cap around 1.1 billion calls per second even for very simple RNGs like "int random() {int tmp = a; a += b; return tmp;}". A simple "return 0;" is faster still, but by so much that I suspect it optimizes away almost the entire calling loop in that instance. And a few of the RNGs I've posted in this thread get 0.7 to 1.0 billion calls per second here - just under what looks like the maximum possible performance on this CPU / compiler pair. MSVC results are very similar, though it will inline even across compilation unit boundaries, unlike gcc (4.5.4) - I only see the lower cap on MSVC if the call goes through a vtable. In my attempts to achieve more meaningful performance measures I've tried to look for aggregate test results across multiple compilers, multiple CPU generations, multiple structures of calling loops, etc. I had thought it helped a little, but it's hard to say really, and often that gets tiresome and I'll skip most of it for extended periods of time. > > > Once I find a version with slightly better properties, I'll think > > > about finding something the C compiler can convert into good asm. > > > Previously, the trick was to use a union with an intrinsic 128 bit > > > unsigned integer type. > > > > > --- > > > > BTW, side note, is that the correct usage of movabs? In your code it > > > > is used to move a constant in to a register, but my quick googling of > > > > it suggested that it was only for reading from memory at absolute > > > > addresses? Really I don't know much about x86-64 asm so I could > > > > easily have just misunderstood what I read. > > > > "movabs "is the instruction gcc uses to move a 64 bit immediate into a > > > 64 bit register, as you can see from the above. Don't ask me why they > > > call it that rather than just plain old "mov". :-/ > > > > Steven > > > The following variant of that: > > Word tmp = high; > > tmp ^= tmp >> (OUTPUT_BITS / 2); > > tmp *= K; > > tmp ^= rotate(tmp ^ low, OUTPUT_BITS/3); > > Word old = low; > > low += K; > > high += old + ((low < K) ? 1 : 0); > > return tmp; > > > ...does much better on statistical tests when used on 32 bit integers > > instead of 64 bit integers. For the full 64 bit versions there is no > > way to distinguish quality as both that and the original pass all > > tests I've tried (when the xored value wasn't low-entropy), but > > probably the 32 bit version is representative. It still has the > > interstate correlation issues, but those are very hard to get rid of > > on fast random access RNGs, and no worse than the ones in the CLCG I'm > > looking to replace. Adding a second multiply and another xorshift > > gets the interstate correlation issue under control without hurting > > performance intolerably - users who want performance faster than that > > can damn well use a non-random-access RNG. I might also try combining > > successive elements like my hashed LFSR did to see if I can eliminate > > the need for a 2nd multiply that way. I'm thinking of adding a third > > word of state doing something dumb to get the statespace a little > > bigger, but at this point really it's looking good enough to be worth > > replacing the CLCG. > > I tried something much like the above, and it wasn't very good. It > either had poor randomness, or failed to be fast enough to bother > with. > That change had no performance impact on my computer (it's becoming increasing apparent that we see very different performance impact from changes), and did well on statistical tests when used on 32 bit integers where the original (when changed to use 32 bit integers) did poorly (IIRC failing after 5 minutes or so). A few other tweaks got it to the point of passing an 8 hour test on 32 bit, I didn't try any longer tests. > > > Hey, do you have any problems with people using any of this stuff, > > modifying it, etc, potentially in closed source or commercial apps? I > > try to keep all of my recommended RNGs public domain or very close to > > public domain. > > No problems. It is just math, after all. > > Steven Excellent, I'll probably be using something in the general ballpark of this stuff for random access RNGs for the time being, until I can figure out a way to get good random access without multiplication. On Jan 10, 5:19*pm, sfuerst <svfue...@nospicedham.gmail.com> wrote: > On Jan 5, 7:42*pm, sfuerst <svfue...@nospicedham.gmail.com> wrote: > > > > > > > > > > > This is the rng I'm investigating now. *(Still running BigCrush tests > > on various 32bit functions of the 64 bit output.) * It doesn't fit all > > of your criteria, but it fits most, and is very very fast. > > > It is a hash of the output of a 128 bit LCG. *The LCG uses a > > multiplier of 2^64 + 1, so is easy to step. *Fast forwarding, and > > reversing are trivial. *The hash is the most complex thing I've > > managed to stuff into the cycles spent accessing L1 in the LCG update. > > > My constraints are a little different than yours, so more is effort > > has been "spent" making sure parallel users can be assured of > > different sequences. *(Hence the imul as the fourth instruction.) > > > .align 16 > > .globl rng_new > > .type *rng_new,@function > > rng_new: > > * * * * movabs *$0x6595a395a1ec531b, %rcx > > > * * * * mov * *8(%rdi), %rax > > > * * * * # Or you could have xor %rsi, %rax if process number isthe second > > parm > > * * * * xor * *%rdi, %rax > > * * * * imul * %rcx, %rax > > > * * * * mov * *(%rdi), %rdx * # The LCG update > > * * * * add * *%rcx, (%rdi) > > * * * * adc * *%rdx, 8(%rdi) > > > * * * * xor * *%rax, %rdx > > * * * * ror * *$29, %rdx * *# Can use a rotate here dueto the extra entropy > > from %rdx > > * * * * xor * *%rdx, %rax > > * * * * mov * *%rax, %rdx > > * * * * shl * *$31, %rdx > > * * * * xor * *%rdx, %rax > > * * * * mov * *%rax, %rdx > > * * * * shr * *$23, %rdx > > * * * * xor * *%rdx, %rax > > * * * * retq > > .size * rng_new, .-rng_new > > > I've also looked at 192 and 256 bit variants, but getting a good fast > > hash, or good fast LCG with that many bits is seemingly futile. *I'm > > willing to be proved wrong though. :-) > > > Steven > > Okay... after quite a bit of effort, I've found something much, much > better. > > This is three different rng's in one: > > .align 16 > .globl rng_hash_128 > .type *rng_hash_128,@function > rng_hash_128: > * * * * mov * *8(%rdi), %rax > > * * * * movabs $0x6595a395a1ec531b, %rcx > > # * Option 1) > # * Non-threaded, fastest. *No xor instruction used. > > # * Option 2) > # * Threaded, use the address of the seed as a nonce. > # * * *xor * *%rdi, %rax > > # * Option 3) > # * Threaded, pass a nonce as a second parameter. > # * * * xor * *%rsi, %rax > # *or have it stored in a larger seed: > # * * *xor * *16(%rdi), %rax > > * * * * mov * *(%rdi), %rsi > * * * * mul * *%rcx > > * * * * add * *%rcx, (%rdi) > * * * * adc * *%rsi, 8(%rdi) > > * * * * xor * *%rsi, %rax > * * * * xor * *%rdx, %rax > * * * * mul * *%rcx > * * * * add * *%rsi, %rax > * * * * add * *%rdx, %rax > > * * * * retq > .size * rng_hash_128, .-rng_hash_128 > > For non-threaded use, pick the one without the extra xor instruction. > This is fastest, taking 3.77s to output 10^9 64bit random numbers on > this machine. I get about 2.5 GB/s when I stick that in as inline asm on gcc. About 300 million calls per second, about 10 cycles per call. I used this code: Uint64 rv; asm volatile ( "mov 8(%%rdi), %%rax \n" "movabs $0x6595a395a1ec531b, %%rcx \n" "mov (%%rdi), %%rsi \n" "mul %%rcx \n" "add %%rcx, (%%rdi) \n" "adc %%rsi, 8(%%rdi) \n" "xor %%rsi, %%rax \n" "xor %%rdx, %%rax \n" "mul %%rcx \n" "add %%rsi, %%rax \n" "add %%rdx, %%rax \n" :"=a"(rv)//outputs :"D"(this)//inputs :"memory", "%rcx", "%rsi", "%rdx"//clobbers ); return rv; If I left off the "volatile" keyword then the optimizer seemed to go berserk on non-vtable calls to it for some reason, though vtable calls remained the same. Probably I screwed up the constraints somehow. I also tried implementing it in C, but neither gcc nor MSVC will let me use a 128 bit integer. I did implement it in C by scaling it down to 32 bit instead of 64 bit and compiling on a 32 bit compiler. It wasn't particularly fast that way relative to other 32 bit RNGs. Count it as one more bit of speed-for-you != speed-for-me, not even relative speed. Either that or my inline asm sucked, and my compiler sucked too. > If you have multiple threads, then pick option two. *The rng will use > the different addresses of the seeds to make sure that each thread > gets a different output sequence. *This takes 4.05s to output 10^9 > 64bit random numbers on this machine. > > If you are using multiple computers, or require multithreaded > repeatability, then pick option three. *Pass a variable describing the > thread/machine-number to the rng. *(Another option is to store this > extra state into a slightly expanded seed, and use "xor * *16(%rdi), > %rax" instead.) > > In effect, options two and three choose a 2^128 subsequence from a > 2^192 element sequence. > > This rng passes TestU01 BigCrush with the top 32 bits, bottom 32 bits, > middle 32 bits, and alternating bottom and top 32 bit bitstream, with > a total of zero failures. My testing is reporting good quality for it as well. I'm testing both that algorithm in asm, plus that algorithm in C scaled down to operate on 32 bit integers instead of 64. > It works by hashing the output of a 128 bit LCG, with a simple > multiplier of 2^64 + 1 and addend of a "random" 64 bit odd constant. > Feel free to try some other constant. *As long as it has roughly half > the bits set, and is odd, it should work well. * Since the base is a > LCG, fast-forwarding or reversing the rng is trivial. > > The hash is based on merging the upper and lower 64 bit halves of two > 64x64->128 bit multiplies. *This loses entropy... however since we > also merge in another part of the seed, the extra entropy from that > compensates. *The result is an irreversible hash that seems to have > very good statistical properties. > > Why is this rng so fast? > > Most normal random number generators use a mapping x_new = f(x), where > the function f defines how the seed state x is altered. > > This rng instead of outputting (part of) the seed, outputs a hash g(x) > of it instead. *Thus f(x), and g(x) need to be calculated in every > call. *This looks obviously slower than the other method. *However, > two things save us. *The first is that modern processors are out-of- > order, and can execute many instructions in parallel. *Thus the > calculation of f() and g() can overlap. *The second is that the > constraints of f() and g() are different. > > f() needs to map to a large cycle of states, and have good statistical > randomness. *g() just needs to have good statistical randomness. > > So by picking a very fast f(), with poor randomness, and a good-enough > g() we can obtain a very fast rng that still passes stringent > statistical tests. *The problem is finding a good fast hash. > Fortunately, the extended-precision multiply used above works for g(). > (A fast yet poor quality LCG easily works for f() ) > > Note: when re-implementing the above, be careful with the instruction > ordering. *Even small rearrangements can alter the speed by tens of > percent. > > Steven On the complex-output-functions = better-parallelism argument I'm pretty meh. Yeah, the output function can be done in parallel with the state transition function. Yeah, the output function can be done in parallel with the next output function if the state transition is fast enough and the calling code is nice enough and the scheduler is psychic enough. But there are severe limits to how much can be done at once, both due to limits of execution resources (large integer multiplication in particular takes a huge transistor & power budget) and due to scheduling difficulty. My understanding is that most CPUs can't maintain a IPC more than 4 on any remotely real world code no matter how parallel it is, far lower if lots of complex operations like large integer multiplication are involved. Notice how recent CPU generations haven't really been getting significantly more superscalar? These days heat is the fundamental limiting factor, and executing all those operations will produce the same amount of heat regardless of how how smart the scheduler is... in fact a smarter scheduler will just add overhead to the thermal budget, so the next few CPU generations are not likely to emphasize OOO any more than the last few have (in the desktop world anyway...). Also, most fast RNGs with simple output functions have highly parallel state transition functions, typically 2 to 4 simple operations. Your code there is a critical path of 2 simple operations on the state transition function, and 6 operations on the output function, two of which are high latency instructions. mul 64x64->128 is reported to be 5 cycle latency on recent AMD, Intel's most recent optimization guide seems to claim a latency of 14-18 cycles on that but probably I'm misreading that somehow as it seems improbably high... a quick googling suggests it might be closer to 3 cycles. Since you have two of them in your critical path I end up with a best-case performance on the order of 6-10 cycles). If the output never actually gets used, that's a different story... the simple transition function might dominate then. |
|
|||
|
On Jan 10, 5:19*pm, sfuerst <svfue...@nospicedham.gmail.com> wrote:
> Note: when re-implementing the above, be careful with the instruction > ordering. *Even small rearrangements can alter the speed by tens of > percent. forgot to ask: what CPU are you using? |
|
|||
|
On Jan 11, 3:37*am, orz <cdh...@nospicedham.gmail.com> wrote:
> Also, most fast RNGs with simple output functions have highly parallel > state transition functions, typically 2 to 4 simple operations. *Your > code there is a critical path of 2 simple operations on the state > transition function, and 6 operations on the output function, two of > which are high latency instructions. *mul 64x64->128 is reported to be > 5 cycle latency on recent AMD, Intel's most recent optimization guide > seems to claim a latency of 14-18 cycles on that but probably I'm > misreading that somehow as it seems improbably high... a quick > googling suggests it might be closer to 3 cycles. *Since you have two > of them in your critical path I end up with a best-case performance on > the order of 6-10 cycles). *If the output never actually gets used, > that's a different story... the simple transition function might > dominate then. I was a bit tired when I wrote that, lots of typos and mistakes though I stand by the general concept. The talk of latencies should have mentioned throughputs and ratios of throughputs to latencies. Intel claims that mul 64x64->128 should have a latency of 4 to 10 (usually 10) cycles on core 7 or 7 cycles on core 2, and a throughput of 1 cycle on core 7 or 3 cycles on core 2. The ratio of those latencies to throughputs ranges from 2.3 to 10.0. Which implies that executing sequential output functions in parallel with each other is a significant possibility for that RNG. BTW, add-with-carry is listed as having significantly worse latency and throughput than regular add... add at 1 latency and 0.33 throughput on all core7s and core2s, adc at 2 latency and 1 to 1.5 throughput, cmp+cmov+add at 4 latency, 1.16 to 1.66 throughput (adding the throughputs of the individual instructions, though technically that's not correct). If we assume that the scheduling is perfect and there's no extra complications from the calling code or anything, then what matters is the larger of the overall functions throughput and the state transition functions critical path latency. For that RNG, that looks like 3 cycles latency for the state transition function on most of Intel's desktop CPUs, and an overall throughput in the neighborhood of ~5 cycles on core7 or ~8 cycles on core2. Compare that to, say, the RNG on Robert Jenkins small fast RNG design page, with a critical path on the state transition function of 1 rotate and 2 additions, and a grand total of 4 additions, 1 xor, and 2-3 rotates - a critical path latency of 3 cycles and an overall throughput of a little less than 3 cycles. Of course, that's all based upon massively simplified models of the CPUs performance - reality rarely conforms to it. But since our main alternative seems to be benchmarks that yield very different results on different computers... |
|
|||
|
On Jan 11, 3:39*am, orz <cdh...@nospicedham.gmail.com> wrote:
> On Jan 10, 5:19*pm, sfuerst <svfue...@nospicedham.gmail.com> wrote: > > > Note: when re-implementing the above, be careful with the instruction > > ordering. *Even small rearrangements can alter the speed by tens of > > percent. > > forgot to ask: what CPU are you using? An Opteron 280. It obviously has quite a different behaviour than the Intel machine you are testing on. Steven |
|
|||
|
something gone wrong by posting with google...
ok, i post it again On 12.01.2012 01:59, orz wrote: > Of course, that's all based upon massively simplified models of the > CPUs performance - reality rarely conforms to it. But since our main > alternative seems to be benchmarks that yield very different results > on different computers... i think struggling on single CPU instructions is not so worth the while nowadays. SIMD SSE may give some advantages, when using only the CPU. btw, an exaustive paper benchmarking the Park-Miller using CUDA. http://www.cs.ucl.ac.uk/staff/W.Lang...2009_CIGPU.pdf and B.Langdon's publications, interesting imho. https://iris.ucl.ac.uk/research/publ...WBLAN93&page=1 Cheers, |
|
|||
|
On Jan 11, 3:37*am, orz <cdh...@nospicedham.gmail.com> wrote:
> > > You effectively get an extra 4-8 bytes of static state, without having > > > to store it in memory or even read it from memory, and those extra > > > bits of state are guaranteed to never share the same value as any > > > other RNG that is instantiated at the same time in the same address > > > space. > > > I know. *It feels like cheating... *:-) *It is the only way I know to > > get a random number generator to have a larger state-space than that > > bounded by the size of its seed. *Breaking fundamental limits is > > always entertaining. > > Actually, on second thought I want to qualify that slightly - you have > no control over the value of that state, so you have to > pessimistically assume it will be near-zero entropy, effectively far > less than than a pointer size worth of entropy. *Really, the only > thing you can count on is uniqueness (for the life the RNG instance). > Which is, admittedly, pretty sweet. *But I shouldn't have described it > in terms of the pointer size, size the pointer size is only a measure > of the maximum entropy involved, and has very little to do, in > practice, with the actual entropy involved. Yep. All the testing I have done has assumed that number was zero. If it is anything else, that's entropic gravy. Unfortunately, I started discussing the other rather poor rng before testing had completed, which confused things. The preliminary results did look good at the time... (which is a rather bad excuse, but anyway...) > > > My approach shares some similarities but is pretty different: > > > 1. I have a large enough statespace so that initializing them to a > > > random state is probabilistically good enough to eliminate > > > correlations over a reasonable domain, regardless of the nature of the > > > domain. *That generally means at least 3 variables worth of state for > > > 16 to 64 bit RNGs, preferably 4 or 5. *And definitely more for 8 bit > > > RNGs. > > > 2a. I supply an automatic seeding mechanism to users. *They don't have > > > to use it, but I encourage its use for cases where neither crypto > > > security nor reproducible behavior is required. *They can still > > > reproduce the RNG output by copying or serializing the RNG state, just > > > not by saving the seed since there is no visible seed in the > > > autoseeding case. *The automatic seeding mechanism initializes an > > > internal RNG with a function of the program start time, the current > > > time, the current value of the stack pointer (which is unique per- > > > thread for the life of the thread), the address of the RNG it was > > > called on (the same basic thing you were using!, but without the > > > problems), the number of RNGs initialized within the current thread > > > (if TLS variables are supported on the platform), a few other things, > > > and on windows some platform specific stuff (I ought to add platform > > > specific stuff for unix sometime...). *The internal RNG then > > > initializes the users RNG to a random state, in a fashion that the > > > users RNG algorithm can customize if it needs to (to preclude some > > > states, for instance). *Some of those are precomputed at application > > > launch time, others are done at autoseeding time. *The internal RNG > > > algorithm is chosen to be semi-fast (on 64 bit anyway... slower on > > > other CPUs), very high quality, guaranteed to have no short cycles, > > > have high quality seeding, be quick to seed, and be capable of seeding > > > from oddly structured data. > > > 2b. I also supply some more complex tools for high quality seeding for > > > cryptographic users and power users, who hopefully know what they're > > > doing. *And I also supply a very simple seed-from-a-64-bit-integer > > > interface, which sometimes maps to an RNGs reference seeding algorithm > > > (if it has one that uses integers in the range supplied) and other > > > times maps to something similar to the autoseeder except > > > reproducible. *All RNGs get all of those seeding mechanisms on an > > > identical interface. *Additionally many RNGs also offer other custom > > > seeding mechanisms, and a few have default states that they go to > > > deterministically if not seeded. *Those without default states require > > > a seed in their constructor to prevent accidents unless the user > > > explicitly requests otherwise (by passing a magic token to the > > > constructor) or is using the "raw" version of the RNG without the > > > normal interface. *If an RNG that does not have a default state is > > > used without being seeded then its behavior is not defined. > > > This is the "standard" technique for parallel generators... so there > > is nothing much wrong with it. *That doesn't mean that other (perhaps > > simpler) methods don't exist. *Sometimes the new methods do end up > > being better... *That's how progress happens. :-) > > Is it? *I've seen several of the individual elements of my scheme in > other packages (TestU01, GSL, Boost, etc), but never more than one or > two of the elements in a single package, and a few elements I've never > seen elsewhere. *I've seen your xor-with-address scheme several times > on the web, always combined with xorshift+multiply pairs, but never in > a library that I can recall. The "standard" probably is the SPRNG library, which seems to use the state-space splitting technique. The other thing that pops up when searching for parallel rng is my code, which uses the xor followed by xorshift + multiply. > > > No interthread or interprocess or intermachine communication is > > > required. *It mostly works equally well for sequential RNG > > > instantiations in the same thread, in different threads, on sequential > > > runs, on simultaneous runs, on distributed machines, etc. *The weakest > > > case at the moment is if a thread is created, then destroyed, then > > > very quickly recreated with the same stack address and TLS address. > > > Simultaneous launches on non-windows platforms are also weak. *Every > > > other case that I can think of is strong, though not up to > > > cryptographic standards. *The platform-specific stuff is done just > > > once per application launch. *The requirements are: > > > 1. Either the library initialization function or the first automatic > > > seeding is performed before additional threads are launched. *That > > > will tend to naturally be true for programs that don't go around doing > > > crazy things with the launch and linking processes. > > > 2. the degree of lack of correlation desired across the domain not > > > exceed the square root of the statespace size. *That is almost > > > identical to a fundamental requirement for high quality RNGs that > > > would apply regardless, so it's no big deal. > > > You can fix the rapid recreate problem by using a guid. *Provided that > > the guid updates rapidly enough, there is no way other users can have > > the same initial seed state. *The trick is to make the creation stall > > long enough to fulfill this speed limitation. *Since thread creation > > is (meant to be) slow and rare, this isn't much of a constraint in > > practice. > > If fixed the non-windows issues since my last post. *I'm still not > sure what to do about the thread recreation issue. *One solution would > be to use some sort of locking mechanism or atomic math, maybe > EnterCriticalSection or InterlockedIncrement on Win32/MSVC. *But I'd > like to stay very portable... I currently use no libraries aside from > the C++ standard library (and, on windows, kernel32.lib), and nothing > compiler or OS or ISA specific without ifdefs and fallback paths on > unrecognized systems (admittedly the fallbacks are not very good for > OSes that are neither windows nor *nix...). * Another solution would > be a malloc(1) on a per-thread basis, with no corresponding free. *I'm > not familiar with "guid", but it appears to refer to a variety of > different algorithms for generating 128 bit numbers intended to be > unique across process boundaries and often network boundaries. > Apparently I would call UuidCreateSequential on win32, I'm not sure > what I would do on *nix? *But that isn't really looking any better to > me in terms of portability or speed than the other options. *Though it > could replace the RtlGenRandom and/or /dev/random usage on program > start. A "guid" is a globally unique ID. Basically, it is some algorithm or another that allows you to create a number (128 bits or higher, usually) that is guaranteed to be unique. They use things like cycle count and network MAC address to do it. If you use the 128 bits to initialize part of the seed, that leaves many other bits for state- space for the rng to work. i.e. With the 128 bits, store into the upper 64 bit quantity in the LCG, and into the 64 bits of "nonce", in rng_hash_128. Set the lower 64 bits of the LCG to be zero. This initialization gives you 2^64 random numbers before the rng can get into a state that another instance could possibly be in. > > > > > Notice how it isn't using the adc instruction at all. *It also seems > > > > to be rather poor code. :-( *The above benchmarks as taking 5.8s to do > > > > a billion random numbers. *The asm version I gave takes 4.2s *Note > > > > that the fastest of the other rng's you gave takes 4.6s on this > > > > machine. *I don't begrudge the compile much though... even simple > > > > rearrangements of the asm instructions cause the time taken to varyby > > > > up to 30%. *It is very hard to come up with the right ordering to > > > > overlap memory accesses with computation. > > > > 4.6 seconds, ow! *I know performance at the high end gets weird, but > > > still I had hoped that at least one of the RNGs I posted would do > > > better. *Several of them reliably take less then 2 seconds for thaton > > > both my Core i5 2500 and my old Core2 3 GHrz using both 64 bit gcc and > > > less than twice that using 32 bit MSVC. *All w/o asm. > > > My computer is old and slow. *Performance is relative, remember. > > Yeah. *Sigh. *Even just looking at it as relative performance... *Is > your optimizer allowed to inline your stuff and/or remap register > names, or does it have to stick to calling conventions and/or the code > as you wrote it? I deliberately made it so it couldn't. The asm is in a separate file in my tests. The C equivalent is to use the noinline and noclone gcc attributes. If you allow inlining, then benchmarking becomes meaningless. The optimizer can do all sorts of things, including eliding parts of the rng entirely. :-/ Since "typical" use will be via a function call into a library, this isn't unrealistic. > *On my benchmarks, if I force it to not inline the > calls then they hit a performance maximum around 300 million calls per > second (though it varies quite a bit, from a low around 100 to a high > around 500, it's usually between 250 and 325), regardless of how fast > the RNG itself is. I don't see this. The difference between the slow and fast rng is quite pronounced here. There also is very little variability in the timings from one run to another. What does vary greatly is the effect of instruction ordering. If I rearrange a couple of instructions, the results can change by 20% or more. This makes optimizing a chore, as many combinations need to be tested in order to find the best. > *Even return 0 will get capped at that rate. *If I > let it inline, even then it will still hit a cap around 1.1 billion > calls per second even for very simple RNGs like "int random() {int tmp > = a; a += b; return tmp;}". *A simple "return 0;" is faster still, but > by so much that I suspect it optimizes away almost the entire calling > loop in that instance. That's the problem. You need to avoid letting the optimizer see the rng code when it makes the code for the benchmark loop. Otherwise the results are contaminated. > > > The following variant of that: > > > * * * * Word tmp = high; > > > * * * * tmp ^= tmp >> (OUTPUT_BITS / 2); > > > * * * * tmp *= K; > > > * * * * tmp ^= rotate(tmp ^ low, OUTPUT_BITS/3); > > > * * * * Word old = low; > > > * * * * low += K; > > > * * * * high += old + ((low < K) ? 1 : 0); > > > * * * * return tmp; > > > > ...does much better on statistical tests when used on 32 bit integers > > > instead of 64 bit integers. *For the full 64 bit versions there is no > > > way to distinguish quality as both that and the original pass all > > > tests I've tried (when the xored value wasn't low-entropy), but > > > probably the 32 bit version is representative. *It still has the > > > interstate correlation issues, but those are very hard to get rid of > > > on fast random access RNGs, and no worse than the ones in the CLCG I'm > > > looking to replace. *Adding a second multiply and another xorshift > > > gets the interstate correlation issue under control without hurting > > > performance intolerably - users who want performance faster than that > > > can damn well use a non-random-access RNG. *I might also try combining > > > successive elements like my hashed LFSR did to see if I can eliminate > > > the need for a 2nd multiply that way. *I'm thinking of adding a third > > > word of state doing something dumb to get the statespace a little > > > bigger, but at this point really it's looking good enough to be worth > > > replacing the CLCG. > > > I tried something much like the above, and it wasn't very good. *It > > either had poor randomness, or failed to be fast enough to bother > > with. > > That change had no performance impact on my computer (it's becoming > increasing apparent that we see very different performance impact from > changes), and did well on statistical tests when used on 32 bit > integers where the original (when changed to use 32 bit integers) did > poorly (IIRC failing after 5 minutes or so). *A few other tweaks got > it to the point of passing an 8 hour test on 32 bit, I didn't try any > longer tests. As I've said... even small changes to the orderings of instructions can give large relative changes in timing here. It is really annoying. > > > Okay... after quite a bit of effort, I've found something much, much > > better. > > > This is three different rng's in one: > > > .align 16 > > .globl rng_hash_128 > > .type *rng_hash_128,@function > > rng_hash_128: > > * * * * mov * *8(%rdi), %rax > > > * * * * movabs $0x6595a395a1ec531b, %rcx > > > # * Option 1) > > # * Non-threaded, fastest. *No xor instruction used. > > > # * Option 2) > > # * Threaded, use the address of the seed as a nonce. > > # * * *xor * *%rdi, %rax > > > # * Option 3) > > # * Threaded, pass a nonce as a second parameter. > > # * * * xor * *%rsi, %rax > > # *or have it stored in a larger seed: > > # * * *xor * *16(%rdi), %rax > > > * * * * mov * *(%rdi), %rsi > > * * * * mul * *%rcx > > > * * * * add * *%rcx, (%rdi) > > * * * * adc * *%rsi, 8(%rdi) > > > * * * * xor * *%rsi, %rax > > * * * * xor * *%rdx, %rax > > * * * * mul * *%rcx > > * * * * add * *%rsi, %rax > > * * * * add * *%rdx, %rax > > > * * * * retq > > .size * rng_hash_128, .-rng_hash_128 > > > For non-threaded use, pick the one without the extra xor instruction. > > This is fastest, taking 3.77s to output 10^9 64bit random numbers on > > this machine. > > I get about 2.5 GB/s when I stick that in as inline asm on gcc. *About > 300 million calls per second, about 10 cycles per call. *I used this > code: > Uint64 rv; > asm volatile ( > * * * * "mov * *8(%%rdi), %%rax \n" > * * * * "movabs $0x6595a395a1ec531b, %%rcx \n" > * * * * "mov * *(%%rdi), %%rsi \n" > * * * * "mul * *%%rcx \n" > * * * * "add * *%%rcx, (%%rdi) \n" > * * * * "adc * *%%rsi, 8(%%rdi) \n" > * * * * "xor * *%%rsi, %%rax \n" > * * * * "xor * *%%rdx, %%rax \n" > * * * * "mul * *%%rcx \n" > * * * * "add * *%%rsi, %%rax \n" > * * * * "add * *%%rdx, %%rax \n" > * * * * :"=a"(rv)//outputs > * * * * :"D"(this)//inputs > * * * * :"memory", "%rcx", "%rsi", "%rdx"//clobbers > ); > return rv; > If I left off the "volatile" keyword then the optimizer seemed to go > berserk on non-vtable calls to it for some reason, though vtable calls > remained the same. *Probably I screwed up the constraints somehow. Why on earth is a vtable required? Can't you just use a normal function call? That's what I do. The asm is compiled in a separate .s file, and then linked in later. > I also tried implementing it in C, but neither gcc nor MSVC will let > me use a 128 bit integer. *I did implement it in C by scaling it down > to 32 bit instead of 64 bit and compiling on a 32 bit compiler. *It > wasn't particularly fast that way relative to other 32 bit RNGs. Are you in 32 bit mode? You can use 128 bit integers in 64 bit mode with gcc. If you are using 32 bits, you are out of luck. MSVC has intrinsics which allow the use of 64x64->128 multiplies in 64 bit mode. > > > So by picking a very fast f(), with poor randomness, and a good-enough > > g() we can obtain a very fast rng that still passes stringent > > statistical tests. *The problem is finding a good fast hash. > > Fortunately, the extended-precision multiply used above works for g(). > > (A fast yet poor quality LCG easily works for f() ) > > > Note: when re-implementing the above, be careful with the instruction > > ordering. *Even small rearrangements can alter the speed by tens of > > percent. > > > Steven > > On the complex-output-functions = better-parallelism argument I'm > pretty meh. *Yeah, the output function can be done in parallel with > the state transition function. *Yeah, the output function can be done > in parallel with the next output function if the state transition is > fast enough and the calling code is nice enough and the scheduler is > psychic enough. *But there are severe limits to how much can be done > at once, both due to limits of execution resources (large integer > multiplication in particular takes a huge transistor & power budget) > and due to scheduling difficulty. *My understanding is that most CPUs > can't maintain a IPC more than 4 on any remotely real world code no > matter how parallel it is, far lower if lots of complex operations > like large integer multiplication are involved. *Notice how recent CPU > generations haven't really been getting significantly more > superscalar? *These days heat is the fundamental limiting factor, and > executing all those operations will produce the same amount of heat > regardless of how how smart the scheduler is... in fact a smarter > scheduler will just add overhead to the thermal budget, so the next > few CPU generations are not likely to emphasize OOO any more than the > last few have (in the desktop world anyway...). > > Also, most fast RNGs with simple output functions have highly parallel > state transition functions, typically 2 to 4 simple operations. *Your > code there is a critical path of 2 simple operations on the state > transition function, and 6 operations on the output function, two of > which are high latency instructions. *mul 64x64->128 is reported to be > 5 cycle latency on recent AMD, Intel's most recent optimization guide > seems to claim a latency of 14-18 cycles on that but probably I'm > misreading that somehow as it seems improbably high... a quick > googling suggests it might be closer to 3 cycles. *Since you have two > of them in your critical path I end up with a best-case performance on > the order of 6-10 cycles). *If the output never actually gets used, > that's a different story... the simple transition function might > dominate then. Be careful... I'm stating what I'm measuring, not what is theoretically happening. For example, if I remove the final multiply + xor sequence, the time taken doesn't change. It appears that the time taken to access L1 in the add, adc chain is much longer than naive cycle-counting would predict. Steven |
|
|||
|
On Jan 11, 4:59*pm, orz <cdh...@nospicedham.gmail.com> wrote:
> On Jan 11, 3:37*am, orz <cdh...@nospicedham.gmail.com> wrote: > > > Also, most fast RNGs with simple output functions have highly parallel > > state transition functions, typically 2 to 4 simple operations. *Your > > code there is a critical path of 2 simple operations on the state > > transition function, and 6 operations on the output function, two of > > which are high latency instructions. *mul 64x64->128 is reported to be > > 5 cycle latency on recent AMD, Intel's most recent optimization guide > > seems to claim a latency of 14-18 cycles on that but probably I'm > > misreading that somehow as it seems improbably high... a quick > > googling suggests it might be closer to 3 cycles. *Since you have two > > of them in your critical path I end up with a best-case performance on > > the order of 6-10 cycles). *If the output never actually gets used, > > that's a different story... the simple transition function might > > dominate then. > > I was a bit tired when I wrote that, lots of typos and mistakes though > I stand by the general concept. *The talk of latencies should have > mentioned throughputs and ratios of throughputs to latencies. *Intel > claims that mul 64x64->128 should have a latency of 4 to 10 (usually > 10) cycles on core 7 or 7 cycles on core 2, and a throughput of 1 > cycle on core 7 or 3 cycles on core 2. *The ratio of those latencies > to throughputs ranges from 2.3 to 10.0. *Which implies that executing > sequential output functions in parallel with each other is a > significant possibility for that RNG. *BTW, add-with-carry is listed > as having significantly worse latency and throughput than regular > add... add at 1 latency and 0.33 throughput on all core7s and core2s, > adc at 2 latency and 1 to 1.5 throughput, cmp+cmov+add at 4 latency, > 1.16 to 1.66 throughput (adding the throughputs of the individual > instructions, though technically that's not correct). > > If we assume that the scheduling is perfect and there's no extra > complications from the calling code or anything, then what matters is > the larger of the overall functions throughput and the state > transition functions critical path latency. *For that RNG, that looks > like 3 cycles latency for the state transition function on most of > Intel's desktop CPUs, and an overall throughput in the neighborhood of > ~5 cycles on core7 or ~8 cycles on core2. *Compare that to, say, the > RNG on Robert Jenkins small fast RNG design page, with a critical path > on the state transition function of 1 rotate and 2 additions, and a > grand total of 4 additions, 1 xor, and 2-3 rotates - a critical path > latency of 3 cycles and an overall throughput of a little less than 3 > cycles. > > Of course, that's all based upon massively simplified models of the > CPUs performance - reality rarely conforms to it. *But since our main > alternative seems to be benchmarks that yield very different results > on different computers... Measurement is easier: This machine is 2.4GHz Opteron. It takes 3.773s to complete 10^9 calls to the single-threaded version of the rng. 2.4 * 3.773 = 9.0552 Thus it takes 9 cycles per loop iteration. If you remove the loop overhead it'll be faster... Don't forget the L1 access latency in your cycle count prediction. Registers are much faster than L1 on modern machines. The L1 latency on this machine is 3 cycles. On I7, I'm fairly sure it is 4 cycles. Steven |
|
|||
|
On Jan 11, 6:56*pm, sfuerst <svfue...@nospicedham.gmail.com> wrote:
> > Yeah. *Sigh. *Even just looking at it as relative performance... *Is > > your optimizer allowed to inline your stuff and/or remap register > > names, or does it have to stick to calling conventions and/or the code > > as you wrote it? > > I deliberately made it so it couldn't. *The asm is in a separate file > in my tests. *The C equivalent is to use the noinline and noclone gcc > attributes. *If you allow inlining, then benchmarking becomes > meaningless. *The optimizer can do all sorts of things, including > eliding parts of the rng entirely. :-/ > > Since "typical" use will be via a function call into a library, this > isn't unrealistic. > > > *On my benchmarks, if I force it to not inline the > > calls then they hit a performance maximum around 300 million calls per > > second (though it varies quite a bit, from a low around 100 to a high > > around 500, it's usually between 250 and 325), regardless of how fast > > the RNG itself is. > > I don't see this. *The difference between the slow and fast rng is > quite pronounced here. *There also is very little variability in the > timings from one run to another. *What does vary greatly is the effect > of instruction ordering. *If I rearrange a couple of instructions, the > results can change by 20% or more. *This makes optimizing a chore, as > many combinations need to be tested in order to find the best. > > > *Even return 0 will get capped at that rate. *If I > > let it inline, even then it will still hit a cap around 1.1 billion > > calls per second even for very simple RNGs like "int random() {int tmp > > = a; a += b; return tmp;}". *A simple "return 0;" is faster still, but > > by so much that I suspect it optimizes away almost the entire calling > > loop in that instance. > > That's the problem. *You need to avoid letting the optimizer see the > rng code when it makes the code for the benchmark loop. *Otherwise the > results are contaminated. Another area where we've come from similar goals to rather different approaches. 1. My view is that inlining is in fact representative. MSVC will inline static library calls (if the information needed is present... which bloats the size of the .lib files by quite a bit), having them as separate files is insufficient. I believe Intel C can do that as well. gcc will not so far as I know, at the moment anyway, but that could change after a few more gcc releases, and even now it will still inline functions defined in the headers of libraries. I usually don't put the algorithms in the headers, but when the algorithm is implemented in a buffered fashion I do put the buffer reading portion in to the header with only the buffer refill logic hidden in the .cpp file. And for the simplest of algorithms, those that have very small code size, I've been considering moving their implementation in to the header. 2. To prevent the calls from being optimized away, I actually perform math and IO dependent upon the result of the calls. Admittedly very very simple math to prevent it from reducing the performance of the loop, and IO that is absurdly unlikely to happen to prevent it from cluttering up the output. Unfortunately, if the RNG is really REALLY simple (ie "return 0;" or somesuch - anything with a state transition function that doesn't do any transitioning) then the compiler can actually figure out the results of the math without calling the RNG repeatedly because all calls have the same result anyway. This is intended to be representative of an RNG user that demands maximum performance from an RNG - they need to do something with the result or they wouldn't be calling the RNG in the first place, but if they want the fastest possible RNG then they must be using the results for something with low performance impact. Unfortunately, gcc seems to have trouble with my loop when combined with inline asm, apparently doing the math incorrectly... probably I defined the constraints on the inline asm incorrectly somehow. 3. Like many RNG packages aimed at researchers, I support run-time RNG selection via polymorphic interfaces. Unlike them I also offer the same RNG algorithms by a non-polymorphic but otherwise identical interface for maximum performance and minimum size. I benchmark all RNG algorithms through both interfaces. If I limit myself to non-inline RNG calls then my best performance on this CPU drops down to about 3 seconds per billion calls. And your latest RNG comes in first out of the set of RNGs I currently have active in my benchmark program, which includes many of my fastest RNGs (though I've been optimizing more for inline performance than non- line, since the differences are pretty small in non-inline). Which ones are fastest in non-inline contexts is pretty weird in my testing, even weirder than inline RNG performance I think, and has very small performance differences. So basically, it looks like much of our performance difference is simply that you are measuring and optimizing for non-inline performance (representative for library calls on gcc), while I am mostly optimizing for inline performance (representative on some compilers, or if the core RNG algorithm is defined in a header) though I measure both. > > I also tried implementing it in C, but neither gcc nor MSVC will let > > me use a 128 bit integer. *I did implement it in C by scaling it down > > to 32 bit instead of 64 bit and compiling on a 32 bit compiler. *It > > wasn't particularly fast that way relative to other 32 bit RNGs. > > Are you in 32 bit mode? *You can use 128 bit integers in 64 bit mode > with gcc. *If you are using 32 bits, you are out of luck. *MSVC has > intrinsics which allow the use of 64x64->128 multiplies in 64 bit > mode. I tried both 64 and 32 bit gcc. And I tried consulting the docs. If I'm reading the docs correctly, the issue is that gcc does not support 128 bit integers in C++, only in C. WTF gcc? I suppose I could try creating a separate .c file just for sticking in RNGs with 128 bit integers and adding some glue logic to allow it to interface with my C+ + testing framework but... really, that doesn't sound that entertaining. Probably if I just wait a while someone will add 128 bit integer support to gcc. > > > > > So by picking a very fast f(), with poor randomness, and a good-enough > > > g() we can obtain a very fast rng that still passes stringent > > > statistical tests. *The problem is finding a good fast hash. > > > Fortunately, the extended-precision multiply used above works for g().. > > > (A fast yet poor quality LCG easily works for f() ) > > > > Note: when re-implementing the above, be careful with the instruction > > > ordering. *Even small rearrangements can alter the speed by tens of > > > percent. > > > > Steven > > > On the complex-output-functions = better-parallelism argument I'm > > pretty meh. *Yeah, the output function can be done in parallel with > > the state transition function. *Yeah, the output function can be done > > in parallel with the next output function if the state transition is > > fast enough and the calling code is nice enough and the scheduler is > > psychic enough. *But there are severe limits to how much can be done > > at once, both due to limits of execution resources (large integer > > multiplication in particular takes a huge transistor & power budget) > > and due to scheduling difficulty. *My understanding is that most CPUs > > can't maintain a IPC more than 4 on any remotely real world code no > > matter how parallel it is, far lower if lots of complex operations > > like large integer multiplication are involved. *Notice how recent CPU > > generations haven't really been getting significantly more > > superscalar? *These days heat is the fundamental limiting factor, and > > executing all those operations will produce the same amount of heat > > regardless of how how smart the scheduler is... in fact a smarter > > scheduler will just add overhead to the thermal budget, so the next > > few CPU generations are not likely to emphasize OOO any more than the > > last few have (in the desktop world anyway...). > > > Also, most fast RNGs with simple output functions have highly parallel > > state transition functions, typically 2 to 4 simple operations. *Your > > code there is a critical path of 2 simple operations on the state > > transition function, and 6 operations on the output function, two of > > which are high latency instructions. *mul 64x64->128 is reported to be > > 5 cycle latency on recent AMD, Intel's most recent optimization guide > > seems to claim a latency of 14-18 cycles on that but probably I'm > > misreading that somehow as it seems improbably high... a quick > > googling suggests it might be closer to 3 cycles. *Since you have two > > of them in your critical path I end up with a best-case performance on > > the order of 6-10 cycles). *If the output never actually gets used, > > that's a different story... the simple transition function might > > dominate then. > > Be careful... I'm stating what I'm measuring, not what is > theoretically happening. *For example, if I remove the final multiply > + xor sequence, the time taken doesn't change. *It appears that the > time taken to access L1 in the add, adc chain is much longer than > naive cycle-counting would predict. Yeah, reality trumps theory. But, real world RNG using code might use the output in some way that made the next call to the RNG dependent upon the previous ones results, from the schedulers perspective anyway. Or do something else to make latency more important relative to throughput. And, are you actually using the output for anything? The CPUs scheduler acts a little bit like the compilers optimizer... I don't *think* it can completely discard scheduled instructions that become meaningless due to later instructions, but maybe it's smarter than I think? And, as mentioned above, I consider inline performance at least as representative for small code size RNGs. |
|
|||
|
I've modified the xorshift with a period of 2^128-1,
adding two LFSR's with period 2^32-1 and 2^31-1 and combined them with a J-K style flip-flop. It passed the crush/bigcrush tests and is not too slow. Feedback is welcome! Pedro Pereira static uint32_t state_0 = 0xEA2CBD78; static uint32_t state_1 = 0x07BF10D5; static uint32_t state_2 = 0x5FA0B00A; static uint32_t state_3 = 0x06D98C3F; static uint32_t lfsr_0 = 0x29EB68FE; static uint32_t lfsr_1 = 0x4CD98AA6; static uint32_t lfsr_01 = 0x22C3407B; static uint32_t cycle = 0x5D8291F3; static inline uint32_t lfsr(uint32_t n, uint32_t m) { return (n >> 1) ^ (-(n & 1) & m); } uint32_t get_pseudo_random(void) { uint32_t save; save = state_0 ^ (state_0 << 13); state_0 = state_1; state_1 = state_2; state_2 = state_3; state_3 = state_3 ^ (state_3 >> 17) ^ (save ^ (save >> 7)); lfsr_0 = lfsr(lfsr_0, 0xE28F75ED); lfsr_1 = lfsr(lfsr_1, 0x400005B6); lfsr_01 = (lfsr_0 & ~lfsr_01) | (lfsr_1 & lfsr_01); cycle += 0x5D31995A; return (state_3 + lfsr_01) ^ cycle; } |
|
|||
|
On Jan 12, 10:25*am, Pedro Pereira <ppere...@nospicedham.grupopie.com>
wrote: > I've modified the xorshift with a period of 2^128-1, > adding two LFSR's with period 2^32-1 and 2^31-1 and > combined them with a J-K style flip-flop. > > It passed the crush/bigcrush tests and is not too slow. > > Feedback is welcome! > > Pedro Pereira > > static uint32_t state_0 = 0xEA2CBD78; > static uint32_t state_1 = 0x07BF10D5; > static uint32_t state_2 = 0x5FA0B00A; > static uint32_t state_3 = 0x06D98C3F; > static uint32_t lfsr_0 = 0x29EB68FE; > static uint32_t lfsr_1 = 0x4CD98AA6; > static uint32_t lfsr_01 = 0x22C3407B; > static uint32_t cycle = 0x5D8291F3; > > static inline uint32_t lfsr(uint32_t n, uint32_t m) > { > * return (n >> 1) ^ (-(n & 1) & m); > > } > > uint32_t get_pseudo_random(void) > { > * uint32_t save; > > * save = state_0 ^ (state_0 << 13); > * state_0 = state_1; > * state_1 = state_2; > * state_2 = state_3; > * state_3 = state_3 ^ (state_3 >> 17) ^ (save ^ (save >> 7)); > > * lfsr_0 = lfsr(lfsr_0, 0xE28F75ED); > * lfsr_1 = lfsr(lfsr_1, 0x400005B6); > * lfsr_01 = (lfsr_0 & ~lfsr_01) | (lfsr_1 & lfsr_01); > > * cycle += 0x5D31995A; > * return (state_3 + lfsr_01) ^ cycle; > > > > > > > > } The first LFSR is pretty good for an LFSR. Very good for an LFSR of its state size. Not great in any absolute sense though - it's still an LFSR. The two small LFSRs are extremely low quality. The way they are combined is rather unusual, but doesn't help on my tests. The overall function is pretty slow. I could run a standard cryptographic RNG instead and have faster output. Broken down as return values vs data and/or CPU time to fail my testing regime state_3: 32 GB (8 minutes) lfsr_0: 4 KB (< 1 second) lfsr_1: 1 KB (the smallest unit my tests will even think of operating on) lfsr_01: 8 KB (< 1 second) cycle: 1 KB (the smallest unit my tests will even think of operating on) lfsr_0 ^ cycle: 16 MB (< 1 second) lfsr_1 ^ cycle: 2 MB (< 1 second) lfsr_01 ^ cycle: 8 MB (< 1 second) state_3 ^ cycle: passed 128 GB so far (half an hour) state_3 + lfsr0: not yet tested state_3 + lfsr1: not yet tested state_3 + lfsr01: not yet tested (state_3 + lfsr01) ^ cycle: passed 256 GB so far (an hour) I've got 2 cores dedicated to testing variants atm, I'll post again when they've filled out the rest of that table. |
|
|||
|
On Jan 12, 12:04*pm, orz <cdh...@nospicedham.gmail.com> wrote:
> On Jan 12, 10:25*am, Pedro Pereira <ppere...@nospicedham.grupopie.com> > wrote: > > > > > > > > > > > I've modified the xorshift with a period of 2^128-1, > > adding two LFSR's with period 2^32-1 and 2^31-1 and > > combined them with a J-K style flip-flop. > > > It passed the crush/bigcrush tests and is not too slow. > > > Feedback is welcome! > > > Pedro Pereira > > > static uint32_t state_0 = 0xEA2CBD78; > > static uint32_t state_1 = 0x07BF10D5; > > static uint32_t state_2 = 0x5FA0B00A; > > static uint32_t state_3 = 0x06D98C3F; > > static uint32_t lfsr_0 = 0x29EB68FE; > > static uint32_t lfsr_1 = 0x4CD98AA6; > > static uint32_t lfsr_01 = 0x22C3407B; > > static uint32_t cycle = 0x5D8291F3; > > > static inline uint32_t lfsr(uint32_t n, uint32_t m) > > { > > * return (n >> 1) ^ (-(n & 1) & m); > > > } > > > uint32_t get_pseudo_random(void) > > { > > * uint32_t save; > > > * save = state_0 ^ (state_0 << 13); > > * state_0 = state_1; > > * state_1 = state_2; > > * state_2 = state_3; > > * state_3 = state_3 ^ (state_3 >> 17) ^ (save ^ (save >> 7)); > > > * lfsr_0 = lfsr(lfsr_0, 0xE28F75ED); > > * lfsr_1 = lfsr(lfsr_1, 0x400005B6); > > * lfsr_01 = (lfsr_0 & ~lfsr_01) | (lfsr_1 & lfsr_01); > > > * cycle += 0x5D31995A; > > * return (state_3 + lfsr_01) ^ cycle; > > > } > > The first LFSR is pretty good for an LFSR. *Very good for an LFSR of > its state size. *Not great in any absolute sense though - it's still > an LFSR. > > The two small LFSRs are extremely low quality. *The way they are > combined is rather unusual, but doesn't help on my tests. > > The overall function is pretty slow. *I could run a standard > cryptographic RNG instead and have faster output. > > Broken down as return values vs data and/or CPU time to fail my > testing regime > state_3: 32 GB (8 minutes) > lfsr_0: 4 KB (< 1 second) > lfsr_1: 1 KB (the smallest unit my tests will even think of operating > on) > lfsr_01: 8 KB (< 1 second) > cycle: 1 KB (the smallest unit my tests will even think of operating > on) > lfsr_0 ^ cycle: 16 MB (< 1 second) > lfsr_1 ^ cycle: 2 MB (< 1 second) > lfsr_01 ^ cycle: 8 MB (< 1 second) > state_3 ^ cycle: passed 128 GB so far (half an hour) > state_3 + lfsr0: not yet tested > state_3 + lfsr1: not yet tested > state_3 + lfsr01: not yet tested > (state_3 + lfsr01) ^ cycle: passed 256 GB so far (an hour) > > I've got 2 cores dedicated to testing variants atm, I'll post again > when they've filled out the rest of that table. It's looking like state_3 combined with anything else passes all my tests. I am not fan of combining multiple RNGs in that way. I would rather have every bit of state either effect or be effected by every other bit of state, or both. When the RNG can be divided in to independent component RNGs that tends to mean both decreased overall quality and increased inter-state/seed correlation relative to the quality. I want to back away from my earlier statement "I could run a standard cryptographic RNG instead and have faster output". Really, the common cryptographic RNGs are pretty slow in software - to find one faster than this I have to basically cheat, picking ones that use larger integer sizes and are not well accepted as secure. And while your RNG is kinda slow in my book, it's still faster than mt19937 on gcc (though slower on MSVC), so there's plenty of people who wouldn't consider it slow - I have pretty high standards in that regard. Also, leaving out lfsr_1 and lfsr_01 has, I think, fairly little cost in output quality or statespace, and yields a significant improvement in speed. Though it would substantially worsen interstate correlation issues I think. BTW, how did you choose the constants for the state_* LFSR? I'm familiar with methods of finding maximal period constants and have a few rules of thumb / common sense to avoid bad constants, but after that I mostly stick to either empirical testing or avalanche-style testing to distinguish good constants from bad constants. If I had a better feel for where the constants came from I might scale it down to working on 16 or 8 bit integers for testing purposes, as it's a lot easier to find flaws when corner cases are more frequent and the state size is closer to the log of the number of bins in my chi squared tests. |
|
|||
|
orz wrote:
[...] > It's looking like state_3 combined with anything else passes all my > tests. > > I am not fan of combining multiple RNGs in that way. I would rather > have every bit of state either effect or be effected by every other > bit of state, or both. When the RNG can be divided in to independent > component RNGs that tends to mean both decreased overall quality and > increased inter-state/seed correlation relative to the quality. > > I want to back away from my earlier statement "I could run a standard > cryptographic RNG instead and have faster output". Really, the common > cryptographic RNGs are pretty slow in software - to find one faster > than this I have to basically cheat, picking ones that use larger > integer sizes and are not well accepted as secure. And while your RNG > is kinda slow in my book, it's still faster than mt19937 on gcc > (though slower on MSVC), so there's plenty of people who wouldn't > consider it slow - I have pretty high standards in that regard. Also, > leaving out lfsr_1 and lfsr_01 has, I think, fairly little cost in > output quality or statespace, and yields a significant improvement in > speed. Though it would substantially worsen interstate correlation > issues I think. Using a block cipher was the advantage of being easy to create a random access pseudo random number generator (just use it in counter mode). I tried to pick one, reduce the number of rounds and overall operations, but I wasn't able to make it fast enough. > > BTW, how did you choose the constants for the state_* LFSR? I'm > familiar with methods of finding maximal period constants and have a > few rules of thumb / common sense to avoid bad constants, but after > that I mostly stick to either empirical testing or avalanche-style > testing to distinguish good constants from bad constants. If I had a > better feel for where the constants came from I might scale it down to > working on 16 or 8 bit integers for testing purposes, as it's a lot > easier to find flaws when corner cases are more frequent and the state > size is closer to the log of the number of bins in my chi squared > tests. I'm using the same method. Pedro Pereira |
|
|||
|
Hi,
playing with a custom MWC algo http://en.wikipedia.org/wiki/Multiply-with-carry i have found something better (most times) than that from Steven, using a reduced vector of 32x8 bytes. the main difference with that of Steven is that here you can optimize or customize it without variating to much from the main performances it offers. it is called /pi_rand/ because i use a sample from pi as vector usage: mov rcx,362436 xor edx,edx call .pi_randA ..pi_randA: ;--- RSI point to a vector minimum EDX*8 ;--- in ECX c ;--- in EDX i ;--- out EAX,ECX,EDX ;--- preserve ECX,EDX ;-------------------------- ;--- i = (i + 1) & 4095; ;--- customize: RSI points to a vector ;--- large at least EDX*8 inc edx and edx,01Fh ;--- 4095 original mov rax,[rsi+rdx] ;--- t = a * Q[i] + c; imul rax,18782 ror rax,11 ;<--- customize here or comment add rax,rcx add rcx,rax ;--- c = (t >> 32); shr rcx,32 add eax,ecx ;--- x = t + c; cmp eax,ecx ;--- if (x < c) { sbb r8,r8 neg r8 adc ecx,0 ;--- c++; add eax,r8d ;--- x++; ;--- (Q[i] = r - x); sub eax,-2 ;--- sub x,r neg eax ;--- neg x mov [rsi+rdx],eax ;--- return (Q[i] = r - x); ret i cannot attach a plot because gnuplot runs out of memory after 8 MiB binary data (sic!), when i need at least 16. gnuplot> load "E:/gnuplot/demo/mydemoB.dem" "E:/gnuplot/demo/mydemoB.dem", line 6: out of memory f or expanding curve points and,as far as it seems, the supporters too refuse categorically to face that serious problem again; they discussed about it in the remote 2009 A.D. http://groups.google.com/group/comp....c0388d96bddbb5 and their bugged philosopy seems to be: "gnuplot is not a data analysis tool, it is a plotting program" oh,... ....really ? :-) i wonder how could they start/continue coding a plotting program without considering data,and how huge the bulk of datas may be. Cheers, ..:mrk[hopcode] .:x64lab:. group http://groups.google.com/group/x64lab site http://sites.google.com/site/x64lab |
|
|
![]() |
| Thread Tools | |
| Display Modes | |
|
|