20101207, 15:15  #1 
Dec 2010
10_{8} Posts 
Fast Mersenne Testing on the GPU using CUDA
I'd like to announce the implementation of a LucasLehmer tester, gpuLucas, written in CUDA and running on Fermiclass NVidia cards. It's a full implementation of Crandall's IBDWT method and uses balanced integers and a few little tricks to make it fast on the GPU.
Example timing: demonstrated primality of M_{42643801} in 57.86 hours, at a rate of 4.88 msec per Lucas product. This used a DWT runlength of 2,359,296 = 2^{18}*3^{2}, taking advantage of good efficiency for CUFFT runlengths of powers of small primes. Maximum error was 1.8e1. gpuLucas has been tested on GTX 480 and Tesla 2050 cards; there's actually very little difference in runtimes between the two...fears of a performance hit due to slow floating point on the 480 are bogusit's a wicked fast card for the GPGPU stuff; you get an additional 32 CUDA cores in place of the faster double precision, and it's clocked much faster than the Tesla. The Tesla only really shines when you overclock the heck out of it; I ran it up to 1402 Mhz for the above test, at which point it is 1520% faster than the GTX for the big Mersenne numbers. (It depends on the FFT length, though, and when the greater number of processors on the GTX are offset by slower double precision, which is only used in the FFTs anyway.) Finishing off a paper on the topic, and will post a preprint here in a week or so. I'll make the code available publicly as well, and maybe set up a tutorial webpage if folks are interested and if time permits. 
20101207, 20:14  #2 
Jul 2009
Tokyo
2·5·61 Posts 
Hi ,Andrew Thall
Congratulations ! 
20101207, 20:39  #3 
Account Deleted
"Tim Sorbera"
Aug 2006
San Antonio, TX USA
3×1,423 Posts 
When they use the same FFT lengths, how does the speed of this program compare to MacLucasFFTW? In any case, the flexibility of having nonpowerof2 FFTs makes it a very attractive choice compared to MacLucasFFTW.
Last fiddled with by MiniGeek on 20101207 at 20:40 
20101207, 22:47  #4 
Aug 2006
3·1,993 Posts 

20101208, 00:50  #5 
6809 > 6502
"""""""""""""""""""
Aug 2003
101×103 Posts
2^{2}×3×5×167 Posts 
A verification run in 3 days!?!?!

20101208, 00:57  #6 
Dec 2009
Peine, Germany
331 Posts 
Sounds great
We are very interested. I would buy a GTX 460 just for running your program. ;) Verification in 3 days? Wow. What would CUDALucas have needed?
Last fiddled with by Brain on 20101208 at 00:58 
20101208, 01:21  #7 
Jul 2009
Tokyo
2·5·61 Posts 

20101208, 02:04  #8 
Bemusing Prompter
"Danny"
Dec 2002
California
2×17×71 Posts 
I'm usually a bit leery when a brand new user makes such a bold claim; after all, we do get a fair share of trolls and cranks here (for example, someone recently claimed to have written an OpenCLenabled siever but following up after his second post).
However, I am 99% sure that this is legit because the OP in this thread seems to know what he is talking about. If the "gpuLucas" really works as claimed, it will greatly benefit the GIMPS community. Last fiddled with by ixfd64 on 20101208 at 02:06 
20101208, 02:14  #9 
Nov 2009
2·5^{2}·7 Posts 

20101208, 02:52  #10  
Bemusing Prompter
"Danny"
Dec 2002
California
2·17·71 Posts 
Quote:
No offense to msft, but it looks like that CUDALucas just got owned! 

20101208, 03:17  #11 
Jul 2009
Tokyo
2·5·61 Posts 

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
mfaktc: a CUDA program for Mersenne prefactoring  TheJudger  GPU Computing  3509  20211022 11:54 
Do normal adults give themselves an allowance? (...to fast or not to fast  there is no question!)  jasong  jasong  35  20161211 00:57 
Find Mersenne Primes twice as fast?  Derived  Number Theory Discussion Group  24  20160908 11:45 
TPSieve CUDA Testing Thread  Ken_g6  Twin Prime Search  52  20110116 16:09 
Fast calculations modulo small mersenne primes like M61  Dresdenboy  Programming  10  20040229 17:27 