Cudaminer October 10th 2013 alpha


SUBMITTED BY: Guest

DATE: Nov. 17, 2013, 4:53 p.m.


CudaMiner release October 10th 2013 - alpha release
---------------------------------------------------

This is a CUDA-accelerated mining application for Litecoin only.
The most computationally heavy parts of the scrypt algorithm (the
Salsa20/8 iterations) are run on the GPU.
You should see a notable speed-up compared to OpenCL-based miners.

Some numbers from my testing:
GTX 260:    44 kHash/sec (maybe more on Linux/WinXP)
GTX 640:    40 kHash/sec
GTX 460:   100 kHash/sec
GTX 560Ti: 140 kHash/sec
GTX 660Ti: 176 kHash/sec (or 225 kHash/sec on the 448 core edition)

NOTE: Compute 1.0 through 1.3 devices seem to run faster on Windows XP
or Linux.

Your nVidia cards will now suck a little less for mining! This tool
will automatically use all nVidia GPUs found in your system, but the
number of devices used can be limited using the "-t" option, or the
devices can even be selected individually with the "-d" option.

This code is based on pooler's cpuminer 2.3.2 release and inherits
its command line interface and options.

Additional command line options are:

--no-autotune         disables the built-in autotuning feature for
                      maximizing CUDA kernel efficiency and uses some
                      heuristic guesswork, which might not be optimal.

--devices       [-d]  gives a list of CUDA device IDs to operate on.
                      Device IDs start counting from 0!

--launch-config [-l]  specifies the kernel launch configuration per device.
                      This replaces autotune or heuristic selection. You can
                      pass the string "auto", or just a kernel prefix like
                      L, F, K or T to autotune for a specific card generation,
                      or a kernel prefix plus a launch configuration like F28x8
                      if you know which kernel runs best (from a previous
                      autotune).

--interactive   [-i]  list of flags (0 or 1) to enable interactive
                      desktop performance on individual cards. Use this
                      to remove lag at the cost of some hashing performance.
                      Do not use large launch configs for devices that shall
                      run in interactive mode; it's best to use autotune!

--texture-cache [-C]  list of flags (0, 1 or 2) to enable use of the
                      texture cache for reading from the scrypt scratchpad.
                      1 uses a 1D cache, whereas 2 uses a 2D texture layout.
                      Cached operation has proven to be slightly faster than
                      noncached operation on most GPUs.

--single-memory [-m]  list of flags (0 or 1) to make the devices
                      allocate their scrypt scratchpad in a single,
                      consecutive memory block. On Windows Vista, 7 and 8
                      this may lead to a smaller memory size being used.
                      When using the texture cache this option is implied.

--hash-parallel [-H]  1 to enable parallel hashing on the CPU. This may
                      use more CPU but distributes the hashing load neatly
                      across all CPU cores. Use 0 otherwise, which is now
                      the default.

>>> Example command line options, advanced use <<<

cudaminer.exe -H 1 -d 0,1,2 -i 1,0,0 -l auto,F27x3,K28x4 -C 0,2,1 -o stratum+tcp://coinotron.com:3334 -O workername:password

The option -H 1 distributes the CPU load across all available cores.
I instruct cudaminer to use devices 0, 1 and 2. Because I have the display
attached to device 0, I set that device to run in interactive mode so
it is fully responsive for desktop use while mining.
Device 0 performs autotune and runs in interactive mode because I explicitly
set its launch config to auto and the corresponding interactive flag is 1.
Device 1 will use kernel launch configuration F27x3 (for Fermi) and device 2
uses K28x4 (for Kepler), both in non-interactive mode.
I turn on the texture cache in 2D mode for device 1 and in 1D mode for
device 2, and leave it off for device 0.
The given -o/-O settings mine on the coinotron pool using the stratum
protocol.
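
The per-device lists in this example line up by position. A minimal C sketch of that bookkeeping, assuming hypothetical names (split_list, entry_for, build_configs, struct device_cfg) and two rules of its own that the text above does not specify: a list shorter than the device list reuses its last entry, and an empty list falls back to interactive off, texture cache off and launch config "auto".

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_DEVICES 8
#define ITEM_LEN    16

/* Hypothetical per-device settings collected from -d, -i, -l and -C. */
struct device_cfg {
    int device_id;          /* from -d */
    int interactive;        /* from -i: 0 or 1 */
    int texture_cache;      /* from -C: 0 = off, 1 = 1D, 2 = 2D */
    char launch[ITEM_LEN];  /* from -l: "auto", a prefix, or e.g. "F27x3" */
};

/* Splits "a,b,c" into at most max_items strings; returns the count. */
static int split_list(const char *arg, char items[][ITEM_LEN], int max_items)
{
    int count = 0;
    while (count < max_items && *arg != '\0') {
        size_t len = strcspn(arg, ",");
        snprintf(items[count], ITEM_LEN, "%.*s", (int)len, arg);
        count++;
        arg += len;
        if (*arg == ',')
            arg++;
    }
    return count;
}

/* Entry k of a list. A shorter list reuses its last entry (this sketch's
 * own rule); an empty list yields NULL so the caller applies a default. */
static const char *entry_for(char items[][ITEM_LEN], int count, int k)
{
    if (count == 0)
        return NULL;
    return items[k < count ? k : count - 1];
}

/* Combines the four option values into one config per listed device.
 * Defaults for empty lists (assumed): interactive off, cache off, "auto". */
static int build_configs(const char *d_arg, const char *i_arg, const char *l_arg,
                         const char *c_arg, struct device_cfg out[MAX_DEVICES])
{
    char dev[MAX_DEVICES][ITEM_LEN], inter[MAX_DEVICES][ITEM_LEN];
    char launch[MAX_DEVICES][ITEM_LEN], tex[MAX_DEVICES][ITEM_LEN];
    assert(d_arg && i_arg && l_arg && c_arg && out);
    int nd = split_list(d_arg, dev, MAX_DEVICES);
    int ni = split_list(i_arg, inter, MAX_DEVICES);
    int nl = split_list(l_arg, launch, MAX_DEVICES);
    int nc = split_list(c_arg, tex, MAX_DEVICES);
    for (int k = 0; k < nd; k++) {
        const char *iv = entry_for(inter, ni, k);
        const char *lv = entry_for(launch, nl, k);
        const char *cv = entry_for(tex, nc, k);
        out[k].device_id = atoi(dev[k]);
        out[k].interactive = iv ? atoi(iv) : 0;
        out[k].texture_cache = cv ? atoi(cv) : 0;
        snprintf(out[k].launch, ITEM_LEN, "%s", lv ? lv : "auto");
    }
    return nd;
}
```

Fed the values from the example command line, this yields device 0 with "auto", interactive on and no texture cache; device 1 with F27x3 and the 2D cache; device 2 with K28x4 and the 1D cache.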

>>> Additional Notes <<<

The HMAC SHA-256 parts of scrypt are still executed on the CPU, and so
any Bitcoin mining will NOT be GPU accelerated. This tool is for LTC.
There is also some SHA256 hashing required to do scrypt hashing, and
this part is also done by the CPU, in parallel to the work done by
the GPU(s).
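
For orientation, the GPU half of Litecoin's scrypt (N = 1024, r = 1, p = 1) can be sketched on the CPU in C: the Salsa20/8 core as given in RFC 7914, BlockMix for r = 1, and ROMix with its scratchpad write and read phases. This is a reference-style sketch, not cudaminer's CUDA kernels; the function names are illustrative, the input is assumed to be already decoded into little-endian 32-bit words, and the PBKDF2-HMAC-SHA256 steps that wrap ROMix on the CPU are omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Litecoin's scrypt parameters: N = 1024, r = 1, p = 1. */
#define SCRYPT_N         1024u
#define BLOCK_WORDS      32u                            /* 128 * r bytes */
#define SCRATCHPAD_BYTES (SCRYPT_N * BLOCK_WORDS * 4u)  /* 128 KiB per hash */

static uint32_t rotl32(uint32_t v, unsigned c)
{
    return (v << c) | (v >> (32u - c));
}

/* Salsa20/8 core as specified for scrypt (RFC 7914): 4 double rounds,
 * then the input is added back word by word. */
static void salsa20_8(uint32_t B[16])
{
    uint32_t x[16];
    memcpy(x, B, sizeof x);
    for (int i = 0; i < 8; i += 2) {
        /* column round */
        x[ 4] ^= rotl32(x[ 0] + x[12],  7);  x[ 8] ^= rotl32(x[ 4] + x[ 0],  9);
        x[12] ^= rotl32(x[ 8] + x[ 4], 13);  x[ 0] ^= rotl32(x[12] + x[ 8], 18);
        x[ 9] ^= rotl32(x[ 5] + x[ 1],  7);  x[13] ^= rotl32(x[ 9] + x[ 5],  9);
        x[ 1] ^= rotl32(x[13] + x[ 9], 13);  x[ 5] ^= rotl32(x[ 1] + x[13], 18);
        x[14] ^= rotl32(x[10] + x[ 6],  7);  x[ 2] ^= rotl32(x[14] + x[10],  9);
        x[ 6] ^= rotl32(x[ 2] + x[14], 13);  x[10] ^= rotl32(x[ 6] + x[ 2], 18);
        x[ 3] ^= rotl32(x[15] + x[11],  7);  x[ 7] ^= rotl32(x[ 3] + x[15],  9);
        x[11] ^= rotl32(x[ 7] + x[ 3], 13);  x[15] ^= rotl32(x[11] + x[ 7], 18);
        /* row round */
        x[ 1] ^= rotl32(x[ 0] + x[ 3],  7);  x[ 2] ^= rotl32(x[ 1] + x[ 0],  9);
        x[ 3] ^= rotl32(x[ 2] + x[ 1], 13);  x[ 0] ^= rotl32(x[ 3] + x[ 2], 18);
        x[ 6] ^= rotl32(x[ 5] + x[ 4],  7);  x[ 7] ^= rotl32(x[ 6] + x[ 5],  9);
        x[ 4] ^= rotl32(x[ 7] + x[ 6], 13);  x[ 5] ^= rotl32(x[ 4] + x[ 7], 18);
        x[11] ^= rotl32(x[10] + x[ 9],  7);  x[ 8] ^= rotl32(x[11] + x[10],  9);
        x[ 9] ^= rotl32(x[ 8] + x[11], 13);  x[10] ^= rotl32(x[ 9] + x[ 8], 18);
        x[12] ^= rotl32(x[15] + x[14],  7);  x[13] ^= rotl32(x[12] + x[15],  9);
        x[14] ^= rotl32(x[13] + x[12], 13);  x[15] ^= rotl32(x[14] + x[13], 18);
    }
    for (int i = 0; i < 16; i++)
        B[i] += x[i];
}

/* BlockMix for r = 1: B = B0 || B1 (16 words each), output Y0 || Y1. */
static void blockmix_r1(uint32_t B[BLOCK_WORDS])
{
    uint32_t X[16];
    memcpy(X, &B[16], sizeof X);             /* X = B1 */
    for (unsigned i = 0; i < 2; i++) {
        for (unsigned k = 0; k < 16; k++)
            X[k] ^= B[16 * i + k];
        salsa20_8(X);
        memcpy(&B[16 * i], X, sizeof X);     /* B1 is read before being overwritten */
    }
}

/* ROMix: the write phase fills scratchpad V sequentially, the read phase
 * fetches it at data-dependent indices. V holds SCRYPT_N * BLOCK_WORDS words. */
static void romix(uint32_t X[BLOCK_WORDS], uint32_t *V)
{
    assert(V != NULL);
    for (uint32_t i = 0; i < SCRYPT_N; i++) {            /* write phase */
        memcpy(&V[i * BLOCK_WORDS], X, BLOCK_WORDS * 4u);
        blockmix_r1(X);
    }
    for (uint32_t i = 0; i < SCRYPT_N; i++) {            /* read phase */
        /* Integerify(X) mod N: low bits of the last 64-byte block's first word */
        uint32_t j = X[16] & (SCRYPT_N - 1u);
        for (uint32_t k = 0; k < BLOCK_WORDS; k++)
            X[k] ^= V[j * BLOCK_WORDS + k];
        blockmix_r1(X);
    }
}
```

Each hash in flight needs its own 1024 x 128 bytes = 128 KiB scratchpad, which is why GPU memory size bounds how many hashes a kernel launch can work on at once, and why the texture cache options concern reads from that scratchpad.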

This code should be fine on nVidia GPUs ranging from compute
capability 1.1 up to compute capability 3.5. The GeForce Titan has
received experimental and untested support.

To see what autotuning does, enable the debug (-D) switch.
You will get a table of kHash/s for a variety of launch configurations.
You may want to do this only when running on a single GPU; otherwise
the autotuning output of multiple cards will be mixed together.
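
The idea behind autotuning can be illustrated as an exhaustive sweep that benchmarks each launch configuration and keeps the fastest. This sketch assumes a hypothetical bench callback returning kHash/s; the search ranges, the tie-breaking rule and the toy benchmark are illustrative and not taken from cudaminer.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical measurement hook: returns the observed rate in kHash/s for
 * a blocks x warps launch, or <= 0 if that configuration failed. */
typedef double (*bench_fn)(int blocks, int warps, void *ctx);

struct launch_choice {
    int blocks;
    int warps;
    double khash;
};

/* Sweeps 1..max_blocks x 1..max_warps and keeps the fastest configuration.
 * With verbose set it prints one line per measurement, similar in spirit
 * to the table shown with -D. */
static struct launch_choice autotune(bench_fn bench, void *ctx,
                                     int max_blocks, int max_warps, int verbose)
{
    struct launch_choice best = {0, 0, 0.0};
    assert(bench != NULL && max_blocks > 0 && max_warps > 0);
    for (int w = 1; w <= max_warps; w++) {
        for (int b = 1; b <= max_blocks; b++) {
            double rate = bench(b, w, ctx);
            if (verbose)
                printf("%3dx%d: %8.2f kHash/s\n", b, w, rate);
            if (rate > best.khash) {    /* ties keep the earlier config */
                best.blocks = b;
                best.warps = w;
                best.khash = rate;
            }
        }
    }
    return best;
}

/* Toy stand-in for a real GPU benchmark, for demonstration only: the rate
 * grows with blocks * warps up to a made-up limit of 96 warps, beyond which
 * the launch "fails". */
static double toy_bench(int blocks, int warps, void *ctx)
{
    (void)ctx;
    int total = blocks * warps;
    return total <= 96 ? (double)total : 0.0;
}
```

A real benchmark would launch the kernel and time it, which is why autotuning can take minutes and keep the GPU busy.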

>>> RELEASE HISTORY <<<

- the November 1st release finally fixes the stratum protocol
  hang for good. Root cause analysis: ssize_t wasn't a signed
  type in my Windows port, causing the stratum_send_line
  function to enter an infinite loop whenever the connection was
  lost, while holding the socket mutex.
- the October 10th release may fix some infrequent hanging in
  the stratum protocol. Or maybe not. Please test.
  I also turned off the parallel SHA256 computations on the CPU
  because they seem to load the CPU a little more in most cases.
  Use -H 1 to get the previous behavior.
- the July 13th release adds support for the Stratum protocol,
  by making a fresh fork of pooler's cpuminer code (and any future
  updates of pooler's code will be easier to integrate).
- the April 30th release fixes a minor problem in the CPU SHA-256
  parallelization that might have led to inflated CPU use.
  I modified the CUDA API command issue order to get 99-100%
  utilization out of my Kepler card on Windows.
  The old "S" kernels have been replaced with kernels that seem
  to slightly improve performance on Kepler cards. Just prepend
  your previous Kepler launch config (e.g. 28x8) with an S prefix
  to see if you get any performance gains. Works for me! ;)
- the April 22nd release fixes Linux 64 bit compilation and reintroduces
  memory access optimizations in the Titan kernel.
- the April 17th release fixes the texture cache feature (yay!) but
  even Kepler cards currently see no real benefits yet (boo!).
  Ctrl-C will now also interrupt the autotuning loop, and pressing
  Ctrl-C a second time will always result in a hard exit.
  The Titan kernel was refactored into a write-to-scratchpad phase and
  a read-from-scratchpad phase using const __restrict__ pointers,
  which makes the Titan automatically use the 48 KB texture cache in each
  SMX during the read phase. No need to use the -C flag with Titan.
  CPU utilization seems lower than in previous releases, especially in
  interactive mode. In fact I barely see cudaminer.exe consuming CPU
  resources at all ;)
- the April 14th release lowers the CPU use dramatically. I also fixed the
  Windows-specific driver crash on CTRL-C. You still should not
  click the close button on the DOS box, as this does not leave the
  program enough time to shut down cleanly.
- the April 13th release turns the broken texture cache feature OFF by
  default, as it now also seems detrimental to performance. So what remains
  of yesterday's update is just the interactive mode and the restored
  GeForce Titan support.
  I also added a validation of GPU results by the CPU.
- the April 12th update boosts Kepler performance by 15-20% by enabling
  the texture cache on these devices for the scrypt scratchpad lookups.
  You can also override the use of the texture cache from the command line.
  I also added an interactive mode for cards that drive monitors, so you
  can be almost lag-free when using the desktop. It costs some performance,
  though. In interactive mode, autotuning selects smaller kernel launch
  configs. Try not to override this with huge launch configs, or the
  effect of interactive mode would be negated.
  I put Titan support back to its original state. I suspect that a CUDA
  compiler bug made the kernel crash when I applied the same optimizations
  that work so nicely on Compute 1.0 through 3.0 devices.
- the April 10th update speeds up the CUDA kernels SIGNIFICANTLY by using
  larger memory transactions (yay!!!)
- the April 9th update fixes an autotune problem and adds Linux autotools
  support.
- the April 8th release adds CUDA kernel optimizations that may get up to
  20% more kHash out of newer cards (Fermi generation and later...).
  It also adds UNTESTED GeForce Titan support.
  I also use Microsoft's Parallel Patterns Library to split up the CPU
  HMAC SHA256 workload over several CPU cores. This was a limiting factor
  for some GPUs before.
- the April 6th release adds an auto-tuning feature that determines the
  best kernel launch configuration per GPU. It takes up to a few minutes,
  during which the GPU's memory and the host CPU may be pegged a bit. You
  can disable this tuning with the --no-autotune switch.
- April 4th: initial release.
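
The root cause in the November 1st entry is a general C pitfall: send() reports errors with a negative return value, which only works if the result is kept in a signed type. A minimal sketch, assuming a hypothetical send_fn callback with BSD send() semantics and toy callbacks in place of a real socket; it is not the stratum_send_line code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical transport callback with BSD send() semantics: returns the
 * number of bytes written, or a negative value once the connection is lost. */
typedef long (*send_fn)(void *ctx, const char *buf, size_t len);

/* Sends the whole buffer. The result lives in a signed long, so the error
 * test can fire and the caller regains control (and can release whatever
 * mutex it holds). A zero-byte write is also treated as failure here so
 * the loop always terminates. */
static bool send_all(send_fn send_cb, void *ctx, const char *buf, size_t len)
{
    assert(send_cb != NULL);
    while (len > 0) {
        long n = send_cb(ctx, buf, len);
        if (n <= 0)
            return false;
        buf += n;
        len -= (size_t)n;
    }
    return true;
}

/* The pitfall: stored in an unsigned type, -1 becomes SIZE_MAX, so an error
 * looks like a huge successful write and "n < 0" can never be true. */
static size_t as_unsigned_result(long raw)
{
    return (size_t)raw;
}

/* Toy callbacks for demonstration: one accepts at most 4 bytes per call,
 * the other always fails as if the pool had dropped the connection. */
struct sink {
    char data[64];
    size_t used;
};

static long chunked_send(void *ctx, const char *buf, size_t len)
{
    struct sink *s = ctx;
    size_t n = len < 4 ? len : 4;
    if (s->used + n > sizeof s->data)
        return -1;
    memcpy(s->data + s->used, buf, n);
    s->used += n;
    return (long)n;
}

static long failing_send(void *ctx, const char *buf, size_t len)
{
    (void)ctx; (void)buf; (void)len;
    return -1;
}
```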

>>> About CUDA Kernels <<<

CUDA kernels do the computation. Which one we select and in which
configuration it is run greatly affects performance. CUDA kernel
launch configurations are given as a character string made of a
kernel prefix followed by blocks x warps, e.g. F27x3 selects the
F kernel with 27 blocks x 3 warps.

Available kernel prefixes are:
L - Legacy cards (compute 1.x)
F - Fermi cards (compute 2.x)
K - Kepler cards (compute 3.0); the letter S (for "spinlock") also works
T - Titan and GK208 based cards (compute 3.5)
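
The prefix table amounts to a lookup on the device's compute capability, with a prefix given via -l taking precedence. A minimal sketch; default_kernel_prefix and select_kernel_prefix are hypothetical names, and returning '\0' for capabilities the table does not list is this sketch's own choice.

```c
#include <assert.h>

/* Default kernel prefix for a CUDA compute capability, following the table
 * above: L = 1.x, F = 2.x, K = 3.0, T = 3.5. Anything the table does not
 * list yields '\0'. Hypothetical helper, not cudaminer's own code. */
static char default_kernel_prefix(int major, int minor)
{
    assert(major >= 0 && minor >= 0);
    if (major == 1)
        return 'L';
    if (major == 2)
        return 'F';
    if (major == 3 && minor == 0)
        return 'K';
    if (major == 3 && minor == 5)
        return 'T';
    return '\0';
}

/* A prefix given with -l (e.g. "-l F") takes precedence over detection. */
static char select_kernel_prefix(char override, int major, int minor)
{
    return override != '\0' ? override : default_kernel_prefix(major, minor);
}
```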

Examples:
L27x3 is a launch configuration that works well on the GTX 260
F28x4 is a launch configuration that works on the GeForce GTX 460
K290x2 is a launch configuration that works on the GeForce GTX 660Ti
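
A launch configuration string can be checked with a small parser. This sketch assumes the grammar implied above ("auto", a bare prefix, or prefix + blocks + "x" + warps, with S accepted alongside L, F, K and T); struct launch_config and parse_launch_config are hypothetical names, not cudaminer's.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Parsed form of a launch configuration string such as "F27x3".
 * blocks == warps == 0 means "autotune" ("auto" or a bare prefix). */
struct launch_config {
    char prefix;   /* 'L', 'F', 'K', 'S', 'T', or '\0' for "auto" */
    int blocks;
    int warps;
};

static bool is_kernel_prefix(char c)
{
    /* strchr would also "find" the terminating '\0', so exclude it */
    return c != '\0' && strchr("LFKST", c) != NULL;
}

/* Parses "auto", a bare prefix ("K") or prefix + blocks + 'x' + warps
 * ("K290x2"). Returns false on malformed input. */
static bool parse_launch_config(const char *s, struct launch_config *out)
{
    char *end;
    assert(s != NULL && out != NULL);
    out->prefix = '\0';
    out->blocks = out->warps = 0;
    if (strcmp(s, "auto") == 0)
        return true;
    if (!is_kernel_prefix(s[0]))
        return false;
    out->prefix = s[0];
    if (s[1] == '\0')
        return true;                        /* bare prefix: autotune this kernel */
    if (!isdigit((unsigned char)s[1]))
        return false;
    long blocks = strtol(s + 1, &end, 10);
    if (*end != 'x' || !isdigit((unsigned char)end[1]))
        return false;
    long warps = strtol(end + 1, &end, 10);
    if (*end != '\0' || blocks <= 0 || warps <= 0)
        return false;
    out->blocks = (int)blocks;
    out->warps = (int)warps;
    return true;
}
```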

You should wait through autotune to see which kernel is found best for
your current hardware configuration. You can also override autotune's
automatic device generation selection, e.g. pass

-l F
or
-l K
or
-l T

in order to autotune the Fermi, Kepler or Titan kernel respectively,
e.g. the Fermi kernel on a Legacy, Kepler or Titan device.

>>> TODO <<<

Usability Improvements:
- add reasonable error checking for CUDA API calls
- provide a compiled 64 bit version for Windows as well
- add failover support between different pools
- investigate why on some machines the legacy kernel fails,
  and on other machines the Fermi kernel fails.

Further Optimization:
- consider use of some inline assembly in CUDA
- investigate benefits of a LOOKUP_GAP implementation
- optimize more for compute 3.5 devices like newer GT640 cards
  and the GeForce Titan.

***************************************************************
If you find this tool useful and would like to support its continued
development, then consider a donation in LTC.
The donation address is LKS1WDKGED647msBQfLBHV3Ls8sveGncnm
***************************************************************

Source code is included to satisfy GNU GPL V2 requirements.

With kind regards,
Christian Buchner ( Christian.Buchner@gmail.com )
