parallel processing - Why launch a multiple of 32 number of threads in CUDA?


I took a course in CUDA parallel programming and have seen many examples of CUDA thread configuration where it is common to round the number of threads needed up to the nearest multiple of 32. I understand that threads are grouped into warps, and that if you launch 1000 threads the GPU will round that up to 1024 anyway, so why do it explicitly?

The advice is usually given in the context of situations where you could conceivably choose different threadblock sizes to solve the same problem.

Let's take vector add as an example. Suppose my vector length is 100000. I could choose to launch this as 100 blocks of 1000 threads each. In this case, each block will have 1000 active threads and 24 inactive threads, because the block occupies 32 whole warps (1024 thread slots). My average utilization of thread resources is 1000/1024 ≈ 97.7%.

Now, what if I chose blocks of size 1024? Now I only need to launch 98 blocks. The first 97 of these blocks are fully utilized in terms of thread usage: every thread is doing something useful. The 98th block has only 672 (out of 1024) threads doing useful work; the others are explicitly inactive because of a thread check (if (idx < N)) or some other construct in the kernel code. So I have 352 inactive threads in that one block. But my overall average utilization is 100000/100352 = 99.6%.
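The thread check mentioned above might look like the following minimal sketch of a vector add. The names `vector_add`, `blocks_for`, and `add_on_gpu` are hypothetical and chosen only for illustration; error checking of the CUDA runtime calls is omitted for brevity.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Element-wise c = a + b. Threads whose index falls past the end do nothing.
__global__ void vector_add(const float* a, const float* b, float* c, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {  // the thread check: surplus threads in the last block stay idle
        c[idx] = a[idx] + b[idx];
    }
}

// Hypothetical helper: blocks needed to cover n elements (ceiling division).
inline int blocks_for(int n, int threads_per_block)
{
    assert(threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// Hypothetical host wrapper: copy inputs to the device, launch, copy the result back.
// Return codes of the runtime calls are not checked in this sketch.
std::vector<float> add_on_gpu(const std::vector<float>& a,
                              const std::vector<float>& b,
                              int threads_per_block)
{
    assert(a.size() == b.size());
    int n = static_cast<int>(a.size());
    size_t bytes = a.size() * sizeof(float);

    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    // For n = 100000 and 1024 threads per block this launches 98 blocks.
    vector_add<<<blocks_for(n, threads_per_block), threads_per_block>>>(d_a, d_b, d_c, n);

    std::vector<float> c(a.size());
    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel to finish

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return c;
}
```

With a vector of length 100000 and `threads_per_block` set to 1024, only the last of the 98 blocks has threads that fail the `idx < n` test.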

So in the above scenario, it is better to choose "full" threadblocks, with a size evenly divisible by 32.
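The utilization figures for the two launch configurations can be reproduced with a few lines of host-side arithmetic. This is only a sketch: the helper names are hypothetical, and it assumes a warp size of 32 and that every block occupies whole warps.

```cuda
#include <cassert>

// Assumption: warp size of 32, as implied by the "multiple of 32" advice.
constexpr int kWarpSize = 32;

// Blocks needed to cover n elements (ceiling division).
inline int blocks_needed(int n, int block_size)
{
    assert(block_size > 0);
    return (n + block_size - 1) / block_size;
}

// Thread slots one block occupies: whole warps, so round up to a multiple of 32.
inline int slots_per_block(int block_size)
{
    return (block_size + kWarpSize - 1) / kWarpSize * kWarpSize;
}

// Fraction of the occupied thread slots that do useful work for n elements.
inline double utilization(int n, int block_size)
{
    long long slots = 1LL * blocks_needed(n, block_size) * slots_per_block(block_size);
    return static_cast<double>(n) / slots;
}
```

Under these assumptions, `utilization(100000, 1000)` evaluates to 0.9765625 and `utilization(100000, 1024)` to roughly 0.9965, matching the two cases worked through above.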

If you are doing a vector add on a vector of length 1000, and you intend to do it in a single threadblock (both may be bad ideas), then it does not matter whether you choose a threadblock size of 1000 or 1024.

