gpu - CUDA Block parallelism
I am writing some code in CUDA and am a little confused about what is actually run in parallel.

Say I am calling a kernel function like this: kernel_foo<<<a, b>>>. Per my device query below, I can have a maximum of 512 threads per block. So am I guaranteed to have 512 computations per block every time I run kernel_foo<<<a, 512>>>? But it says here that one thread runs on one CUDA core, so does that mean I can have 96 threads running concurrently at a time? (See device_query below.)

I also wanted to know about the blocks. Every time I call kernel_foo<<<a, 512>>>, how many computations are done in parallel, and how? I mean, is it done one block after the other, or are blocks parallelized too? If yes, how many blocks can run 512 threads each in parallel? It says here that one block is run on one CUDA SM, so is it true that 12 blocks can run concurrently? If yes, how many threads can each block have running concurrently when all 12 blocks are running concurrently: 8, 96 or 512? (See device_query below.)

Another question: if a had a value of ~50, is it better to launch the kernel as kernel_foo<<<a, 512>>> or as kernel_foo<<<512, a>>>? Assume there is no thread synchronization required.

Sorry, these might be basic questions, but it's kind of complicated... Possible duplicates: "Streaming multiprocessors, Blocks and Threads (CUDA)" and "How do CUDA blocks/warps/threads map onto CUDA cores?"

Thanks

Here's the device_query output:

Device 0: "Quadro FX 4600"
  CUDA Driver Version / Runtime Version: 4.2 / 4.2
  CUDA Capability Major/Minor version number: 1.0
  Total amount of global memory: 768 MBytes (804978688 bytes)
  (12) Multiprocessors x ( 8) CUDA Cores/MP: 96 CUDA Cores
  GPU Clock rate: 1200 MHz (1.20 GHz)
  Memory Clock rate: 700 MHz
  Memory Bus Width: 384-bit
  Max Texture Dimension Size (x,y,z): 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers: 1D=(8192) x 512, 2D=(8192,8192) x 512
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 16384 bytes
  Total number of registers available per block: 8192
  Warp size: 32
  Maximum number of threads per multiprocessor: 768
  Maximum number of threads per block: 512
  Maximum sizes of each dimension of a block: 512 x 512 x 64
  Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 256 bytes
  Concurrent copy and execution: No with 0 copy engine(s)
  Run time limit on kernels: Yes
  Integrated GPU sharing Host Memory: No
  Support host page-locked memory mapping: No
  Concurrent kernel execution: No
  Alignment requirement for Surfaces: Yes
  Device has ECC support enabled: No
  Device is using TCC driver mode: No
  Device supports Unified Addressing (UVA): No
  Device PCI Bus ID / PCI location ID: 2 / 0

Check out this answer for some first pointers! That answer is a little out of date in that it talks about older GPUs with compute capability 1.x, but that matches your GPU in any case. Newer GPUs (2.x and 3.x) have different parameters (number of cores per SM and so on), but once you understand the concept of threads and blocks, and of oversubscribing to hide latencies, the changes are easy to pick up.

Also, you could take this Udacity course or this Coursera course to get going.
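
To make the thread and block limits from the device query a bit more concrete, here is a rough host-side sketch (plain C++, which also compiles with nvcc) that estimates how many blocks can be resident on one multiprocessor for a given block size. The names SmLimits, kQuadroFx4600, warps_per_block, resident_blocks_per_sm and resident_threads_per_sm are made up for this illustration. The limit of 8 resident blocks per multiprocessor is not in the device query; it is an assumption taken from NVIDIA's tables for compute capability 1.x. Register and shared memory usage are ignored, so treat the result as an upper bound, not a guarantee.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor (SM) limits. Hypothetical helper type for this sketch.
struct SmLimits {
    int max_threads_per_sm;     // "Maximum number of threads per multiprocessor"
    int max_threads_per_block;  // "Maximum number of threads per block"
    int warp_size;              // "Warp size"
    int max_blocks_per_sm;      // ASSUMPTION: 8 for compute capability 1.x,
                                // not shown in the device query output
};

// Values for the Quadro FX 4600 from the question (compute capability 1.0).
const SmLimits kQuadroFx4600 = {768, 512, 32, 8};

// Threads are scheduled in whole warps, so a block occupies
// ceil(threads_per_block / warp_size) warps.
int warps_per_block(const SmLimits& lim, int threads_per_block) {
    assert(threads_per_block > 0 &&
           threads_per_block <= lim.max_threads_per_block);
    return (threads_per_block + lim.warp_size - 1) / lim.warp_size;
}

// Upper bound on the number of blocks resident on one SM at the same time,
// considering only the thread/warp limit and the block limit.
int resident_blocks_per_sm(const SmLimits& lim, int threads_per_block) {
    int max_warps_per_sm = lim.max_threads_per_sm / lim.warp_size;
    int by_warps = max_warps_per_sm / warps_per_block(lim, threads_per_block);
    return std::min(by_warps, lim.max_blocks_per_sm);
}

// Threads doing useful work that can be resident on one SM at the same time.
int resident_threads_per_sm(const SmLimits& lim, int threads_per_block) {
    return resident_blocks_per_sm(lim, threads_per_block) * threads_per_block;
}
```

Under these assumptions, a block size of 512 gives 1 resident block (512 threads) per multiprocessor, so at most 12 blocks are resident across the 12 multiprocessors and the remaining blocks wait their turn. A block size of 50 gives 8 resident blocks but only 400 threads per multiprocessor, because each block rounds up to 2 warps and the block limit is hit first. A block size of 256 gives 3 resident blocks and reaches the 768-thread limit. Note that "resident" is not the same as "executing in the same clock cycle": each multiprocessor has only 8 cores, and keeping many more threads resident than there are cores is the oversubscription that hides latency.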
cuda gpu nvidia