cuda - Stream scheduling order
The way I see it, both process 1 and process 2 (below) are equivalent and should take the same amount of time. Am I wrong?
    allofdata_a = data_a1 + data_a2
    allofdata_b = data_b1 + data_b2
    allofdata_c = data_c1 + data_c2

data_c is the output of a kernel operating on both data_a and data_b (like c = a + b). The hardware supports one deviceOverlap (concurrent) operation.
process one:
    memcpyasync data_a1 stream1 h->d
    memcpyasync data_a2 stream2 h->d
    memcpyasync data_b1 stream1 h->d
    memcpyasync data_b2 stream2 h->d
    samekernel stream1
    samekernel stream2
    memcpyasync result_c1 stream1 d->h
    memcpyasync result_c2 stream2 d->h
process two: (same operations, different order)
    memcpyasync data_a1 stream1 h->d
    memcpyasync data_b1 stream1 h->d
    samekernel stream1
    memcpyasync data_a2 stream2 h->d
    memcpyasync data_b2 stream2 h->d
    samekernel stream2
    memcpyasync result_c1 stream1 d->h
    memcpyasync result_c2 stream2 d->h
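In actual CUDA runtime calls, process two (all of stream1's work issued before any of stream2's) might look like the sketch below. The kernel body, buffer names, and sizes are illustrative assumptions, not taken from the question; note that the host buffers must be pinned (cudaMallocHost) for cudaMemcpyAsync to actually overlap.

    // Sketch of "process two": depth-first issue order.
    #include <cuda_runtime.h>

    __global__ void sameKernel(float *c, const float *a, const float *b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];   // c = a + b, as in the question
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        // Pinned host memory is required for async copies to overlap.
        float *h_a1, *h_b1, *h_c1, *h_a2, *h_b2, *h_c2;
        cudaMallocHost(&h_a1, bytes); cudaMallocHost(&h_b1, bytes); cudaMallocHost(&h_c1, bytes);
        cudaMallocHost(&h_a2, bytes); cudaMallocHost(&h_b2, bytes); cudaMallocHost(&h_c2, bytes);

        float *d_a1, *d_b1, *d_c1, *d_a2, *d_b2, *d_c2;
        cudaMalloc(&d_a1, bytes); cudaMalloc(&d_b1, bytes); cudaMalloc(&d_c1, bytes);
        cudaMalloc(&d_a2, bytes); cudaMalloc(&d_b2, bytes); cudaMalloc(&d_c2, bytes);

        cudaStream_t stream1, stream2;
        cudaStreamCreate(&stream1);
        cudaStreamCreate(&stream2);

        // All of stream1's work first (depth-first)...
        cudaMemcpyAsync(d_a1, h_a1, bytes, cudaMemcpyHostToDevice, stream1);
        cudaMemcpyAsync(d_b1, h_b1, bytes, cudaMemcpyHostToDevice, stream1);
        sameKernel<<<(n + 255) / 256, 256, 0, stream1>>>(d_c1, d_a1, d_b1, n);

        // ...then all of stream2's work.
        cudaMemcpyAsync(d_a2, h_a2, bytes, cudaMemcpyHostToDevice, stream2);
        cudaMemcpyAsync(d_b2, h_b2, bytes, cudaMemcpyHostToDevice, stream2);
        sameKernel<<<(n + 255) / 256, 256, 0, stream2>>>(d_c2, d_a2, d_b2, n);

        cudaMemcpyAsync(h_c1, d_c1, bytes, cudaMemcpyDeviceToHost, stream1);
        cudaMemcpyAsync(h_c2, d_c2, bytes, cudaMemcpyDeviceToHost, stream2);

        cudaStreamSynchronize(stream1);
        cudaStreamSynchronize(stream2);
        // ... free device and pinned host buffers ...
        return 0;
    }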
Using CUDA streams allows the programmer to express work dependencies by putting dependent operations in the same stream. Work in different streams is independent and can be executed concurrently.
On GPUs without Hyper-Q (compute capability 1.0 to 3.0) you can get false dependencies, because work for the DMA engine or for computation gets fed into a single hardware pipe. Compute capability 3.5 brings Hyper-Q, which allows multiple hardware pipes, so there shouldn't be false dependencies. The simpleHyperQ sample illustrates this, and the documentation has diagrams that explain what is going on more clearly.
Putting it simply: on devices without Hyper-Q you need a breadth-first launch of your work for maximum concurrency, whereas devices with Hyper-Q can use a depth-first launch. Avoiding false dependencies is pretty easy, but not having to worry about it is easier!
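A breadth-first launch of the same work (process one, generalized to any number of streams) could be sketched as below. The helper name, kernel, and parameters are assumptions for illustration; the point is only the issue order: each stage is queued across all streams before moving to the next stage, so the single hardware queue on a pre-Hyper-Q device never sees stream2's copy stuck behind stream1's kernel.

    #include <cuda_runtime.h>

    __global__ void sameKernel(float *c, const float *a, const float *b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    // Hypothetical helper: breadth-first issue over nstreams streams.
    void launchBreadthFirst(int nstreams, cudaStream_t *stream,
                            float **h_a, float **h_b, float **h_c,
                            float **d_a, float **d_b, float **d_c,
                            int n, size_t bytes) {
        for (int s = 0; s < nstreams; ++s) {   // all H->D copies first
            cudaMemcpyAsync(d_a[s], h_a[s], bytes, cudaMemcpyHostToDevice, stream[s]);
            cudaMemcpyAsync(d_b[s], h_b[s], bytes, cudaMemcpyHostToDevice, stream[s]);
        }
        for (int s = 0; s < nstreams; ++s)     // then all kernels
            sameKernel<<<(n + 255) / 256, 256, 0, stream[s]>>>(d_c[s], d_a[s], d_b[s], n);
        for (int s = 0; s < nstreams; ++s)     // then all D->H copies
            cudaMemcpyAsync(h_c[s], d_c[s], bytes, cudaMemcpyDeviceToHost, stream[s]);
    }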