gpgpu - OpenCL abstraction and actual hardware
I'm trying to develop a better intuition for the mapping between OpenCL's abstractions and the actual hardware. For instance, take my late-2011 MacBook Pro's configuration:
1)
Radeon 6770M GPU: http://www.amd.com/us/products/notebook/graphics/amd-radeon-6000m/amd-radeon-6700m-6600m/pages/amd-radeon-6700m-6600m.aspx#2
"480 stream processors" is, I guess, the important number there.
2)
On the other hand, the OpenCL API gives me these numbers:
    DEVICE_NAME = ATI Radeon HD 6770M
    DRIVER_VERSION = 1.0
    DEVICE_VENDOR = AMD
    DEVICE_VERSION = OpenCL 1.1
    DEVICE_MAX_COMPUTE_UNITS = 6
    DEVICE_MAX_CLOCK_FREQUENCY = 675
    DEVICE_GLOBAL_MEM_SIZE = 1073741824
    DEVICE_LOCAL_MEM_SIZE = 32768
    CL_DEVICE_ADDRESS_BITS = 32
    CL_DEVICE_GLOBAL_MEM_CACHE_SIZE = 0
    CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE = 0
    CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE = 65536
    CL_DEVICE_MAX_WORK_GROUP_SIZE = 1024
    CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
    CL_DEVICE_MAX_WORK_ITEM_SIZES = (1024, 1024, 1024)
And querying the work-group size and preferred multiple for a trivial kernel (a pass-through of float4 from input to output in global memory) gives:
    CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 64
    CL_KERNEL_WORK_GROUP_SIZE = 256
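For reference, the numbers above can be reproduced with clGetDeviceInfo and clGetKernelWorkGroupInfo. Below is a minimal host-side sketch of my own (not part of the original question); the kernel name "copy", the simplified platform/device selection and the omitted error checking are all assumptions made for brevity:

    #include <stdio.h>
    #include <CL/cl.h>   /* <OpenCL/opencl.h> on OS X */

    /* Trivial pass-through kernel: copy float4 elements from input to output. */
    static const char *src =
        "__kernel void copy(__global const float4 *in, __global float4 *out) {\n"
        "    size_t i = get_global_id(0);\n"
        "    out[i] = in[i];\n"
        "}\n";

    int main(void) {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        /* Device-level numbers (as listed above). */
        cl_uint cu;          /* CL_DEVICE_MAX_COMPUTE_UNITS, 6 on this GPU */
        cl_ulong local_mem;  /* CL_DEVICE_LOCAL_MEM_SIZE, 32768 here       */
        size_t max_wg;       /* CL_DEVICE_MAX_WORK_GROUP_SIZE, 1024 here   */
        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cu), &cu, NULL);
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(local_mem), &local_mem, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(max_wg), &max_wg, NULL);

        /* Kernel-level numbers for the pass-through kernel. */
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel kernel = clCreateKernel(prog, "copy", NULL);

        size_t wg_size, preferred_multiple;
        clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                                 sizeof(wg_size), &wg_size, NULL);
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(preferred_multiple), &preferred_multiple, NULL);

        printf("compute units = %u, local mem = %lu, max work-group size = %zu\n",
               (unsigned)cu, (unsigned long)local_mem, max_wg);
        printf("kernel work-group size = %zu, preferred multiple = %zu\n",
               wg_size, preferred_multiple);
        return 0;
    }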
3)
The OpenCL specification states that an entire work-group must be able to run concurrently on a single compute unit of the device.
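As an illustration of why that guarantee matters (my own sketch, not from the question), a kernel that stages data in __local memory and uses a barrier relies on every work-item of the group being resident on the same compute unit at the same time:

    /* Each work-group loads a tile into local memory, synchronizes, then
     * writes the tile back reversed, reading values written by other
     * work-items of the same group. */
    __kernel void reverse_in_tile(__global const float *in,
                                  __global float *out,
                                  __local float *tile) {
        size_t lid = get_local_id(0);
        size_t gid = get_global_id(0);
        size_t lsz = get_local_size(0);

        tile[lid] = in[gid];            /* stage one element per work-item       */
        barrier(CLK_LOCAL_MEM_FENCE);   /* all work-items in the group meet here */
        out[gid] = tile[lsz - 1 - lid]; /* read another work-item's element      */
    }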
4)
OpenCL gives me the device's SIMD width through the preferred work-group size multiple, which is 64 in the case above.
Somehow I cannot fit "6", "480" and these powers of 2 into a relationship: if the number of compute units is 6 and the SIMD width is 64, that gives 6 * 64 = 384, not 480.
Can anyone explain how these numbers relate to the hardware?
In this GPU, each "compute unit" is a core executing one or more work-groups.
The maximum size of each work-group is 256 for this specific kernel (obtained with clGetKernelWorkGroupInfo). It can be less if the kernel requires more resources (registers, local memory).
In each core, 16 work-items are physically active at any given time, and they all execute the same "large instruction" (see VLIW5) mapped onto 5 arithmetic units (ALUs). That gives 5 * 16 = 80 ALUs per core, or 480 "stream processors" for 6 cores.
Work-items are executed in blocks of 64 (a "wavefront" in AMD terminology): 64 work-items execute the same VLIW5 instruction, and are physically executed in 4 passes of 16. This is why the preferred work-group size is a multiple of 64.
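Putting those numbers together (just restating the arithmetic above):

    6 compute units x 16 lanes x 5 ALUs (VLIW5) = 480 "stream processors"
    16 lanes x 4 passes                         = 64 work-items per wavefront
                                                  (the preferred work-group size multiple)

So the 64 is the width of a wavefront issued over 4 passes of the 16 physical lanes, not a count of ALUs, which is why 6 * 64 = 384 does not line up with 480.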
More recent AMD GPUs have switched to a VLIW4 model, where each instruction maps onto 4 ALUs.
Tags: opencl, gpgpu