Tuesday, 15 March 2011

gpgpu - OpenCL abstraction and actual hardware

I'm trying to develop / improve my intuition of the mapping between OpenCL's abstractions and the actual hardware. For instance, take my late-2011 MacBook Pro's configuration:

1)

radeon 6770m gpu: http://www.amd.com/us/products/notebook/graphics/amd-radeon-6000m/amd-radeon-6700m-6600m/pages/amd-radeon-6700m-6600m.aspx#2

"480 stream processors" guess of import number there.

2)

On the other hand, the OpenCL API gives me these numbers:

DEVICE_NAME = ATI Radeon HD 6770M
DRIVER_VERSION = 1.0
DEVICE_VENDOR = AMD
DEVICE_VERSION = OpenCL 1.1
DEVICE_MAX_COMPUTE_UNITS = 6
DEVICE_MAX_CLOCK_FREQUENCY = 675
DEVICE_GLOBAL_MEM_SIZE = 1073741824
DEVICE_LOCAL_MEM_SIZE = 32768
CL_DEVICE_ADDRESS_BITS = 32
CL_DEVICE_GLOBAL_MEM_CACHE_SIZE = 0
CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE = 0
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE = 65536
CL_DEVICE_MAX_WORK_GROUP_SIZE = 1024
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
CL_DEVICE_MAX_WORK_ITEM_SIZES = (1024, 1024, 1024)
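(For anyone wanting to reproduce numbers like these, here is a minimal sketch of the clGetDeviceInfo queries involved, assuming a cl_device_id has already been obtained and with error handling omitted:)

#include <stdio.h>
#include <OpenCL/opencl.h>   /* <CL/cl.h> on non-Apple platforms */

/* Print a few of the device properties listed above. */
static void print_device_numbers(cl_device_id device)
{
    char name[256];
    cl_uint compute_units, clock_mhz;
    cl_ulong global_mem, local_mem;
    size_t max_wg_size;

    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(compute_units), &compute_units, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY,
                    sizeof(clock_mhz), &clock_mhz, NULL);
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(global_mem), &global_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem), &local_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_wg_size), &max_wg_size, NULL);

    printf("DEVICE_NAME = %s\n", name);
    printf("DEVICE_MAX_COMPUTE_UNITS = %u\n", compute_units);
    printf("DEVICE_MAX_CLOCK_FREQUENCY = %u\n", clock_mhz);
    printf("DEVICE_GLOBAL_MEM_SIZE = %llu\n", (unsigned long long)global_mem);
    printf("DEVICE_LOCAL_MEM_SIZE = %llu\n", (unsigned long long)local_mem);
    printf("CL_DEVICE_MAX_WORK_GROUP_SIZE = %zu\n", max_wg_size);
}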

And querying the work-group size and preferred multiple for a trivial kernel (a pass-through of float4 from input to output global memory) gives:

CL_KERNEL_PREFERRED_WORKGROUP_SIZE_MULTIPLE = 64
CL_KERNEL_WORK_GROUP_SIZE = 256
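For context, the trivial kernel behind those numbers would look something like this (my own reconstruction, not the exact code from the post), together with the clGetKernelWorkGroupInfo queries that return the two values:

/* An assumed reconstruction of the pass-through kernel (OpenCL C): */
__kernel void pass_through(__global const float4 *in,
                           __global float4 *out)
{
    size_t i = get_global_id(0);
    out[i] = in[i];
}

/* Host-side queries; kernel and device are assumed already created: */
size_t wg_size, wg_multiple;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(wg_size), &wg_size, NULL);
clGetKernelWorkGroupInfo(kernel, device,
                         CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                         sizeof(wg_multiple), &wg_multiple, NULL);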

3)

The OpenCL specification states that an entire work-group must be able to run concurrently on a single compute unit of the device.

4)

OpenCL gives the device's SIMD width through the preferred work-group size multiple, 64 in the above case.

Somehow I cannot put "6", "480", and the powers of 2 into a relationship. If the number of compute units is 6 and the SIMD width is 64, I get 384, not 480.

Can someone explain how these numbers relate to the hardware?

In this GPU, each "compute unit" is a core executing one or more work-groups.

The max size of each work-group is 256 for this specific kernel (obtained via clGetKernelWorkGroupInfo). It can be less if the kernel requires more resources (registers, local memory).

In each core, 16 work-items are physically active at any given time, and they all execute the same "large instruction" (see VLIW5), which is mapped onto 5 arithmetic units (ALUs). That gives 5 * 16 = 80 ALUs per core, or 480 "stream processors" for 6 cores.

Work-items are executed in blocks of 64 (a "wavefront" in AMD terminology): 64 work-items execute the same VLIW5 instruction and are physically executed in 4 passes of 16. That is why the preferred work-group size multiple is 64.
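In practice this is why global work sizes are usually padded up to a multiple of the wavefront size; a sketch of that rounding when enqueueing (round_up is my own hypothetical helper, and the kernel would need a bounds check):

/* Pad the global size to a multiple of the wavefront (64 here) so that
   no work-group leaves SIMD lanes idle. round_up is a hypothetical helper. */
static size_t round_up(size_t x, size_t multiple)
{
    return ((x + multiple - 1) / multiple) * multiple;
}

size_t local_size  = 64;  /* one wavefront per work-group */
size_t global_size = round_up(num_elements, local_size);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_size, &local_size, 0, NULL, NULL);
/* Inside the kernel: if (get_global_id(0) >= num_elements) return; */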

More recent AMD GPUs have switched to a VLIW4 model, where each instruction maps onto 4 ALUs.

Labels: opencl, gpgpu
