Shanghai Jiao Tong University: "Multicore Architecture and Parallel Computing" course materials (PPT lecture slides), Lecture 8: CUDA, cont'd
Problem in the Program

In each iteration, two control-flow paths must be traversed sequentially for each warp: the threads that perform the addition, and the threads that do not. Depending on how the hardware implements divergence, the threads that do not perform the addition may still cost extra cycles.

- No more than half of the threads are executing at any time; all odd-index threads are disabled right from the beginning.
- On average, fewer than 1/4 of the threads are active across all warps over time.
- After the 5th iteration, entire warps in each block are disabled: poor resource utilization, but no divergence in those warps.
- This can go on for up to 4 more iterations (512/32 = 16 = 2^4), where each remaining iteration has only one active thread per warp, until all warps retire.

The reduction kernel body (assume the input array has already been loaded into the shared array partialSum):

```cuda
// BAD: divergence due to interleaved branch decisions
// assumes: __shared__ float partialSum[] has already been populated
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    if (t % (2 * stride) == 0)
        partialSum[t] += partialSum[t + stride];
}
```

Better Implementation

```cuda
// BETTER: active threads are contiguous, so whole warps retire together
// assumes: __shared__ float partialSum[] has already been populated
unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x >> 1; stride >= 1; stride >>= 1) {
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t + stride];
}
```