Shanghai Jiao Tong University: "Multicore Architecture and Parallel Computing" course materials (PPT lecture slides), Lecture 8: CUDA, cont'd
Problem in the Program

In each iteration, two control-flow paths must be traversed sequentially for each warp: the threads that perform the addition, and the threads that do not. Depending on how the hardware implements divergence, the threads that do not perform the addition may still cost extra cycles.

- No more than half of the threads are executing at any time; all odd-index threads are disabled right from the beginning.
- On average, fewer than 1/4 of the threads are active across all warps over time.
- After the 5th iteration, entire warps in each block are disabled: poor resource utilization, but no divergence in those warps.
- This can go on for up to 4 more iterations (512/32 = 16 = 2^4), where each remaining iteration has only one active thread per warp, until all warps retire.

The reduction kernel body (assume the input array has already been loaded into the shared array partialSum):

```cuda
// BAD: divergence due to interleaved branch decisions
// assumes: __shared__ float partialSum[] has already been populated
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    if (t % (2 * stride) == 0)
        partialSum[t] += partialSum[t + stride];
}
```

Better Implementation

```cuda
// BETTER: active threads are contiguous, so whole warps retire together
// assumes: __shared__ float partialSum[] has already been populated
unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x >> 1; stride >= 1; stride >>= 1) {
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t + stride];
}
```