cuda - Low performance kernel -
i have cuda kernel there many operations , few branches. looks like
__global__ void kernel(real *randomvalues, real mu, real sigma) { int row = blockdim.y * blockidx.y + threadidx.y; int col = blockdim.x * blockidx.x + threadidx.x; if ( row >= cntimesteps || col >= cnpaths ) return; real alphalevel = randomvalues[row*cnpaths+col]; real q = 0.0; real x = 0.0; if ( alphalevel < p_low) { q = sqrt( -2*log( alphalevel ) ); x = (((((c1*q+c2)*q+c3)*q+c4)*q+c5)*q+c6) / ((((d1*q+d2)*q+d3)*q+d4)*q+1); } else if ( alphalevel < p_high ) { q = alphalevel-0.5; real r = q*q; x= (((((a1*r+a2)*r+a3)*r+a4)*r+a5)*r+a6)*q / (((((b1*r+b2)*r+b3)*r+b4)*r+b5)*r+1); } else { q = sqrt( -2*log( 1.0-alphalevel ) ); x = -(((((c1*q+c2)*q+c3)*q+c4)*q+c5)*q+c6) / ((((d1*q+d2)*q+d3)*q+d4)*q+1); } randomvalues[row*cnpaths+col] = sigma * x + mu; }
where a
's, b
's, c
's , d
's constant values (in device constant memory)
static __device__ __constant__ real a1 = 1.73687; static __device__ __constant__ real a2 = 1.12321100;
and on.
after profiling kernel found theoretical occupancy 100% getting no more 60%.
i went through this , this gtc talks try optimize kernel.
on 1 side have ipc reports average of 1.32 issued instructions , 0.62 executed. instruction serialization 50% sm activity 100%. on other hand, there around 38 active warps 8 eligible execute next instruction on warp issue efficiency around 70% of cycles there no eligible warp. stall reasons reported "other" think has computation of log
, sqrt
.
- how can sm activity 99.82% if of cycles there no eligible warp?
- how can reduce stall?
- as threads in warp may not go same branch, requests constant memory seralized, true? should put constants in global memory (maybe use shared memory also)?
is first time use nsight visual studio i'm trying figure out meaning of performance analysis. btw card quadro k4000.
1) how can sm activity 99.82% if of cycles there no eligible warp?
a warp active if registers , warp slot allocated warp. sm active if @ least 1 warp active on sm.
sm activity should not confused efficiency.
2) how can reduce stall?
in case of code above warps stalled waiting the double precision execution units available. quadro k4000 has throughput of 8 threads/cycle double precision operations.
the remedies problem are: a. decrease number of double precision operations. example, moving consecutive operations float may improve performance single precision floating point throughput 24x double precision throughput. b. execute kernel on gk110 has 8x double precision throughput of gk10x.
increasing achieved occupancy may not increase performance of kernel on k4000. have provided insufficient information determine why achieved occupancy less theoretical occupancy.
the achieved flops experiment can used confirm if kernel performance bound double precision throughput.
3) threads in warp may not go same branch, requests constant memory seralized, true? should put constants in global memory (maybe use shared memory also)?
the code has no memory address divergence in constant memory loads. warp control flow divergence means on each request on portion of threads active.
the initial global load may not coalesced. need provide value of cnpaths review. @ memory experiments or source correlated experiments.
the if , else statement may able coded in more efficient manner allow compiler use predication instead of divergence branches.
Comments
Post a Comment