i want add 2 32-bit unsigned integers in cuda ptx , want take care of carry propagation. using code below that, result not expected.
acording documentation, add.cc.u32 d, a, b
performs integer addition , writes carry-out value condition code register, cc.cf
.
on other hand, addc.cc.u32 d, a, b
performs integer addition with carry-in , writes carry-out value condition code register. semantics of instruction be
d = + b + cc.cf
. tryed addc.u32 d, a, b
no difference.
#include <stdio.h> #include <stdlib.h> #include <cuda_runtime_api.h> #include "device_launch_parameters.h" #include <cuda.h> typedef unsigned int u32; #define try_cuda_call(x) \ \ { \ cudaerror_t err; \ err = x; \ if(err != cudasuccess) \ { \ printf("error %08x: %s @ %s in line %d\n", err, cudageterrorstring(err), __file__, __line__); \ exit(err); \ } \ } while(0) __device__ u32 __uaddo(u32 a, u32 b) { u32 res; asm("add.cc.u32 %0, %1, %2; /* inline */ \n\t" : "=r" (res) : "r" (a) , "r" (b)); return res; } __device__ u32 __uaddc(u32 a, u32 b) { u32 res; asm("addc.cc.u32 %0, %1, %2; /* inline */ \n\t" : "=r" (res) : "r" (a) , "r" (b)); return res; } __global__ void testing(u32* s) { u32 a, b; = 0xffffffff; b = 0x2; s[0] = __uaddo(a,b); s[0] = __uaddc(0,0); } int main() { u32 *s_dev; u32 *s; s = (u32*)malloc(sizeof(u32)); try_cuda_call(cudamalloc((void**)&s_dev, sizeof(u32))); testing<<<1,1>>>(s_dev); try_cuda_call( cudamemcpy(s, s_dev, sizeof(u32), cudamemcpydevicetohost) ); printf("s = %d;\n",s[0]); return 1; }
as far know, carry if result doesn't fit in variable, happens here , overflow if sign bit corrupted, i'm working unsigned values.
code above tries add 0xffffffff
0x2
, of course result won't fit on 32-bit, why don't 1 after __uaddc(0,0)
call?
edit
nvidia geforce gt 520mx
windows 7 ultimate, 64-bit
visual studio 2012
cuda 7.0
the data dependencies affecting asm()
statement explicitly expressed variable bindings. note can bind register operands, not condition codes. since in code result of __uaddo(a, b)
being overwritten, compiler determines not contribute observable results, therefore "dead code" , can eliminated. checked examining generated machine code (sass) release build cuobjdump --dump-sass
.
if had different code not allow compiler eliminate code __uaddo()
outright, there still issue compiler can schedule instructions likes between code generated __uaddo()
, __uaddc()
, , such instructions destroy setting of carry flag due __uaddo()
.
as consequence, if 1 plans use carry flag multi-word arithmetic, both carry-generating , carry-consuming instructions must occur in same asm()
statement. worked example can found in this answer shows how add 128-bit operands. alternatively, if 2 separate asm()
statements must used, 1 export carry flag setting earlier 1 c variable, import subsequent asm()
statement there. can't think of many situations practical, performance advantage of using carry flag lost.
Comments
Post a Comment