CUDA - PTX carry propagation -


i want add 2 32-bit unsigned integers in cuda ptx , want take care of carry propagation. using code below that, result not expected.
acording documentation, add.cc.u32 d, a, b performs integer addition , writes carry-out value condition code register, cc.cf.
on other hand, addc.cc.u32 d, a, b performs integer addition with carry-in , writes carry-out value condition code register. semantics of instruction be
d = + b + cc.cf. tryed addc.u32 d, a, b no difference.

#include <stdio.h> #include <stdlib.h> #include <cuda_runtime_api.h> #include "device_launch_parameters.h" #include <cuda.h>  typedef unsigned int u32; #define try_cuda_call(x) \ \   { \     cudaerror_t err; \     err = x; \     if(err != cudasuccess) \   { \     printf("error %08x: %s @ %s in line %d\n", err, cudageterrorstring(err), __file__, __line__); \     exit(err); \   } \ } while(0)   __device__ u32 __uaddo(u32 a, u32 b) {     u32 res;     asm("add.cc.u32 %0, %1, %2; /* inline */ \n\t"          : "=r" (res) : "r" (a) , "r" (b));     return res; }  __device__ u32 __uaddc(u32 a, u32 b) {     u32 res;     asm("addc.cc.u32 %0, %1, %2; /* inline */ \n\t"          : "=r" (res) : "r" (a) , "r" (b));     return res; }  __global__ void testing(u32* s) {     u32 a, b;      = 0xffffffff;     b = 0x2;      s[0] = __uaddo(a,b);     s[0] = __uaddc(0,0);  }  int main() {     u32 *s_dev;     u32 *s;     s = (u32*)malloc(sizeof(u32));     try_cuda_call(cudamalloc((void**)&s_dev, sizeof(u32)));     testing<<<1,1>>>(s_dev);     try_cuda_call( cudamemcpy(s, s_dev, sizeof(u32), cudamemcpydevicetohost) );      printf("s = %d;\n",s[0]);       return 1; } 

as far know, carry if result doesn't fit in variable, happens here , overflow if sign bit corrupted, i'm working unsigned values.
code above tries add 0xffffffff 0x2 , of course result won't fit on 32-bit, why don't 1 after __uaddc(0,0) call?

edit

nvidia geforce gt 520mx
windows 7 ultimate, 64-bit
visual studio 2012
cuda 7.0

the data dependencies affecting asm() statement explicitly expressed variable bindings. note can bind register operands, not condition codes. since in code result of __uaddo(a, b) being overwritten, compiler determines not contribute observable results, therefore "dead code" , can eliminated. checked examining generated machine code (sass) release build cuobjdump --dump-sass.

if had different code not allow compiler eliminate code __uaddo() outright, there still issue compiler can schedule instructions likes between code generated __uaddo() , __uaddc(), , such instructions destroy setting of carry flag due __uaddo().

as consequence, if 1 plans use carry flag multi-word arithmetic, both carry-generating , carry-consuming instructions must occur in same asm() statement. worked example can found in this answer shows how add 128-bit operands. alternatively, if 2 separate asm() statements must used, 1 export carry flag setting earlier 1 c variable, import subsequent asm() statement there. can't think of many situations practical, performance advantage of using carry flag lost.


Comments