Just a quick note.
If you start from the template C++ CUDA project in VS 2012/2013, launching a kernel from inside another kernel (dynamic parallelism) does not compile out of the box:
error : kernel launch from __device__ or __global__ functions requires separate compilation mode
To fix this, first make sure your hardware supports dynamic parallelism (compute capability 3.5 or higher) and that the code generation target in the CUDA compiler options matches it (“compute_35,sm_35”).
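Then set -rdc=true in the CUDA C/C++ options (the “Generate Relocatable Device Code” setting). This turns on the separate compilation mode that the error message asks for.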
Finally, add cudadevrt.lib (the CUDA device runtime) to the libraries that are statically linked, under Additional Dependencies in the Linker options.
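
To check that the settings took effect, a minimal sketch like the one below can help. The file name and the kernel names (dp_minimal.cu, parentKernel, childKernel) are made up for illustration, and the nvcc line in the comment is only the command-line counterpart of the three project settings above. It uses sm_35 to match this note; newer toolkits may no longer accept that target, so substitute your GPU's architecture if needed.

```cuda
// dp_minimal.cu: minimal dynamic parallelism sketch (file and kernel names are hypothetical).
// Assumed command-line equivalent of the project settings described above:
//   nvcc -arch=sm_35 -rdc=true dp_minimal.cu -o dp_minimal -lcudadevrt
#include <cstdio>
#include <cuda_runtime.h>

// Child kernel: launched from device code by parentKernel.
__global__ void childKernel(int parent)
{
    printf("child thread %d launched by parent thread %d\n", threadIdx.x, parent);
}

// Parent kernel: each thread launches its own small child grid.
// This device-side launch is what needs -rdc=true and cudadevrt.
__global__ void parentKernel()
{
    childKernel<<<1, 2>>>(threadIdx.x);
}

int main()
{
    parentKernel<<<1, 2>>>();

    // Catch launch-time errors (e.g. no device with a matching architecture).
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        // Waits for the parent grid and all child grids it launched.
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

If the project is still missing -rdc=true, this file should reproduce the "requires separate compilation mode" error at the childKernel launch. If only cudadevrt.lib is missing, it should compile but fail at link time with unresolved device runtime symbols.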