Implement CUDA using Tensorflow Methods.
Posted: Thu Jul 07, 2022 4:03 pm
I was often dissatisfied with the CUDA methods available as examples on the net; they were often inaccurate and did not perform adequately for a problem that requires the following:
O x NK x NC = N x NC * N x O x NK x NC

O  = flattened output dimension
NK = number of kernels
NC = number of channels (RGB)
N  = flattened input dimension

(This is an amended version of the previous equation for feedforwarding a convolutional neural network, O x N x NK x NC = N x K x NC * K x O x NK, where K is the flattened kernel dimension.)
To calculate this with CUDA, I could not iterate over NK, N, and O all in parallel without using the TensorFlow method (GPU_1D_KERN_LOOP) together with my Transkern array:
Code:
GPU_1D_KERN_LOOP(index, O_O * N_N * 4) {
    const int nk  = (index / 1) % 4;           // kernel number
    const int col = (index / 4) % O_O;         // output element
    const int row = (index / 4 / O_O) % N_N;   // input element
    // Accumulate input * kernel weight into the output; Transkern selects
    // which kernel element connects input row to output col.
    atomicAdd(&O_a[col + nk * O_O], I_a[row] * Kernel[nk * 25 + Transkern[row + col * N_N]]);
}
NB: size of kernel = 5x5 = 25, number of kernels = 4.
The Transkern array encodes a connection matrix over the output and input dimensions, O_O x N_N; there are NK Transkern matrices.
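The post does not show how Transkern is built, so below is a minimal sketch of one possible construction for a 5x5 "valid" convolution. Everything in it is an assumption: buildTranskern is a hypothetical helper, and unconnected input/output pairs are pointed at an explicit zero weight appended to each kernel (index 25), which would mean packing the kernels 26 floats apart (Kernel[nk * 26 + ...]) rather than the 25 used in the snippet above.
Code:
/* Hypothetical sketch: one way to build a Transkern connection matrix
   for a 5x5 "valid" convolution over a W x W input.  Unconnected
   (input, output) pairs map to kernel index 25, assumed to hold an
   explicit 0.0f, so the unconditional atomicAdd adds nothing for them. */
#include <vector>

std::vector<int> buildTranskern(int W) {          // assumed helper name
    const int OW  = W - 4;                        // output width for a "valid" conv
    const int N_N = W * W;                        // flattened input size
    const int O_O = OW * OW;                      // flattened output size
    std::vector<int> trans(N_N * O_O, 25);        // default: the zero slot
    for (int col = 0; col < O_O; ++col) {         // each output element
        const int oy = col / OW, ox = col % OW;
        for (int ky = 0; ky < 5; ++ky)
            for (int kx = 0; kx < 5; ++kx) {
                const int row = (oy + ky) * W + (ox + kx);   // connected input
                trans[row + col * N_N] = ky * 5 + kx;        // kernel element
            }
    }
    return trans;
}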
The TensorFlow routine is implemented with a basic configuration using CUDA streams and an iterator defined as follows:
(This is copied straight from TF source code!)
Code:
/* Am copying Google's TensorFlow here - to learn */
namespace detail {
template <typename T>
class GpuGridRange {
struct Iterator {
__device__ Iterator(T index, T delta) : index_(index), delta_(delta) {}
__device__ T operator*() const { return index_;}
__device__ Iterator& operator++() {
index_ += delta_;
return *this;
}
__device__ bool operator!=(const Iterator& other) const {
bool greater = index_ > other.index_;
bool less = index_ < other.index_;
if(!other.delta_){
return less;
}
if(!delta_){
return greater;
}
return less || greater;
}
private:
T index_;
const T delta_;
}; //end Iterator struct
public:
__device__ GpuGridRange(T begin,T delta,T end)
: begin_(begin),delta_(delta),end_(end) {}
__device__ Iterator begin() const {return Iterator(begin_,delta_); }
__device__ Iterator end() const {return Iterator(end_,0);}
private:
T begin_;
T delta_;
T end_;
}; // end class GpuGridRange
} // end namespace detail
template <typename T> //Allows you to use GPU iterator with all data types
__device__ detail::GpuGridRange<T> GpuGridRangeX(T count) {
return detail::GpuGridRange<T>(
/*begin*/blockIdx.x * blockDim.x + threadIdx.x,
/*delta*/gridDim.x * blockDim.x, /*end*/count
);
}
You can see the familiar CUDA code being used by the iterator defined above. It iterates, for any data type, across dimension x of the CUDA grid of blocks and threads: each thread begins at its global index and strides by the total thread count (gridDim.x * blockDim.x) until it reaches count.
It is then wrapped into a loop using the following simple define:
Code:
#define GPU_1D_KERN_LOOP(i, n) \
for (int i : ::GpuGridRangeX<int>(n))
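As a quick illustration (my own example, not from the TensorFlow source), here is a trivial kernel that scales a vector using the macro; any launch configuration covers all n elements, because each thread strides through the range rather than assuming one thread per element.
Code:
// Illustrative only: scale a vector with the grid-stride loop macro.
__global__ void ScaleKernel(float* data, float factor, int n) {
    GPU_1D_KERN_LOOP(index, n) {
        data[index] *= factor;   // each thread visits i, i + stride, i + 2*stride, ...
    }
}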
Here is the above routine with its launch configuration (note the important inclusion of CUDA streams):
Code:
cudaStream_t stream1;
cudaStreamCreate(&stream1);   // create an asynchronous CUDA stream
cudaMalloc((void **)&O_a, O_O * NK * sizeof(float));   // device output buffer
// Copy the host output array to the device without blocking the host.
cudaMemcpyAsync(O_a, Oarray, O_O * NK * sizeof(float), cudaMemcpyHostToDevice, stream1);
myroutine<<<1024, 1024, 0, stream1>>>(K_a, T_a, O_a, I_a, O_O, N_N);   // launch on stream1
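The excerpt above only sets up the output buffer. For completeness, here is a sketch of the remaining host-side steps the post omits, assuming host arrays named Karray, Tarray, and Iarray (hypothetical names) and the 5x5-kernel / 4-kernel sizes from earlier:
Code:
/* Sketch only: Karray, Tarray, and Iarray are assumed host-side names. */
cudaMalloc((void **)&K_a, 4 * 25 * sizeof(float));    // 4 kernels of 25 weights each
cudaMalloc((void **)&T_a, O_O * N_N * sizeof(int));   // Transkern connection matrix
cudaMalloc((void **)&I_a, N_N * sizeof(float));       // flattened input

cudaMemcpyAsync(K_a, Karray, 4 * 25 * sizeof(float), cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(T_a, Tarray, O_O * N_N * sizeof(int), cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(I_a, Iarray, N_N * sizeof(float), cudaMemcpyHostToDevice, stream1);

// ...launch myroutine as above, then retrieve the result:
cudaMemcpyAsync(Oarray, O_a, O_O * NK * sizeof(float), cudaMemcpyDeviceToHost, stream1);
cudaStreamSynchronize(stream1);   // wait for all work queued on stream1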
I believe this to be the best method for implementing CUDA for deep neural networks, since it makes full use of the same techniques TensorFlow itself uses.