OxNK x NC = N x NC * N x OxNK x NC
O = flattened output dimension
NK = Number of kernels
NC = Number of channels (e.g. 3 for RGB)
N = flattened input dimension
(This is an amended version of the previous equation for feedforwarding a convolutional neural network:
OxNxNKxNC = NxKxNC * KxOxNK, where K is the kernel size.)
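As a purely illustrative example (these numbers are my own, not from the network above): a 28x28 input convolved with a 5x5 kernel (valid convolution) gives N = 784 and O = 24x24 = 576, so with NK = 4 and NC = 3 the equation reads 2304 x 3 = 784 x 3 * 784 x 2304 x 3, i.e. a 2304 x 3 output produced from a 784 x 3 input and a 784 x 2304 connection tensor per channel.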
To calculate this with CUDA I could not iterate NK, N and O all in parallel without using the TensorFlow method (GPU_1D_KERN_LOOP) and my Transkern array:
Code:
GPU_1D_KERN_LOOP(index, O_O * N_N * 4) {
    // Decompose the flat index into (kernel, output column, input row).
    const int nk  = (index / 1) % 4;          // which of the 4 kernels
    const int col = (index / 4) % O_O;        // output position
    const int row = (index / 4 / O_O) % N_N;  // input position
    // Accumulate this input's contribution to output element (col, nk);
    // atomicAdd is needed because many threads update the same output element.
    atomicAdd(&O_a[col + nk * O_O],
              I_a[row] * Kernel[nk * 25 + Transkern[row + col * N_N]]);
}
NB. kernel size = 5x5 = 25, number of kernels = 4.
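To make the index arithmetic above concrete, here is a small host-side check (my own illustration, using the example dimensions mentioned earlier, not code from the post) showing how a flat loop index decomposes back into (nk, col, row):
Code:
#include <cstdio>

int main() {
    const int O_O = 576, N_N = 784;              // illustrative dimensions only
    const int index = 3 + 4 * 10 + 4 * O_O * 7;  // built from nk = 3, col = 10, row = 7
    const int nk  = (index / 1) % 4;
    const int col = (index / 4) % O_O;
    const int row = (index / 4 / O_O) % N_N;
    printf("nk=%d col=%d row=%d\n", nk, col, row);  // prints nk=3 col=10 row=7
    return 0;
}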
The Transkern array encodes a connection matrix over the output and input dimensions, O_O x N_N; there are NK Transkern matrices.
The TF-style routine is implemented with a basic configuration using CUDA streams and an iterator defined as follows (this is copied straight from the TensorFlow source code!):
Code:
/* Copying Google's TensorFlow here - to learn */
namespace detail {

template <typename T>
class GpuGridRange {
  struct Iterator {
    __device__ Iterator(T index, T delta) : index_(index), delta_(delta) {}
    __device__ T operator*() const { return index_; }
    __device__ Iterator& operator++() {
      index_ += delta_;
      return *this;
    }
    // The end() iterator is constructed with delta_ == 0; comparing against it
    // stops the loop once index_ has reached or passed the end value.
    __device__ bool operator!=(const Iterator& other) const {
      bool greater = index_ > other.index_;
      bool less = index_ < other.index_;
      if (!other.delta_) {
        return less;
      }
      if (!delta_) {
        return greater;
      }
      return less || greater;
    }

   private:
    T index_;
    const T delta_;
  };  // end Iterator struct

 public:
  __device__ GpuGridRange(T begin, T delta, T end)
      : begin_(begin), delta_(delta), end_(end) {}

  __device__ Iterator begin() const { return Iterator(begin_, delta_); }
  __device__ Iterator end() const { return Iterator(end_, 0); }

 private:
  T begin_;
  T delta_;
  T end_;
};  // end GpuGridRange class

}  // end namespace detail

template <typename T>  // Allows the GPU iterator to be used with any index type
__device__ detail::GpuGridRange<T> GpuGridRangeX(T count) {
  return detail::GpuGridRange<T>(
      /*begin=*/blockIdx.x * blockDim.x + threadIdx.x,
      /*delta=*/gridDim.x * blockDim.x,
      /*end=*/count);
}
You can see the familiar CUDA indexing being used by the iterator defined above. It iterates, for any data type, across the x dimension of the CUDA grid, blocks and threads.
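To make the stride concrete (this worked example is mine): with the <<<1024,1024>>> launch used further down, the stride is gridDim.x * blockDim.x = 1024 * 1024 = 1,048,576, so the thread with blockIdx.x = 2 and threadIdx.x = 5 starts at index 2*1024 + 5 = 2053 and then visits 2053 + 1,048,576, 2053 + 2*1,048,576, and so on until count is reached.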
The loop itself is then expressed as a simple macro:
Code:
#define GPU_1D_KERN_LOOP(i, n) \
for (int i : ::GpuGridRangeX<int>(n))
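As a quick illustration of the macro in use (my own toy example, not from the original post), here is a trivial kernel that scales a vector in place with the same grid-stride loop:
Code:
// Toy example: multiply every element of data by alpha.
__global__ void scale(float *data, float alpha, int n) {
    GPU_1D_KERN_LOOP(i, n) {
        data[i] *= alpha;
    }
}
// e.g. scale<<<1024, 1024, 0, stream1>>>(d_data, 2.0f, n);
Because each thread strides across the whole range, the launch configuration does not have to match the problem size exactly.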
Here is the above routine with its launch configuration (note the important inclusion of CUDA streams):
Code:
cudaStream_t stream1;
cudaStreamCreate(&stream1);

// Allocate the device output buffer, asynchronously copy the (initialised) host
// output array into it on stream1, then launch the kernel on the same stream.
cudaMalloc((void **)&O_a, O_O * NK * sizeof(float));
cudaMemcpyAsync(O_a, Oarray, O_O * NK * sizeof(float), cudaMemcpyHostToDevice, stream1);
myroutine<<<1024, 1024, 0, stream1>>>(K_a, T_a, O_a, I_a, O_O, N_N);
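For completeness, here is a sketch of how the remaining buffers around that launch might be handled; the host array names Iarray, Karray and Tarray, the int element type for the Transkern table, and the exact buffer sizes are my assumptions, inferred from the indexing in the kernel above rather than taken from the original post:
Code:
// Sketch only: I_a is assumed to hold N_N floats, K_a NK*25 floats (4 kernels
// of 5x5 weights), and T_a O_O*N_N ints for the Transkern lookup table.
cudaMalloc((void **)&I_a, N_N * sizeof(float));
cudaMalloc((void **)&K_a, NK * 25 * sizeof(float));
cudaMalloc((void **)&T_a, O_O * N_N * sizeof(int));
cudaMemcpyAsync(I_a, Iarray, N_N * sizeof(float), cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(K_a, Karray, NK * 25 * sizeof(float), cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(T_a, Tarray, O_O * N_N * sizeof(int), cudaMemcpyHostToDevice, stream1);

myroutine<<<1024, 1024, 0, stream1>>>(K_a, T_a, O_a, I_a, O_O, N_N);

// Copy the result back on the same stream, then wait for all queued work.
cudaMemcpyAsync(Oarray, O_a, O_O * NK * sizeof(float), cudaMemcpyDeviceToHost, stream1);
cudaStreamSynchronize(stream1);
cudaStreamDestroy(stream1);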
I believe this method to be the best one I have found for implementing deep neural net processing in CUDA, as it makes full use of the above methods used by TensorFlow.