First attempts to write a TensorFlow Clone!


Postby hbyte » Thu Mar 31, 2022 7:44 pm

Here are some baby steps: a first go at creating a TensorFlow-like GPU platform for basic machine learning on the GPU, using CUDA.

The TensorFlow source code reveals an interesting helper called GpuGridRange, which is basically an iterator that runs in parallel on your GPU:
Code: Select all
detail::GpuGridRange<T>(

   /*begin*/blockIdx.x * blockDim.x + threadIdx.x,
   /*delta*/gridDim.x * blockDim.x, /*end*/count
            );


However, we won't be using that just yet!
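For context, GpuGridRange is essentially TensorFlow's wrapper around the standard CUDA grid-stride loop pattern. A minimal hand-rolled sketch of the same idea (the kernel name and the add-one operation are just placeholders of mine) would look something like this:
Code: Select all
__global__ void add_one(float *data, int count)
{
    /* Grid-stride loop: each thread starts at its global index and then
       jumps forward by the total number of threads in the grid, so any
       count is covered regardless of the launch configuration. */
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < count;
         i += gridDim.x * blockDim.x)
    {
        data[i] += 1.0f;
    }
}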

Here instead is my more basic implementation, which will hopefully perform a convolution using multiple kernels over multiple channels, and a deconvolution using a transposed kernel.

But before we begin, we need to know about the kernel, which, as we know it, is usually a 2D square of weights like this:

Code: Select all
1,2,3,4
5,6,7,8
9,1,2,3
4,5,6,7

reshaped to 16x36 (shown here with one flattened 16-value copy of the kernel per row, one row per output position):

1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7
1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7
1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7
1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7
1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7
....... and so on, 36 rows in total
That's for an output of 6x6.


To be usable by the GPU, the kernel needs to be transformed into an array whose dimensions line up with both the output and the input; to do this we reshape the kernel against its output.
For the matrix multiplier to work, each pair of matrices must share a dimension, following the format:

NxO = NxK * KxO

Example:

For a 10x10 input and a 4x4 kernel we can apply the kernel to the input 6 times along and 6 times down (O = N - K, as in the code below), so we get a 6x6 output. OK.

So we need to reshape our kernel from 4x4 to 16x1, and then from 16x1 to 16x(6x6) = 16x36. That's our reshaped kernel matrix, ready for our GPU matrix multiplier: this is our KxO matrix, the kernel matrix.

The next matrix is the input, which we reshape first to 10x10x1 = 100x1 and then, using our kernel dimensions, to 100x16. This is our NxK matrix.

Finally, our output also gets a reshaped form: the 6x6 result comes out of the multiply as a 100x36 matrix. This is our output matrix,
NxO.
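
Just to sanity-check those shapes, here is a tiny host-side snippet (purely illustrative, not part of the layer code) that prints the dimensions for this example:
Code: Select all
#include <iostream>

int main()
{
    int N_side = 10, K_side = 4;
    int O_side = N_side - K_side;        // 6, matching O_ = N_ - K_ below

    int N = N_side * N_side;             // 100 rows of the input matrix
    int K = K_side * K_side;             // 16, the shared inner dimension
    int O = O_side * O_side;             // 36 columns of the output matrix

    // (N x K) * (K x O) = (N x O)
    std::cout << N << "x" << K << " * " << K << "x" << O
              << " = " << N << "x" << O << "\n";   // 100x16 * 16x36 = 100x36
    return 0;
}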

We also need to add extra dimensions for multiple kernels and channels (RGB):

so we get:

NxOxNKxNC = NxKxNC * KxOxNK

NK = number of kernels
NC = number of channels

The input is indexed by channel (RGB, normally), the kernel is indexed by kernel number, and the output is indexed by both. The GPU, however, just sees NxO = NxK * KxO.
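
One way to picture what those extra NK/NC dimensions mean is to imagine looping the plain NxO = NxK * KxO multiply once per (channel, kernel) pair; the GPU version just folds those loops into the index arithmetic. A rough host-side sketch of that view (my own illustration with an assumed memory layout, not the actual code below):
Code: Select all
/* For each channel c and kernel n, do one plain (N x K) * (K x O) multiply.
   I, T and Out are assumed to be laid out as contiguous per-(channel, kernel) blocks. */
void conv_as_batched_matmul(const float *I, const float *T, float *Out,
                            int N, int K, int O, int NC, int NK)
{
    for (int c = 0; c < NC; ++c)
        for (int n = 0; n < NK; ++n)
            for (int row = 0; row < N; ++row)
                for (int col = 0; col < O; ++col)
                {
                    float sum = 0.0f;
                    for (int i = 0; i < K; ++i)
                        sum += I[(c * N + row) * K + i] * T[(n * K + i) * O + col];
                    Out[((c * NK + n) * N + row) * O + col] = sum;
                }
}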

Here is the CUDA GPU function:
Code: Select all
__global__ void gpu_matrix_mult(float *I_a,float *T_a,float *O_a,int K_K,int N_N,int O_O)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0;
    if( col < O_O && row < N_N)
    {
        for(int i = 0; i < K_K; i++)
        {
            sum += I_a[i + row * K_K] * T_a[col + i * O_O];  /* I_a is N_N x K_K .. T_a is K_K x O_O */
        }
        O_a[col + row * O_O] = sum;                          /* O_a is N_N x O_O */
    }
}


It iterates using the CUDA block and grid format, which is initialised as follows:

Code: Select all
int K_ = kernelwidth;
int N_ = insize;
int O_ = N_ - K_;              // Width of output CNN
float *I_a, *T_a, *O_a;

int NC = 3, NK = 25;           /* RGB 3 channels, 25 kernels */

unsigned int grid_rows = (N_*N_*K_*K_*NK*NC + BLOCK_SIZE - 1) / BLOCK_SIZE;
unsigned int grid_cols = (O_*O_*N_*N_*NK*NC + BLOCK_SIZE - 1) / BLOCK_SIZE;
dim3 grid(grid_cols, grid_rows);
dim3 block(BLOCK_SIZE, BLOCK_SIZE);
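
As an aside, the more common way to size a 2D grid is to divide each output dimension by the block size separately rather than using total element counts, something along these lines (my sketch, using the N_N and O_O row/column counts that get passed to gpu_matrix_mult):
Code: Select all
/* One thread per output element: x covers the O_O columns, y the N_N rows */
dim3 block(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid((O_O + BLOCK_SIZE - 1) / BLOCK_SIZE,
          (N_N + BLOCK_SIZE - 1) / BLOCK_SIZE);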


This is a first try. The goal is to have reshaping, convolution, transposing and deconvolution all done using CUDA functions like this.

So far so good!

Re: First attempts to write a TensorFlow Clone!

Postby hbyte » Thu Mar 31, 2022 9:11 pm

Code: Select all
#include <algorithm>
#include <complex>
#include <cstdlib>               // rand, srand
#include <ctime>                 // time, for seeding the RNG
#include <iostream>
#include <cuda_runtime.h>        // cudaMalloc, cudaMemcpy, cudaFree
#define FRACTION_CEILING(numerator, denominator) ((numerator+denominator-1)/denominator)

using namespace std;

__global__
void gpu_matrix_mult(float *I_a,float *T_a,float *O_a, int K_K, int N_N, int O_O);

class Deconvolutor
{
/*Kernel function example that uses the above*/

   public:


double getlrand(double lower,double upper){

           return ((double) rand()/ RAND_MAX) * (upper-lower) + lower;

}


void allOnes(int N_,int ND,float* Iarray){
    /* Despite the name, this fills the N_ x N_ x ND array with
       random integer values in [0,10) - same as setInputs below */

    int val;

    for(int i=0;i<ND;i++){          // This could be moved to an on-device function
        for(int j=0;j<N_;j++){
            for(int k=0;k<N_;k++){

                val = (int)getlrand(0,10);
                Iarray[i+N_*(j+N_*k)] = (float)val;
            }
        }
    }
}




void setInputs(int N_,int ND,float* Iarray){
    /* Fill an N_ x N_ x ND array with random integer values in [0,10) */

    int val;

    for(int i=0;i<ND;i++){          // This could be moved to an on-device function
        for(int j=0;j<N_;j++){
            for(int k=0;k<N_;k++){

                val = (int)getlrand(0,10);
                Iarray[i+N_*(j+N_*k)] = (float)val;
            }
        }
    }
}

void reshapeKernel(int O_,int K_,int NK,int NC, float* Kernel,float* Tarray){
/* Tile each flattened K_ x K_ kernel once per output position (no padding) */

    int l=0, k=-1, count;
    float *Tarray_  = new float[O_*O_*K_*K_];
    float *Kernel_  = new float[K_*K_];
    float *Kernel1D = new float[K_*K_];

    //for(int m=0;m<NC;m++){            /* channels not handled here yet */

    for(int n=0;n<NK;n++){

        /* Copy the n-th kernel into a flat K_ x K_ scratch buffer */
        count=-1;
        for(int f=0;f<K_;f++){
            for(int g=0;g<K_;g++){
                count++;
                Kernel1D[count] = Kernel_[f+K_*g] = Kernel[n+K_*(f+K_*g)];
            }
        }

        /* Write one flattened copy of the kernel per output position */
        for(int i=0;i<O_*O_;i++){
            k=-1;
            l=0;
            for(int j=0;j<K_*K_;j++){
                if(k==K_-1){l++;k=-1;}
                k++;
                //Tarray_[i*K_*K_+j] = Kernel1D[j];
                Tarray_[i*K_*K_+j] = Kernel_[l+k*K_];
            }
        }

        /* Copy this kernel's block into the full reshaped matrix */
        for(int i=0;i<O_*O_;i++){
            for(int j=0;j<K_*K_;j++){
                Tarray[n*O_*O_*K_*K_+(i*K_*K_+j)] = Tarray_[i*K_*K_+j];
            }
        }
    }

    delete[] Tarray_;                   /* free the scratch buffers */
    delete[] Kernel_;
    delete[] Kernel1D;
}

/*class wrapper function for layer*/

#define BLOCK_SIZE 16

void runlayer(int kernelwidth,int insize){

    

int K_=kernelwidth;
int N_=insize;
int O_=N_-K_;   //Width of output CNN
float *I_a,*T_a,*O_a;

int NC = 3,NK = 25;            /*RGB 3 channels , 25 x kernels*/




unsigned int grid_rows = (N_*N_*K_*K_*NK*NC + BLOCK_SIZE - 1) / BLOCK_SIZE;
unsigned int grid_cols = (O_*O_*N_*N_*NK*NC + BLOCK_SIZE - 1) / BLOCK_SIZE;
dim3 grid(grid_cols, grid_rows);
dim3 block(BLOCK_SIZE, BLOCK_SIZE);



float *Iarray,*Oarray,*Tarray,*Kernel;      /*Have this as input variable to this wrapper*/
Iarray = new float[N_*N_*K_*K_*NK*NC];      /*100 x 1 = 10x10 x 1*/
Oarray = new float[O_*O_*N_*N_*NK*NC];      /*36 x 1 = 6x6 x 1*/
Tarray = new float[O_*O_*K_*K_*NK*NC];      /*36 x 16 = 6x6 x 4x4*/
Kernel = new float[K_*K_*NK];



srand(time(NULL));         /*Seed RNG*/
allOnes(N_,NC,Iarray);         /*Randomise Inputs*/
setInputs(K_,NK,Kernel);      /*Randomise Kernel wgts*/
reshapeKernel(O_,K_,NK,NC,Kernel,Tarray);   /*Reshape Kernel with output Dim*/

//Calculate size of each Matrix
int K_K = K_*K_*NK;
int O_O = O_*O_*NK;
int N_N = N_*N_*NC;


cudaMalloc((void **)&I_a, N_ * N_ * K_ * K_ * NK * NC * sizeof(float));
cudaMalloc((void **)&T_a, O_ * O_ * K_ * K_ * NK * NC * sizeof(float));
cudaMalloc((void **)&O_a, O_ * O_ * N_ * N_ * NK * NC * sizeof(float));
cudaMemcpy(I_a, Iarray, N_ * N_ * K_ * K_ * NK * NC * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(T_a, Tarray, K_ * K_ * O_ * O_ * NK * NC * sizeof(float), cudaMemcpyHostToDevice);

//Do the Kernel function
gpu_matrix_mult<<<grid,block>>>(I_a,T_a,O_a,K_K,N_N,O_O);

//Copy the output matrix back
cudaMemcpy(Oarray, O_a, O_ * O_ * N_ * N_ * NK * NC *sizeof(float), cudaMemcpyDeviceToHost);

cudaFree(O_a);
cudaFree(I_a);
cudaFree(T_a);


cout<<"\nInput:\n";
for(int i=0;i<NC;i++){
   cout<<"\nChannel:"<<i<<"\n";
for(int j=0;j<N_;j++){
   cout<<"\n";
   for(int k=0;k<N_;k++){

cout<<Iarray[i+N_*(j+k*N_)]<<",";

   }}
}


cout<<"Kernel:\n";
for(int i=0;i<NK;i++){
cout<<"\nKernel "<<i<<"\n";
for(int j=0;j<K_;j++){
   cout<<"\n";
   for(int k=0;k<K_;k++){
cout<<Kernel[i+K_*(j+k*K_)]<<",";

         }
         }
   }


cout<<"\nReshaped Kernel:\n";


   for(int j=0;j<NK;j++){

cout<<"\nKernel:"<<j<<"\n";

for(int k=0;k<O_*O_;k++){
   
   cout<<"\n";

   for(int l=0;l<K_*K_;l++){

   cout<<Tarray[j*O_*O_*K_*K_+(k*K_*K_+l)]<<",";

            }
            }
   }
   
cout<<"Yey!\n";


cout<<"\nOutput:\n";
for(int i=0;i<NK;i++){
   cout<<"\n";
   for(int j=0;j<O_;j++){
   cout<<"\n";   
      for(int k=0;k<O_;k++){
   cout<<Oarray[i+O_*O_*(j+k*O_)]<<",";

      }
   }}

delete[] Iarray;
delete[] Oarray;
delete[] Tarray;
delete[] Kernel;

}
};

__global__ void gpu_matrix_mult(float *I_a,float *T_a,float *O_a,int K_K,int N_N,int O_O)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0;
    if( col < O_O && row < N_N)
    {
        for(int i = 0; i < K_K; i++)
        {
            sum += I_a[i + row * K_K] * T_a[col + i * O_O];  /* I_a is N_N x K_K .. T_a is K_K x O_O */
        }
        O_a[col + row * O_O] = sum;                          /* O_a is N_N x O_O */
    }
}



int main(){

    Deconvolutor Decon;

    Decon.runlayer(4,10);

    return 0;
}
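
For anyone who wants to try it, assuming the listing is saved as something like deconv.cu (filename is mine), it should build and run with nvcc roughly like this:
Code: Select all
nvcc -o deconv deconv.cu
./deconv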

Re: First attempts to write a TensorFlow Clone!

Postby hbyte » Fri Apr 01, 2022 12:57 pm

This example is lovely and it did appear to work. However, on closer scrutiny, these very large 1D arrays are extremely fragile memory-wise (one bad index and things fall over), which makes me think I need to be using std containers or structs instead.

I don't know. I will have to examine the TF source. Which is awesome!
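
One option along those lines would be to wrap the flat buffers in std::vector so each buffer carries its own shape and bad indices can be caught instead of silently corrupting memory. A minimal sketch of what I have in mind (names and layout are mine, not from the code above):
Code: Select all
#include <cstddef>
#include <stdexcept>
#include <vector>

/* A tiny tensor-ish wrapper: a flat buffer plus its 3D shape */
struct Tensor3
{
    int d0, d1, d2;
    std::vector<float> data;

    Tensor3(int a, int b, int c)
        : d0(a), d1(b), d2(c), data((std::size_t)a * b * c, 0.0f) {}

    /* Bounds-checked access; throws instead of scribbling over memory */
    float &at(int i, int j, int k)
    {
        if (i < 0 || i >= d0 || j < 0 || j >= d1 || k < 0 || k >= d2)
            throw std::out_of_range("Tensor3 index out of range");
        return data[((std::size_t)i * d1 + j) * d2 + k];
    }

    float       *ptr()         { return data.data(); }                /* for cudaMemcpy */
    std::size_t  bytes() const { return data.size() * sizeof(float); }
};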