First attempts to write a TensorFlow Clone!
Posted: Thu Mar 31, 2022 7:44 pm
Here are some baby steps: a first go at creating a TensorFlow-like platform for basic machine learning on the GPU, using CUDA.
The TensorFlow source code reveals an interesting function called GpuGridRange, which is basically an iterator run in parallel on your GPU:
Code:
detail::GpuGridRange<T>(
/*begin*/blockIdx.x * blockDim.x + threadIdx.x,
/*delta*/gridDim.x * blockDim.x, /*end*/count
);
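For comparison, the same pattern can be written by hand as a plain grid-stride loop. Here is a minimal sketch of the idea; the kernel name and the per-element work are just illustrative, not from TensorFlow:

Code:
// Minimal grid-stride loop sketch: each thread starts at its global index
// and jumps by the total number of threads in the grid until `count`
// elements have been covered.
__global__ void scale_kernel(float *data, float factor, int count)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < count;
         i += gridDim.x * blockDim.x)
    {
        data[i] *= factor; // illustrative per-element work
    }
}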
However, we won't be using that just yet!
Here instead is my more basic implementation that will hopefully perform a convolution using multiple kernels over multiple channels and a deconvolution using a transposed kernel.
But before we begin, we need to know that the kernel, as we know it, is usually a 2D square of weights like this:
Code:
1,2,3,4
5,6,7,8
9,1,2,3
4,5,6,7
reshaped to 16x36 (the 16 flattened weights repeated once per output position):
1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7
1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7
1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7
1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7
1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7
... and so on
That's for an output of 6x6.
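A rough host-side sketch of that reshaping step; the function and argument names are mine, and I'm assuming the reshaped kernel matrix is stored as K*K rows by O*O columns, which is how the GPU function further down indexes its T_a argument:

Code:
/* Sketch: tile a flattened KxK kernel into the KxO kernel matrix
   described above. Every column of the (K*K) x (O*O) matrix ends up
   holding the same flattened kernel (16x36 in the example). */
void reshape_kernel(const float *kernel, float *T_a, int K, int O)
{
    for (int i = 0; i < K * K; i++)             /* each kernel weight   */
        for (int pos = 0; pos < O * O; pos++)   /* each output position */
            T_a[i * (O * O) + pos] = kernel[i];
}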
In order for it to be usable by the GPU, it needs to be transformed into an array whose dimensions match up with the output and the input. To do this we need to reshape the kernel to match its output.
In order for the matrix multiplier to work, each pair of matrices must share a dimension, using the format:
NxO = NxK * KxO
Example:
For a 10x10 input and a 4x4 kernel we apply the kernel 6 times along and 6 down (using the convention O = N - K, as in the code below), so we get a 6x6 output.
So we need to reshape our kernel from 4x4 to 16x1, and then from 16x1 to 16x(6x6) = 16x36. That's our reshaped kernel matrix, ready for our GPU matrix multiplier; this is our KxO matrix, the kernel matrix.
The next matrix is the input, which we reshape first from 10x10x1 to 100x1 and then, using our kernel dimensions, to 100x16. This is our NxK matrix.
Finally, our output also gets reshaped, from 6x6x1 to 100x36. This makes our output matrix
NxO
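Written out as plain shape arithmetic, so the numbers above are easy to check (variable names are mine, reusing the O = N - K convention from the code further down):

Code:
int insize      = 10;                    /* input width                    */
int kernelwidth = 4;                     /* kernel width                   */
int outsize     = insize - kernelwidth;  /* 6, the output width            */
int N = insize * insize;                 /* 100 : rows of the input matrix */
int K = kernelwidth * kernelwidth;       /* 16  : shared inner dimension   */
int O = outsize * outsize;               /* 36  : columns of the output    */
/* NxK (100x16) * KxO (16x36) = NxO (100x36) */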
We also need to add additional dimensions for multiple kernels and channels (RGB), so we get:
NxOxNKxNC = NxKxNC * KxOxNK
NK = number of kernels
NC = number of channels
The input is indexed by channel (normally RGB), the kernel is indexed by kernel number, and the output is indexed by both. The GPU, however, just sees NxO = NxK * KxO.
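One way to picture that on the host side is as pointer offsets into the batched arrays: each (channel, kernel) pair selects one NxK slice of the input and one KxO slice of the kernel matrix, and each pair feeds the plain multiply kernel gpu_matrix_mult shown next. A sketch, assuming the slices are simply stored back to back (the function name and layout are my assumptions):

Code:
/* Sketch: loop over every (channel, kernel) pair and run one plain
   NxK * KxO multiply per pair. Layout assumption: the per-channel and
   per-kernel slices are stored contiguously, back to back. */
void conv_all_pairs(float *I_a, float *T_a, float *O_a,
                    int NC, int NK, int N, int K, int O,
                    dim3 grid, dim3 block)
{
    for (int c = 0; c < NC; c++)
        for (int k = 0; k < NK; k++)
        {
            float *in  = I_a + (long)c * N * K;            /* NxK slice for channel c */
            float *ker = T_a + (long)k * K * O;            /* KxO slice for kernel k  */
            float *out = O_a + ((long)c * NK + k) * N * O; /* NxO slice for this pair */
            gpu_matrix_mult<<<grid, block>>>(in, ker, out, K, N, O);
        }
}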
Here is the CUDA GPU function:
Code:
__global__ void gpu_matrix_mult(float *I_a, float *T_a, float *O_a, int K_K, int N_N, int O_O)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0;
    if (col < O_O && row < N_N)
    {
        for (int i = 0; i < K_K; i++)
        {
            sum += I_a[i + row * K_K] * T_a[col + i * O_O]; /* I_a is N_N x K_K .. T_a is K_K x O_O */
        }
        O_a[col + row * O_O] = sum; /* O_a is N_N x O_O */
    }
}
It iterates using the CUDA block and grid format, which is initialised as follows:
Code:
int K_=kernelwidth;
int N_=insize;
int O_=N_-K_; //Width of output CNN
float *I_a,*T_a,*O_a;
int NC = 3, NK = 25; /* RGB 3 channels, 25 kernels */
unsigned int grid_rows = (N_*N_*K_*K_*NK*NC + BLOCK_SIZE - 1) / BLOCK_SIZE;
unsigned int grid_cols = (O_*O_*N_*N_*NK*NC + BLOCK_SIZE - 1) / BLOCK_SIZE;
dim3 grid(grid_cols, grid_rows);
dim3 block(BLOCK_SIZE, BLOCK_SIZE);
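For completeness, here is a rough sketch of how the surrounding host code might continue from there for a single NxK * KxO multiply. The arrays h_input, h_kernel and h_output are assumed host-side buffers, not part of the actual listing:

Code:
/* Sketch: allocate device buffers, copy the reshaped host matrices in,
   launch with the grid/block above, and copy the result back.
   h_input, h_kernel, h_output are assumed to already exist on the host. */
int N = N_ * N_;  /* 100 : input elements  */
int K = K_ * K_;  /* 16  : kernel elements */
int O = O_ * O_;  /* 36  : output elements */

cudaMalloc((void**)&I_a, (size_t)N * K * sizeof(float));
cudaMalloc((void**)&T_a, (size_t)K * O * sizeof(float));
cudaMalloc((void**)&O_a, (size_t)N * O * sizeof(float));

cudaMemcpy(I_a, h_input,  (size_t)N * K * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(T_a, h_kernel, (size_t)K * O * sizeof(float), cudaMemcpyHostToDevice);

gpu_matrix_mult<<<grid, block>>>(I_a, T_a, O_a, K, N, O);
cudaDeviceSynchronize();

cudaMemcpy(h_output, O_a, (size_t)N * O * sizeof(float), cudaMemcpyDeviceToHost);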
This is a first try. The goal is to have reshaping, convolution, transposing and deconvolution all done using CUDA functions like this.
So far so good!