Cuda Investigations - Part 1 - The Parallel Sorting Algo

Postby hbyte » Sat Sep 02, 2023 1:16 pm

I needed to build a very fast sorting algorithm that could take a lot of arrays and sort them all at once.

Obviously this is a job for CUDA, but how?

When I researched online I found a couple of sorting algorithms from the NVIDIA CUDA team and others. They all used CUDA to perform a divide-and-conquer approach.

This is where a list of values - normally integers - is divided into a left and a right section. Each side is then iterated over and compared with its opposite: if the left value is greater than the right value, they are swapped. Eventually you end up with a list sorted in ascending order from left to right. This is a simple sort.

Code: Select all
__device__ void selection_sort( int *data, int left, int right )
{
  for( int i = left ; i <= right ; ++i ){
    int min_val = data[i];
    int min_idx = i;

    // Find the smallest value in the remaining range [i, right].
    for( int j = i+1 ; j <= right ; ++j ){
      int val_j = data[j];
      if( val_j < min_val ){
        min_idx = j;
        min_val = val_j;
      }
    }

    // Swap the values.
    if( i != min_idx ){
      data[min_idx] = data[i];
      data[i] = min_val;
    }
  }
}
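
On its own selection_sort is a __device__ function, so it still needs a kernel around it. A minimal sketch of how it could be wrapped so that each thread sorts one list - which is the "many arrays at once" goal - might look like this (the kernel name and buffer layout are my own assumptions, not part of the NVIDIA sample):

Code: Select all
// Hedged sketch: one thread per list, each thread runs selection_sort
// on its own slice of a big buffer holding numLists arrays of listLen ints.
__global__ void sort_many_lists(int* data, int numLists, int listLen)
{
    int list = blockIdx.x * blockDim.x + threadIdx.x;
    if (list >= numLists) return;

    int base = list * listLen;
    selection_sort(data, base, base + listLen - 1);
}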


In a merge sort the list is broken down into smaller lists, each is sorted, and then they are merged back into the whole. The process is normally recursive, as each left and right side of the list is broken down into smaller segments - divide and conquer.

The problem with merge sort was that for larger lists it wouldn't perform well enough in parallel (for multiple lists). But with some help I found a method that did.

See here for some useful code:

https://www.geeksforgeeks.org/merge-sort/
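
To make the idea concrete, here is a minimal CPU sketch of that recursive merge sort (my own illustration, not the code from the link):

Code: Select all
// Merge the two sorted halves data[left..mid] and data[mid+1..right]
// using a scratch buffer tmp of the same length as data.
void merge(int* data, int* tmp, int left, int mid, int right)
{
    int i = left, j = mid + 1, k = left;
    while (i <= mid && j <= right)
        tmp[k++] = (data[i] <= data[j]) ? data[i++] : data[j++];
    while (i <= mid)   tmp[k++] = data[i++];
    while (j <= right) tmp[k++] = data[j++];
    for (k = left; k <= right; k++)
        data[k] = tmp[k];
}

// Divide and conquer: split the range in half, sort each half
// recursively, then merge the two sorted halves back together.
void merge_sort(int* data, int* tmp, int left, int right)
{
    if (left >= right) return;
    int mid = left + (right - left) / 2;
    merge_sort(data, tmp, left, mid);
    merge_sort(data, tmp, mid + 1, right);
    merge(data, tmp, left, mid, right);
}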

Many CUDA sorting methods employ recursion, and this method drew me in. It was different from the usual kernel methods I had been using for matrix math in neural networks.

I wondered whether recursion, streams and kernels in CUDA could be combined to run a parallel sort over multiple lists.

Recursion in CUDA:
Code: Select all
// Parallel quicksort kernel
__global__ void quicksort(int* data, int low, int high) {
    if (low < high) {
        int pi = partition(data, low, high);

        quicksort<<<1, 1>>>(data, low, pi - 1);
        quicksort<<<1, 1>>>(data, pi + 1, high);
    }
}


As you can see, with recursion the kernel launches itself twice - once for each side of the partition. Weird!!
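
For the record, this device-side launching is CUDA dynamic parallelism, which needs a GPU of compute capability 3.5 or higher and relocatable device code at compile time. Here is a rough sketch of the top-level host launch (the names n and h_data are my own, not from the post):

Code: Select all
// Hedged sketch: launching the recursive quicksort kernel from the host.
// Build with relocatable device code, e.g.
//   nvcc -rdc=true quicksort.cu -o quicksort -lcudadevrt
int* d_data;
cudaMalloc(&d_data, n * sizeof(int));
cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);

quicksort<<<1, 1>>>(d_data, 0, n - 1);   // recursion then happens on the device
cudaDeviceSynchronize();

cudaMemcpy(h_data, d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(d_data);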

**********

My original method of sorting an array on the CPU (see below) was not working in CUDA for some reason.

Original Method (CPU)

Code: Select all
// Keep making passes over the list, swapping neighbours, until a full
// pass makes no swaps (a bubble sort). As written it puts the larger
// value first, so the list ends up in descending order.
int TMP, SORTED;

do{
   SORTED = 1;

   for(int i = 0; i < SIZE - 1; i++){

      if(NN_l[i] < NN_l[i+1]){

         TMP = NN_l[i+1];
         NN_l[i+1] = NN_l[i];
         NN_l[i] = TMP;
         SORTED = 0;
      }
   }
}while(SORTED == 0);


The above really simple method came out of my little black book of C programming.

If it can be made to work on CUDA - one day perhaps.
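
One way it might work - a sketch of my own, not something from the post - is to give each list its own thread, so the same do/while loop runs for every list in parallel:

Code: Select all
// Hedged sketch: one thread bubble-sorts one list. NN_l is assumed to hold
// numLists arrays of SIZE ints stored back to back on the device.
__global__ void bubble_sort_lists(int* NN_l, int numLists, int SIZE)
{
    int list = blockIdx.x * blockDim.x + threadIdx.x;
    if (list >= numLists) return;

    int* a = NN_l + list * SIZE;
    int SORTED;
    do{
        SORTED = 1;
        for(int i = 0; i < SIZE - 1; i++){
            if(a[i] < a[i+1]){            // same comparison as the CPU version
                int TMP = a[i+1];
                a[i+1] = a[i];
                a[i] = TMP;
                SORTED = 0;
            }
        }
    }while(SORTED == 0);
}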

I asked ChatGPT for some help, so here is what ChatGPT and I came up with...

I told ChatGPT all about my attempts at building a parallel sorting algorithm in CUDA. First I showed it my code from an attempt using merge sort, which broke down once it required too many threads. Then, after some more prompting, it showed me this one:

ChatGPT:
Code: Select all
// Swap two elements in an array
__device__ void swap(int* data, int a, int b) {
    int temp = data[a];
    data[a] = data[b];
    data[b] = temp;
}

// Partition function for quicksort
__device__ int partition(int* data, int low, int high) {
    int pivot = data[high];
    int i = low - 1;

    for (int j = low; j < high; j++) {
        if (data[j] <= pivot) {
            i++;
            swap(data, i, j);
        }
    }

    swap(data, i + 1, high);
    return i + 1;
}

// Parallel quicksort kernel
__global__ void quicksort(int* data, int low, int high) {
    if (low < high) {
        int pi = partition(data, low, high);

        quicksort<<<1, 1>>>(data, low, pi - 1);
        quicksort<<<1, 1>>>(data, pi + 1, high);
    }
}


(** I have since found the source used by ChatGPT:
https://github.com/harshvardhan33/Quick ... r/quickc.c)

ChatGPT called it quicksort, and it is also recursive. It worked really well, and I simply adapted it to run in parallel using the following kernel:

My bit:
Code: Select all
__global__ void launch_mysort(int* d_data, int numNeurons, int arraySize) {

    const int total = 8;                        // chunks per array

    cudaStream_t* s = new cudaStream_t[total];  // one stream per chunk

    // Phase 1: pre-sort each of the 8 chunks of every array, each chunk
    // in its own non-blocking stream.
    GPU_1D_KERN_LOOP(index, numNeurons * total){

        const int tid = (index / 1)          % numNeurons;   // neuron (which array)
        const int nid = (index / numNeurons) % total;        // chunk (which eighth)

        int start = nid * (arraySize / total);
        int end   = start + arraySize / total;

        cudaStreamCreateWithFlags(&s[nid], cudaStreamNonBlocking);
        quicksort<<<1, 1, 0, s[nid]>>>(d_data + tid * arraySize, start, end - 1);
        cudaStreamDestroy(s[nid]);
    }
    cudaDeviceSynchronize();

    // Phase 2: quicksort each whole (partially sorted) array.
    GPU_1D_KERN_LOOP(index, numNeurons){

        const int tid = (index / 1) % numNeurons;   // neuron

        quicksort<<<1, 1, 0, 0>>>(d_data + tid * arraySize, 0, arraySize - 1);
    }

    delete[] s;   // free the stream array allocated above
}


In my adaptation I do a further divide and conquer: each array is divided into 8 smaller chunks, these are sorted in parallel streams, and then in a second pass I run quicksort over each entire array. It works on arrays of up to 100 elements and for 1000 lists, all in parallel. Wow.
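
For reference, this is roughly how I would expect launch_mysort to be invoked from the host - the buffer layout, sizes and launch configuration below are my own guesses (and GPU_1D_KERN_LOOP is assumed to be a grid-stride loop macro), not something taken from the post:

Code: Select all
// Hedged sketch: host-side setup for launch_mysort. h_data is assumed to be
// a host buffer holding numNeurons arrays of arraySize ints, back to back.
int numNeurons = 1000, arraySize = 100;
size_t bytes = (size_t)numNeurons * arraySize * sizeof(int);

int* d_data = nullptr;
cudaMalloc(&d_data, bytes);
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

// One thread per (array, chunk) pair; the grid-stride loop covers the rest.
int threads = 256;
int blocks  = (numNeurons * 8 + threads - 1) / threads;
launch_mysort<<<blocks, threads>>>(d_data, numNeurons, arraySize);
cudaDeviceSynchronize();

cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_data);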

I haven't explored how much further it will go...

... OK, so basically it looks like the higher you go, the more streams you need.