Overview

Implement a parallel sorting algorithm that outperforms single-threaded `std::sort`.

Implementation

We will implement a benchmark tool `so-test` that performs the following steps:

• Generates test data.
• Sorts it with your sorting algorithm, and prints the running time.
• Sorts it with `std::sort`, and prints the running time.
• Verifies that the results are identical.

Usage

The benchmark tool can be used as follows:

```    ./so-test N
```

where N is the number of elements in the test data. The tool will run 5 tests, all with the same number of elements but with different input data distributions: 0 = large random elements, 1 = small random elements, 2 = constant input, 3 = increasing values, and 4 = decreasing values.

If you run, for example

```    ./so-test 100000000
```

you will get an output that looks similar to the following:

```    so  0  100000000  2.135  7.957
so  1  100000000  1.574  2.325
so  2  100000000  1.357  1.735
so  3  100000000  1.079  1.713
so  4  100000000  1.136  1.233
```

Here column 2 identifies the input data distribution; the first row is the most interesting case, with large random elements. Column 4 is the running time of your implementation and column 5 is the running time of `std::sort`.

Template

You can find a working implementation in our shared code repository, in the subdirectories `so*`. There is one directory per task.

You only need to implement the subroutine `psort()` in file `so*/so.cc`. See `so*/so.h` for the description of the interface.

Once you have implemented the function correctly, you should be able to run `make` to compile your code, `make test` to check that it works correctly (at least for some test cases) and `make benchmark` to see how well it performs.

Detailed specification

You need to implement the following function:

```    void psort(int n, data_t* data)
```

Here `n` is the size of the input array `data`. All input elements are of type `data_t`, i.e., `unsigned long long`, i.e., 64-bit unsigned integers.

Correct output

Your function should have the same behaviour as:

```    std::sort(data, data+n);
```

That is, you need to sort this in place: modify the input array so that it will be in a sorted order. You are free to allocate some additional memory to keep temporary data if needed.

Rules

In your implementation, you are permitted and encouraged to use all single-threaded features of the C++ standard library. In particular, you can freely use the single-threaded `std::sort` as a subroutine in your code, as well as all other features of the algorithms library.

For multi-threading, you can use the basic primitives of OpenMP, as usual.

Subdirectory: so1, file to edit: so1/so.cc.

Implement an efficient parallel sorting algorithm, using the basic idea of merge sort.

If you run `make benchmark2` and you get running times in the following ballpark, you can be happy:

```    so  0  100000000  2.135  7.957
so  1  100000000  1.574  2.325
so  2  100000000  1.357  1.735
so  3  100000000  1.079  1.713
so  4  100000000  1.136  1.233
```

Subdirectory: so2, file to edit: so2/so.cc.

Implement an efficient parallel sorting algorithm, using the basic idea of quicksort.

If you run `make benchmark2` and you get running times in the following ballpark, you can be happy:

```    so  0  100000000  2.161  7.973
so  1  100000000  0.772  2.313
so  2  100000000  0.118  1.805
so  3  100000000  0.686  1.750
so  4  100000000  1.483  1.273
```

Subdirectory: so3, file to edit: so3/so.cu.

Use CUDA to implement an efficient algorithm for sorting on GPU, using the basic idea of merge sort.

Subdirectory: so4, file to edit: so4/so.cu.

Use CUDA to implement an efficient algorithm for sorting on GPU, using the basic idea of quicksort.

Hints

If you have p threads available, you can split your input data in p blocks. Sort each block with e.g. `std::sort`, simultaneously in parallel. Then merge the results. For example, first merge p/2 pairs of blocks in parallel so that you are left with p/2 sorted blocks, and repeat this until you are left with 1 sorted block.

• `#pragma omp parallel num_threads(p)`
• `omp_get_max_threads()`
• `omp_get_thread_num()`

• When you are left with only 1 thread or few elements, resort to `std::sort`.
• `#pragma omp single`
• `#pragma omp task`