In general, always try to isolate the problem first! Figure out what is the smallest, simplest code that still does something unexpected. Remember that you are not limited to use our makefiles and test scripts, but you can always develop e.g. your own unit tests.
Also make sure that you are using Maari-A computers. You do not need to be physically there, you can use ssh.
First try AddressSanitizer — see below for more information.
It might be a stack overflow. Unfortunately, a stack overflow is typically reported as a segmentation fault. In the classroom computers, the stack size limit is approx. 8MB. Do not allocate large arrays on the stack. If you need to allocate storage for megabytes of data, use the heap.
It might be a stack overflow (see above). Gnu Make accidentally changes the stack size limits so you might see a stack overflow (segmentation fault) when you run your code directly, yet it might work fine if you run it with make
.
You might be reading wrong parts of the memory. Try AddressSanitizer — see below for more information.
You might be reading memory that is not initialised. Try malloc debugging options — see below for more information.
Check for errors. Wrap all CUDA API calls in error-checking macros, and also check for errors after each kernel launch. For example, you can define a macro like this:
#define CHECK_CUDA_ERROR(call) do { \ cudaError_t result_ = (call); \ if (result_ != cudaSuccess) { \ fprintf(stderr, #call " failed: %s\n", \ cudaGetErrorString(result_)); \ exit(1); \ } \ } while(0)
And use it like this:
... CHECK_CUDA_ERROR(cudaMalloc((void**)&x, n)); CHECK_CUDA_ERROR(cudaMalloc((void**)&y, n)); ... kernel<<<dimGrid, dimBlock>>>(params); CHECK_CUDA_ERROR(cudaGetLastError()); ...
Please always read the task-specific hints first!
Make sure there is no other load on the machine that you use for benchmarking. Try uptime
and top
to see what is the current load and who is running what there.
Read the assembly code of the relevant part, see below for mode details.
Here is one example of a seemingly counterintuitive issue you may encounter: branch predication vs. prediction.
To enable AddressSanitizer, run make clean
and make DEBUG=2
— see README.md
for more details on make
parameters.
In our environment, AddressSanitizer output may be a bit hard to read; here is a quick tutorial that hopefully helps. I am using a buggy implementation of MF1 as an example. I have included only relevant parts of the messages:
$ make DEBUG=2 test ... ... ERROR: AddressSanitizer: stack-buffer-overflow on ... READ of size 4 at ... #0 0x4034a3 (.../mf1/mf-test+0x4034a3) ... Address ... is located at offset 36 ... This frame has 2 object(s): [32, 36) 'in' ...
So program mf-test crashed, because we were trying to read at an address that is outside the bounds of any array. Furthermore, it happened to be right after the end of array in, so probably we were simply trying to read beyond the end of array.
This happened in the code at address 0x4034a3, which we can translate to a source code line:
$ addr2line -e mf-test 0x4034a3 .../mf1/mf.cc:32
Apparently the problem was at line 32 of mf.cc, which in my case contained:
float v = in[i + j * nx];
Now it is easy to figure out precisely what went wrong; it turns out that the index i was one too large in certain cases.
In C and C++, memory allocation functions typically do not guarantee that memory is initialised with zeros. However, it is easy to forget to initialise newly allocated memory, and in many cases your program may accidentally work correctly as newly allocated memory often happens to contain all zeros.
To better detect bugs related to the use of uninitialised memory accesses on Linux, try to set the environment variable MALLOC_PERTURB_
, for example, as follows:
MALLOC_PERTURB_=191 make test
On Mac OS X you can try to set the environment variable MallocScribble
, for example, as follows:
MallocScribble=1 make test
In the Linux example, the value 191 is convenient as it will fill newly allocated memory with the value 64. If you interpret such values as doubles or floats, you will get reasonable values (other choices might accidentally hide some errors if you get e.g. very small values).
We will use MF1 here as an example. One part of its test suite is a small program called mf-test
which will call mf
function with some small inputs. Let us compile it with debug options and open it in the debugger, using GDB with its text user interface:
$ make clean $ make DEBUG=1 $ gdb -tui mf-test
Then use command b to set a breakpoint in function mf
and start the program with command r; it will execute until it reaches function mf
:
(gdb) b mf Breakpoint 1 at ...: file mf.cc, line ... (gdb) r Starting program: .../mf-test ... Breakpoint 1 ...
You should have a split-screen text user interface, with the upper half of the screen showing your source code and the lower half showing the GDB console. You can scroll the source code with arrow keys; highlighting shows which line will be executed next. You can now e.g. execute code line by line with command n:
(gdb) n (gdb) n ...
At any point you can also show the contents of any variable with command p:
(gdb) p ny ... = 1 (gdb) p in[0] ... = 0
We can continue execution until we reach the breakpoint again with c:
(gdb) c Continuing. Breakpoint 1 ...
Other highly useful commands include bt for showing the stack backtrace (who called us), and q for quitting GDB. See the manual for more information; the split-screen text user interface is documented in section TUI.
The makefiles provide two ways of outputting the assembly code produced by the compiler. For example, if you want to see the compiled version of mf.cc
, try the following commands:
make mf.asm1 make mf.asm2
Then open the file mf.asm1
or mf.asm2
in your text editor. Both of these try to produce somewhat readable assembly code, but it may depend on your luck which of these is more readable in your case.
Usually the assembly code is very long, and the most challenging part is finding the relevant part of it quickly. Here is one trick that you can use:
asm("#foo");
and asm("#bar");
. Typically, you would put these around the innermost for-loop that you want to study closer.make mf.asm1
.mf.asm1
in the text editor and search for the comments #foo
and #bar
.Another trick is to simply search for the relevant instruction. For example, you can search for vmulps
to find all places in which you are multiplying float8_t
vectors.