Developing and debugging QPU programs
Developing and debugging programs for the QPU cores seems difficult: the QPU is multicore, it sits outside the host CPU,
access to the core registers and internal VideoCore memory is not straightforward, and there is no well-known debugger.
Some debugging support must exist, but it is not documented (the documentation mentions a BRPT interrupt).
Nevertheless, the task is manageable with a proper methodology:
- a minimal register and memory dump on request,
- a simplified, incremental approach to programming,
- having a reference result.
A minimal register dump can be implemented as a macro, so it is easy to insert at any location in the QPU program.
If you need to examine the registers, add this macro at the relevant line in the source code.
Figure 6. Example of breakpoint inserted in the source code.
The macro implementation can look as follows.
Figure 7. Breakpoint macro implementation.
The body of the macro uses two helper macros: one dumps the file registers (dump_regs)
and the other writes data from VPM to SDRAM (store_dump). They also perform the waits needed to
synchronize with the end of the transfers. The full source code is available for download as part of the example QPU programming project (see below),
in the file qasm/breakpoint.qinc.
Figure 8. Breakpoint macro implementation (body).
The body of the macro (_breakpoint) writes all dumped registers (file registers and accumulators) to the VPM (Vertex Pipe Memory) and stores them in
the memory shared with the CPU. The buffer addresses are passed via uniforms and kept in two file registers with the symbolic names
dbg_rfile_dump and dbg_accus_dump.
After the necessary writes the macro exits the QPU program, so the CPU program is released and prints the contents of the buffers in a readable form.
An example of that log is presented below:
Figure 9. Example of QPU register dump - simple debugging technique.
NOTE
The register dump above is for one QPU core. It consists of two parts: the accumulators and the file register contents (the rb registers are only partially
visible in this figure). Each register is printed horizontally, with its 16 consecutive elements from left to right.
Accumulators are printed twice, as floating-point numbers and as hexadecimal 32-bit integers. The special-purpose accumulators r4 and r5
are omitted. For each file register, a symbolic name is assigned and the data format is specified in the application.
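The host-side printing described in this note can be sketched in C. This is only an illustration, not the code from the project: the function name format_register, the column widths and the QPU_LANES constant are assumptions made for this example. It formats one dumped register (16 elements) as a single row, either as floats or as 32-bit hexadecimal words.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define QPU_LANES 16  /* one QPU register holds 16 32-bit elements */

/* Hypothetical helper: format one dumped register as a single text row.
 * The row starts with the symbolic name, followed by 16 elements.
 * as_float != 0 prints the elements as floats, otherwise as hex words.
 * Returns the length the full row needs (excluding the terminator). */
static int format_register(char *out, size_t cap, const char *name,
                           const uint32_t lanes[QPU_LANES], int as_float)
{
    int n = snprintf(out, cap, "%-8s", name);
    for (int i = 0; i < QPU_LANES && n >= 0 && (size_t)n < cap; i++) {
        if (as_float) {
            float f;
            memcpy(&f, &lanes[i], sizeof f);  /* reinterpret the bits */
            n += snprintf(out + n, cap - (size_t)n, " %10.4f", f);
        } else {
            n += snprintf(out + n, cap - (size_t)n, " %08x",
                          (unsigned)lanes[i]);
        }
    }
    return n;
}
```

To print an accumulator twice, as in the dump above, call the function once with as_float set to 1 and once with 0, and pass each resulting row to puts().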
Here, an incremental approach to programming means that you start with a program for a single QPU and a small part of the data,
and incrementally scale up to multiple QPU cores and large data.
In the first phase you develop the program in assembler and examine the QPU-specific behavior.
If it is your first program for the QPU, you will surely see that some instructions work differently than you thought.
You will discover that some combinations of arguments are invalid, so the instructions are not as flexible as one could expect.
There are also dependencies between consecutive instructions that you need to guard against, otherwise the result will be incorrect
(e.g. a file register that was just written is not available in the next instruction).
This is also the debug phase, when you correct the code. It will not be very difficult if you use the register dump.
The code may be optimized here, especially by using the dual-issue architecture of the ALU.
In the next phase add a loop that processes the next batch of input data. The macro that dumps registers terminates the QPU program, so it is
not well suited to debugging loop behavior. Nevertheless, it is possible to work around that weakness. Duplicate the loop body, so that
the first and second loop iterations are in different locations in the code. That way you can examine loop behavior (i.e. check whether loop variables
change correctly from one iteration to the next).
Now it gets harder: you need to switch from a program for one QPU to a program for multiple QPUs. Start with a program for two QPUs and, when it works,
extend it to more QPUs. Keep in mind that you need one program for all QPUs. There will be variations, but the code should be
shared, because the QPU instruction cache has a limited size!
Start by assigning a different part of the data to each QPU. The new item in this program is synchronization between cores.
Learn how it is done in the example.
Here the results should be written to the VPM and, after all QPU cores have finished, to the memory shared with the CPU.
That way they will be available for verification.
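Assigning a different part of the data to each QPU can be prepared on the host side. The sketch below is an assumption about how such a split might be computed, not the code used in the example project. The structure qpu_slice and the function split_work are hypothetical names. Values like these would typically reach each QPU through its uniforms.

```c
#include <stddef.h>

/* Hypothetical per-QPU work description. */
struct qpu_slice {
    size_t first_vec;  /* index of the first 16-element vector */
    size_t num_vecs;   /* number of vectors this QPU processes */
};

/* Split total_vecs vectors across num_qpus cores as evenly as possible.
 * The first (total_vecs % num_qpus) cores receive one extra vector. */
static void split_work(size_t total_vecs, unsigned num_qpus,
                       struct qpu_slice *slices)
{
    size_t base  = total_vecs / num_qpus;
    size_t extra = total_vecs % num_qpus;
    size_t next  = 0;

    for (unsigned q = 0; q < num_qpus; q++) {
        slices[q].first_vec = next;
        slices[q].num_vecs  = base + (q < extra ? 1 : 0);
        next += slices[q].num_vecs;
    }
}
```

With this kind of split the same QPU code runs on every core and only the uniform values differ, which fits the requirement of one program for all QPUs.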
Reference results
Even for one QPU, debugging a program is more complex than for a regular host processor (like ARM).
The problem is the large data set and its vector nature. To overcome this problem, a reference program is recommended.
It can be written in C or C++ and should provide a verification pattern: some input data and the correct output
for the key stages of the calculation.
It is convenient to keep this pattern in textual form, so it can be easily compared with the QPU result printed in the register dump.
The comparison can be done in a spreadsheet (OpenOffice Calc, etc.), so the vector form remains visible.
Data can be transferred from the console log to the spreadsheet via the clipboard. An alternative is to embed the verification
in the application.
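Verification embedded in the application could look like the sketch below. It is an assumption, not the method used in the example project. The function name first_mismatch and the use of an absolute tolerance are choices made for this illustration only.

```c
#include <stddef.h>

/* Compare the QPU output with the reference, element by element.
 * Returns the index of the first element that differs by more than tol,
 * or -1 if all elements match. */
static long first_mismatch(const float *qpu, const float *ref,
                           size_t count, float tol)
{
    for (size_t i = 0; i < count; i++) {
        float d = qpu[i] - ref[i];
        if (d < 0.0f)
            d = -d;
        if (!(d <= tol))  /* written this way so that NaN is also reported */
            return (long)i;
    }
    return -1;
}
```

If the returned index is i, then i / 16 is the vector number and i % 16 is the element within that vector, which makes it easy to find the value in the register dump.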
Figure 10. Example of SIMD16 verification in OpenOffice Calc.
QPU project example
An example of QPU programming can be downloaded and used as a basis for your own experiments with QPU cores.
It calculates the Discrete Wavelet Transform using four QPU cores and SIMD16 parallel computing.
The code contains the methods described on this site. The breakpoint macro in particular can be useful to other programmers.
See README.txt in the package for more details.
This software was developed on the basis of hello_fft, with modifications and adaptations. The QPU code was written from scratch,
but it uses some concepts from the FFT example.
Please remember that this software is only a demonstration of the techniques and is provided "AS IS", without any warranty.
The licence terms can be found in the source files.
Download Eclipse project of Discrete Wavelet Transform on QPU cores (includes QPU source code).