QPU Programming

2018-01-04 Piotr Romaniuk, Ph.D.

Contents

Introduction
Variables
Subroutine call
Relocatable code
Synchronization of the QPUs
Instruction pipeline
Vector elements differentiation
Code optimization
Links

Introduction

Broadcom VideoCoreIV-3D is a graphics processor (GPU) that contains multiple cores called Quad Processing Units (QPUs). Programming these cores is a very specific task, because it requires dealing with multicore parallelism as well as SIMD hardware. The architecture of a single QPU is based on a dual-issue ALU that performs two operations concurrently on vectors of floating-point numbers.
In order to learn how to program QPUs, the following sources are recommended:

Broadcom VideoCoreIV-3D documentation [1],
source code examples (the most valuable is hello_fft [4], together with the site that describes it),
Addendum to the Broadcom documentation and accompanying sites by Marcel Muller [2], [3].

Browsing and analysing the hello_fft example is very valuable, because it uses techniques and programming tricks specific to the QPU.
The sections below explain them and show how they map known programming concepts onto the QPU.

Variables

Each QPU has two large sets of vector registers (64 registers in total). This resource is local memory, private to each QPU, that can be used for variable allocation. The variables are accessible from any part of the program running on a single QPU, so they resemble global variables in high-level programming.
It is good practice to keep a separate source file that assigns symbolic variable names to particular registers. For clarity, use meaningful names for variables, e.g. in_data, qpu_id, counter instead of a, b, x. In the program you can then use these names like variables. This assignment is hardcoded and static; only a few registers can be reused in a program, namely those that do not have to keep their values for the whole program execution time.
While the register number within a set can be chosen freely, the choice of set A or B is determined by how the variable is used in the program (more precisely, by the QPU instructions that access it). Remember that these two sets are not symmetric:

only set A has the packing/unpacking feature,
only a register from set A can be used as a jump destination,
registers from set B cannot be used together with a small immediate value in one instruction.

There are more constraints that influence the assignment of variables:

one instruction cannot read more than one register from each set,
the results of the two concurrent ALU paths must be written to different sets.

Here are examples of incorrect code that breaks the above constraints:

        add ra0, ra1, ra2 ;                # wrong, input registers ra1 and ra2 are from the same set
        add ra0, ra1, rb1 ; mov rb0, ra3   # wrong, input registers ra1 and ra3 are from the same set
        add ra0, ra1, rb1 ; mov ra2, r0    # wrong, results of the two ALU paths go to the same set (ra0, ra2)

NOTE: the semicolon separates the parts of the instruction for the two ALU paths.

Don't worry, such violations are reported when the program is assembled. Just change the assignment, or use accumulators and a separate load/store to the register file. The general rule is: use accumulators for local calculations and keep results in file registers.
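The two constraints listed above can be written down as a small check. The following is a minimal Python sketch, not part of any real assembler: the names regfile and check_instruction and the tuple layout are made up for illustration, and only these two rules are modelled (small immediates, packing and the other restrictions are ignored).

```python
import re

def regfile(operand):
    """Return 'a' or 'b' for a file register such as ra3 or rb17, else None."""
    match = re.fullmatch(r"r([ab])(\d+)", operand)
    return match.group(1) if match else None

def check_instruction(paths):
    """Check one instruction given as a list of (dest, [sources]) per ALU path.

    Returns a list of violated rules (empty if none). Hypothetical helper
    that models only the two constraints described in the text.
    """
    errors = []
    # Rule 1: at most one distinct register may be read from each set.
    for name in "ab":
        read = {src for _, sources in paths for src in sources
                if regfile(src) == name}
        if len(read) > 1:
            errors.append(f"reads {sorted(read)} from set {name.upper()}")
    # Rule 2: the two ALU paths must not both write to the same set.
    written = [regfile(dest) for dest, _ in paths if regfile(dest)]
    if len(written) == 2 and written[0] == written[1]:
        errors.append(f"both results written to set {written[0].upper()}")
    return errors

# add ra0, ra1, ra2
print(check_instruction([("ra0", ["ra1", "ra2"])]))
# add ra0, ra1, rb1 ; mov rb0, ra3
print(check_instruction([("ra0", ["ra1", "rb1"]), ("rb0", ["ra3"])]))
# add ra0, ra1, rb1 ; mov rb0, r0   (accepted: r0 is an accumulator)
print(check_instruction([("ra0", ["ra1", "rb1"]), ("rb0", ["r0"])]))
```

The first two calls report a violation of the read rule, and the third returns an empty list because the second path reads an accumulator instead of a file register.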

Subroutine call

There is no stack in the QPU hardware, but subroutine (function) calls are still possible. The solution is similar to the ARM link register idea, except that here the link register can be chosen freely.

        brr ra0, r:subroutine          # ra0 as link register
        nop
        nop
        nop
                                #--- the return address stored in ra0 points here
        ...
        
:subroutine                     # Subroutine start 
        
        ...
        bra -, ra0              # return from subroutine
        nop
        nop
        nop

NOTE: Three extra nops are added after each jump because the QPU instruction pipeline is not flushed when a jump is performed.

Nested calls are possible, but the call tree must be planned and hardcoded. A different link register must be used on each nesting level.
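Planning the call tree can be done by hand, but the rule is simple enough to sketch in code. The Python function below is an illustration only: assign_link_registers, the call-tree dictionary and the subroutine names are hypothetical. It gives each subroutine the link register of the deepest nesting level at which it is called, so that no two subroutines active in the same call chain share a register, and it rejects recursion, which cannot work without a stack.

```python
def assign_link_registers(calls, entry, registers):
    """Assign a link register to every subroutine of a static call tree.

    calls     -- dict: routine name -> list of routines it calls
    entry     -- name of the main program (needs no link register)
    registers -- link registers to use, one per nesting level
    """
    depth = {}

    def visit(name, level, active):
        if name in active:
            raise ValueError(f"recursive call to {name} needs a stack")
        # A routine called from several levels gets the deepest one.
        depth[name] = max(depth.get(name, 0), level)
        for callee in calls.get(name, []):
            visit(callee, level + 1, active | {name})

    for callee in calls.get(entry, []):
        visit(callee, 0, frozenset({entry}))
    if depth and max(depth.values()) >= len(registers):
        raise ValueError("call tree is deeper than the available link registers")
    return {name: registers[level] for name, level in depth.items()}

# Hypothetical program: main calls fft_pass and store, fft_pass calls twiddle and store.
calls = {"main": ["fft_pass", "store"], "fft_pass": ["twiddle", "store"]}
print(assign_link_registers(calls, "main", ["ra0", "ra1"]))
```

In this example store is called both from main and from fft_pass, so it is given the register of the deeper level (ra1), which differs from the register that holds the return address of fft_pass (ra0).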

Relocatable code

The programmer has no control over where the QPU code is loaded into memory. The application just requests memory, which the driver allocates from a special region, and the address is returned. Hence the requirement for relocatable code. If there is an absolute address in the code, there are two ways to correct it: patch it in the compiled code just before loading it into the allocated memory, or determine the address at run time in the QPU program. The latter method is more interesting because it does not require any modification of the code:

.set r_proc1, ra0              # r_proc1 symbolic name for register that will store address of proc1
.set r_link,  ra1              # r_link is symbolic name for register that stores return address

        brr r_proc1, r:1f      # this jump stores proc1 absolute address in r_proc1 register
        nop
        nop
        nop
:proc1                                
        # proc1 code here
:1      
        ...
        
        bra r_link, r_proc1     # call proc1 by absolute address
        nop                     # three delay slots, as after every branch
        nop
        nop

The code jumps over the proc1 subroutine and loads the address of this procedure into a file register. Note that the target of the jump is not important here; what matters is the link address stored by the branch. If there are more procedures that need to be called by absolute address, all their addresses can be determined in this way.
Why is an absolute call important? Because it is a kind of pointer to a function and can be different for each QPU. This is a way to build general common code with variation points specific to individual QPUs. On the basis of the QPU identification number (passed via uniforms) it is easy to select different versions of a procedure.
But why not prepare a different program for each QPU? Remember that the QPU instruction cache is relatively small, only 4 KB, and is shared by the 4 QPUs located in one slice. If their programs do not fit into that cache and the QPUs start to compete for it, the efficiency of the calculation degrades. That is why one program for all QPUs is better.
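The address arithmetic behind this trick can be checked with a few lines of Python. This is only a model under stated assumptions: an instruction is 8 bytes (64 bits) long and a branch stores the address of the instruction that follows its three delay slots, which is what the listing above relies on. The function names, the load addresses and the selection rule in select_procedure are made up for illustration.

```python
INSTRUCTION_SIZE = 8   # bytes; a QPU instruction is 64 bits wide
DELAY_SLOTS = 3        # instructions executed after a branch before it takes effect

def link_address(load_address, branch_index):
    """Absolute address that a branch stores in its link register.

    branch_index is the position of the branch in the program, counted
    in instructions. The link points just past the three delay slots.
    """
    return load_address + (branch_index + 1 + DELAY_SLOTS) * INSTRUCTION_SIZE

def select_procedure(qpu_id, procedures):
    """Pick a procedure address for a QPU (hypothetical selection rule)."""
    return procedures[qpu_id % len(procedures)]

# Layout as in the listing: brr at index 0, three nops, proc1 at index 4.
PROC1_INDEX = 4
for load_address in (0x1000, 0x3F000000):        # arbitrary example addresses
    r_proc1 = link_address(load_address, branch_index=0)
    # Wherever the code is loaded, r_proc1 holds the absolute address of proc1.
    assert r_proc1 == load_address + PROC1_INDEX * INSTRUCTION_SIZE
    print(hex(load_address), "->", hex(r_proc1))

# Two procedure versions, chosen by QPU identification number.
versions = [0x1020, 0x1080]
print([hex(select_procedure(qpu_id, versions)) for qpu_id in range(4)])
```

The model shows why no patching is needed: the stored address moves together with the load address, so the same binary works wherever the driver places it.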

Synchronization of the QPUs

The QPUs in one slice share some units, such as the Texture and Memory Lookup Unit (TMU) and the Special Function Unit (SFU). The Vertex Pipe Memory (VPM) is also shared, but among all QPUs.
The TMU is designed for access from multiple cores; it even has a separate queue for each QPU.
The SFU must be used with care: it returns the result in the third instruction after the argument is written, so it should not be accessed before then.
The VPM is the memory where the results of calculations from all QPUs are gathered. Access to the VPM is sequential and must be configured first. To keep this flexible, each QPU often sends its configuration just before writing its values. This is one situation where the QPUs must be synchronized, so that they do not interfere with each other. The second need for synchronization comes from writing the data back from the VPM to the memory shared with the CPU. This can be done only after all QPUs have written their complete data, and the write should be performed by only one QPU. One QPU can be designated (for example by its identification number) and made responsible for this task.
There are two types of synchronization objects:

one mutex,
16 counting semaphores (4-bit, each counting up to 15).

Access to the VPM can be guarded by the mutex, while the final wait for all QPUs can be handled by one counting semaphore (see the DWT example [5]). It is also possible to have finer control over synchronization (see hello_fft [4], where the QPUs are released in a specified order).
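The scheme described above (a mutex around VPM access, one counting semaphore for the final wait, one QPU doing the write-back) can be illustrated with ordinary threads. The Python sketch below is an analogy, not QPU code: threads stand in for QPUs, a list stands in for the VPM, and the choice of QPU 0 as the one that writes back, as well as all names, are assumptions made for the example.

```python
import threading

def run_qpus(num_qpus):
    """Run num_qpus threads that mimic QPUs sharing the VPM."""
    vpm = []                            # stands in for the shared VPM
    memory = []                         # stands in for memory shared with the CPU
    vpm_mutex = threading.Lock()        # plays the role of the single mutex
    finished = threading.Semaphore(0)   # plays the role of one counting semaphore

    def qpu_program(qpu_id):
        result = qpu_id * qpu_id        # placeholder for the real calculation
        with vpm_mutex:                 # configuration + write must not interleave
            vpm.append((qpu_id, result))
        if qpu_id != 0:
            finished.release()          # tell the master that this QPU is done
            return
        for _ in range(num_qpus - 1):   # master (QPU 0) waits for all the others
            finished.acquire()
        memory.extend(sorted(vpm))      # only the master writes the data back

    threads = [threading.Thread(target=qpu_program, args=(i,))
               for i in range(num_qpus)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return memory

print(run_qpus(4))   # [(0, 0), (1, 1), (2, 4), (3, 9)]
```

Whatever order the threads run in, the master copies the data only after every other thread has released the semaphore, so the write-back always sees the complete result.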

Instruction pipeline

The QPU has a four-element instruction pipeline. This becomes visible when a jump is taken, because the pipeline is not flushed:

        bra -, r_link      # return from some subroutine
        srel -, 7          # this instruction and the two following are executed before the return
        nop
        nop

This looks strange, but it means that the three instructions after a branch instruction are executed as if they were placed before it. This happens because they are already in the instruction pipeline when the jump is executed.
The existence of the pipeline is also visible when using file registers. A file register cannot be read in the instruction immediately following the one that wrote it, because the value is not ready yet. Similarly, an accumulator cannot be vector-rotated if it was written by the previous instruction.
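The effect of the three slots after a branch can be reproduced with a toy interpreter. The Python sketch below is a model for illustration only: the instruction set ("log", "jump", "end") is invented, and rejecting a branch inside the delay slots is a simplification of the model.

```python
DELAY_SLOTS = 3

def run(program):
    """Execute (op, arg) tuples; a jump takes effect after three more instructions."""
    trace = []
    pc = 0
    target = None      # destination of a branch that is in flight
    slots_left = 0     # delay-slot instructions still to be executed
    while True:
        op, arg = program[pc]
        if op == "end":
            return trace
        pc += 1
        if op == "log":
            trace.append(arg)
        if target is not None:
            if op == "jump":
                raise ValueError("toy model: no branch inside delay slots")
            slots_left -= 1
            if slots_left == 0:          # last delay slot done, now jump
                pc, target = target, None
        elif op == "jump":
            target, slots_left = arg, DELAY_SLOTS

program = [
    ("log", "before branch"),
    ("jump", 6),
    ("log", "delay slot 1"),
    ("log", "delay slot 2"),
    ("log", "delay slot 3"),
    ("log", "skipped"),
    ("log", "branch target"),
    ("end", None),
]
print(run(program))
# ['before branch', 'delay slot 1', 'delay slot 2', 'delay slot 3', 'branch target']
```

The three instructions placed after the jump appear in the trace before the branch target, while the instruction after them is never executed.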

Vector elements differentiation

The QPU works on vectors of elements. An element can be a floating-point number or an integer (let's omit packed values for now). Each register can store a 16-element vector.
When input data are processed by SIMD operations, they are processed in 16-element blocks. It may be necessary to treat the elements of such a vector differently. The basis for this is the elem_num register. Moving a value from this special register loads consecutive numbers into the elements:

        mov     r0, elem_num
        
 # result in r0 accumulator
 # r0=[ 0, 1, 2, 3,  4, 5, 6, 7,  8, 9, 10, 11,  12, 13, 14, 15]

The following code performs an operation selectively, on only 8 of the elements:

        mov       r1, 0.0           # moves 0.0 to all elements of the vector in r1 accumulator
        and.setf  -, elem_num, 0x8  # setting the flags according to AND operation on element number 
        mov.ifnz  r1, 1.0           # move 1.0 only to the 8 upper elements
        
 # result in r1 accumulator 
 # elem: 0    1    2    3     4    5    6    7     8    9    10  11    12   13   14    15
 # r1=[ 0.0, 0.0, 0.0, 0.0,  0.0, 0.0, 0.0, 0.0,  1.0, 1.0, 1.0, 1.0,  1.0, 1.0, 1.0, 1.0 ]

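Observe that the Z flags (zero flags) are cleared only for the 8 upper elements, because the binary representation of their numbers has a 1 in the same position as the mask (8 = 1000b). By changing the mask, other elements can be processed selectively, e.g. mask=1 distinguishes odd from even elements. A mask with several bits set clears the Z flag of every element whose number has any of those bits set: with mask=12 (1100b) the condition .ifnz covers elements 4 to 15, so the 4 lowest elements are selected with .ifz.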
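The effect of a given mask is easy to tabulate. The Python sketch below models only the flag logic of the listing above (an AND of the element number with the mask, followed by a conditional move); the function name select_elements is made up for illustration.

```python
def select_elements(mask, condition="ifnz"):
    """Return the element numbers written by a conditional move.

    Models 'and.setf -, elem_num, mask' followed by 'mov.<condition> ...':
    the Z flag of an element is set when (element number AND mask) is zero.
    """
    elem_num = range(16)
    z_flag = [(n & mask) == 0 for n in elem_num]
    if condition == "ifnz":
        return [n for n in elem_num if not z_flag[n]]
    if condition == "ifz":
        return [n for n in elem_num if z_flag[n]]
    raise ValueError("condition must be 'ifz' or 'ifnz'")

print(select_elements(0x8))          # [8, 9, 10, 11, 12, 13, 14, 15]
print(select_elements(1))            # [1, 3, 5, 7, 9, 11, 13, 15]
print(select_elements(12, "ifz"))    # [0, 1, 2, 3]
print(select_elements(12))           # elements 4 to 15
```

Trying a mask in this model before writing the QPU code is a quick way to confirm which elements a condition will touch.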

Code optimization

The general rules for QPU code optimization are as follows:

prepare one common program for all the QPUs you use (one that fits in the instruction cache),
avoid serial data dependencies,
avoid dependencies between QPUs,
avoid conflicts between arguments or targets within an instruction,
allocate variables properly in register file A or B,
use accumulators for calculations wherever possible,
use both ALU paths wherever possible,
use SIMD and the vector organization of data effectively,
use packed data types in order to save registers (for integer data),
place useful instructions in the three slots after a branch instruction (instead of nops).

In order to use the dual-issue ALU effectively, it is good to start from a simple version that uses only one ALU path (i.e. is not concurrent). When it is well tested and correct, try to 'pull up' instructions and merge them with the preceding ones. Continue until you reach a dependency on a previous result. Try to reorder and interleave calculations in order to minimize dependencies, and use accumulators, because they provide the result to the very next instruction, whereas file registers do not.

Links

[1] Broadcom VideoCore IV 3D, Architecture Reference Guide - manufacturer's documentation of VideoCore
[2] Addendum to the Broadcom VideoCore IV documentation, Marcel Muller
[3] VideoCoreIV 3D, QPU Instructions by Marcel Muller
[4] Hello_fft - FFT example on QPU cores. It is also located in Raspbian images in /opt/vc/src/hello_pi/hello_fft.
[5] Discrete Wavelet Transform example on QPU cores.
[6] SHA256 calculation on QPU cores.