Vector elements differentiation
The QPU works on vectors of elements. Each element can be a floating-point number or an integer (let's omit packed
values for now). Each register stores a 16-element vector.
When input data are processed by a SIMD operation, they are processed in 16-element blocks.
Sometimes the elements of such a vector have to be told apart.
The elem_num register is the basis for this.
Moving a value from this special register loads consecutive numbers into the elements:
mov r0, elem_num
# result in r0 accumulator
# r0=[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
To perform an operation on only 8 of the elements:
mov r1, 0.0 # moves 0.0 to all elements of the vector in r1 accumulator
and.setf -, elem_num, 0x8 # setting the flags according to AND operation on element number
mov.ifnz r1, 1.0 # move 1.0 only to 8 higher elements
# result in r1 accumulator
# elem: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
# r1=[ 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0 ]
Observe that the Z flags (zero flags) are cleared only for the 8 upper elements, because the binary representation of
their element numbers has a 1 in the same position as the mask (8 = 1000b).
By changing the mask, other elements may be selected, e.g. mask=1 differentiates between
odd and even elements, and mask=12 (1100b) clears the Z flag for elements 4 to 15, leaving it set only for the 4 lowest elements.
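The effect of a given mask can be checked on the host before writing the assembly. The following is a minimal Python sketch, not QPU code: it assumes only that `and.setf` clears the Z flag of an element exactly when `elem_num & mask` is nonzero, as described above. The function names `nonzero_lanes` and `conditional_move` are made up for this illustration.

```python
# Toy host-side model of `and.setf -, elem_num, mask`; this is not QPU code.
ELEM_NUM = list(range(16))  # what `mov r0, elem_num` loads into the 16 elements


def nonzero_lanes(mask):
    """Elements whose Z flag is cleared, i.e. those an .ifnz instruction updates."""
    return [n for n in ELEM_NUM if (n & mask) != 0]


def conditional_move(dest, value, mask):
    """Model of `mov.ifnz dest, value` executed after the and.setf above."""
    selected = set(nonzero_lanes(mask))
    return [value if n in selected else old for n, old in enumerate(dest)]


# Reproduces the r1 example: 0.0 everywhere, then 1.0 where (elem_num & 8) != 0.
r1 = conditional_move([0.0] * 16, 1.0, 0x8)
print(r1)

# Which elements each mask selects with .ifnz.
for mask in (8, 1, 12):
    print(mask, nonzero_lanes(mask))
```

Elements not listed for a mask keep their Z flag set, so they are the ones an `.ifz` instruction would update.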
Code optimization
The general rules for QPU code optimization may be as follows:
- prepare one common program for all the QPUs you use (one that fits in the instruction cache),
- avoid serial dependencies of data,
- avoid dependencies between QPUs,
- avoid conflicts of arguments or targets within an instruction,
- allocate variables properly between register files A and B,
- use accumulators often for calculations,
- use both ALU paths wherever possible,
- effectively use SIMD and vector data organization,
- use packed data types in order to save registers (for integer data),
- make use of the instructions that are already in the pipeline, and therefore still executed, after a branch instruction.
To use the dual-issue ALU effectively, it is good to start from
a simple version that uses only one ALU path (not concurrent). When it is well tested and correct,
try to 'pull up'
instructions and merge them with the preceding instructions.
Keep doing so until you reach a dependency on a previous result. Try to reorder and interleave
calculations in order to minimize dependencies, and use accumulators: they provide the result to the very next instruction,
whereas register-file registers do not.
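The 'pull-up' step can be illustrated with a toy model in Python. This is only a sketch under simplifying assumptions: each operation needs one of two ALU paths (named "add" and "mul" here), and an operation may be merged into the preceding instruction only if it uses the other path and does not depend on that instruction's result. The `Op` class and `pull_up` function are hypothetical names for this illustration; the model ignores register-file port limits and every other hardware constraint.

```python
from dataclasses import dataclass


@dataclass
class Op:
    unit: str    # "add" or "mul": the ALU path the operation needs
    dest: str    # register written
    srcs: tuple  # registers read


def pull_up(ops):
    """Greedily merge each op into the previous instruction when the toy rules allow it."""
    instructions = []  # each instruction holds one or two Ops
    for op in ops:
        if instructions and len(instructions[-1]) == 1:
            prev = instructions[-1][0]
            independent = prev.dest not in op.srcs and prev.dest != op.dest
            if prev.unit != op.unit and independent:
                instructions[-1].append(op)  # share the instruction
                continue
        instructions.append([op])  # dependency or same path: new instruction
    return instructions


# Single-path version: one operation per instruction.
serial = [
    Op("add", "r0", ("ra0", "rb0")),
    Op("mul", "r1", ("r2", "r3")),  # independent of r0: can be pulled up
    Op("add", "r2", ("r0", "r1")),  # previous instruction is full: new instruction
    Op("mul", "r3", ("r2", "r2")),  # reads r2 from the previous op: stays separate
]
packed = pull_up(serial)
print(len(serial), "->", len(packed), "instructions")
```

In this example only the second operation can be pulled up. Reordering the calculation so that independent operations sit next to each other is what gives the merge step more room.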
Links
[1] Broadcom VideoCore IV 3D, Architecture Reference Guide - manufacturer's documentation of VideoCore
[2] Addendum to the Broadcom VideoCore IV documentation, Marcel Muller
[3] VideoCoreIV 3D, QPU Instructions by Marcel Muller
[4] Hello_fft - FFT example on QPU cores.
It is also located in Raspbian images in /opt/vc/src/hello_pi/hello_fft.
[5] Discrete Wavelet Transform example on QPU cores.
[6] SHA256 calculation on QPU cores.