## Study of combining GPU/FPGA accelerators for High-Performance Computing

Bruno Da Silva, An Braeken, Erik D'Hollander, Abdellah Touhafi, Jan G. Cornelis and Jan Lemeire

23/01/2013

ERASMUS

UNIVERSITEIT GENT

## **Overview**

- 1. HPC Desktop: CPUs + GPUs + FPGAs
- 2. Roofline performance of GPU/FPGA
- 3. Comparing GPU/FPGA for an image processing algorithm
- 4. GPU/FPGA collaboration: Pedestrian recognition
- 5. Conclusions







# 1. High Performance Desktop

Combining CPUs + GPUs + FPGAs







## HPC Desktop: CPU+FPGA+GPU

#### Architecture:

- CPU: Xeon E5506
- GPU: Tesla C2050
- FPGA: Pico Ex500 board with 2x Virtex6-LX240
- Toolchain:
  - Languages: C/C++ and OpenCL
  - High-Level Synthesis tools: ROCCC and Vivado HLS
  - APIs: NVIDIA libraries (GPU) and Pico Computing framework (FPGA)










#### CPU/FPGA/GPU Heterogeneous Architecture









































## 2. Roofline Performance of GPU/FPGAs

Adapting the roofline model for hardware accelerators







## Roofline model

[Figure: roofline plot of Peak Performance (Gops/s) against Computational Intensity (Ops/byte), showing the computational roofline and the bandwidth roofline]

$$\text{Performance} = \min(\text{I/O-dependent Perf.},\ \text{HW Peak Perf.})$$

$$\text{I/O-dependent Perf.} = \text{Ops/Byte} \times \text{Bytes/s} = CI \times BW$$

where **CI** = Computational Intensity and **BW** = I/O Bandwidth.







### Roofline model:



[Figure: roofline plot; x-axis: Computational Intensity (Ops/byte)]

- The CI of an algorithm determines whether its performance is I/O bound or compute bound.
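The roofline rule on the previous slide can be written as a small function. This is a minimal sketch: the function names and the choice of units (Gops/s, GB/s) are assumptions made for illustration, not part of the original slides.

```cpp
#include <algorithm>
#include <cassert>

// Attainable performance under the roofline model, in Gops/s.
// ci_ops_per_byte: computational intensity of the algorithm (ops/byte)
// bw_gbytes_per_s: I/O bandwidth of the platform (GB/s)
// peak_gops:       hardware peak performance (Gops/s)
double roofline_gops(double ci_ops_per_byte, double bw_gbytes_per_s, double peak_gops) {
    assert(ci_ops_per_byte >= 0.0 && bw_gbytes_per_s >= 0.0 && peak_gops >= 0.0);
    const double io_dependent = ci_ops_per_byte * bw_gbytes_per_s;  // CI x BW
    return std::min(io_dependent, peak_gops);
}

// True when the bandwidth roofline, not the computational roofline, limits performance.
bool is_io_bound(double ci_ops_per_byte, double bw_gbytes_per_s, double peak_gops) {
    return ci_ops_per_byte * bw_gbytes_per_s < peak_gops;
}
```

With a hypothetical platform of 4 GB/s and 100 Gops/s, an algorithm with CI = 0.5 ops/byte is I/O bound at 2 Gops/s, while one with CI = 50 ops/byte reaches the 100 Gops/s computational roofline.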







## Superimposed GPU/FPGA roofline models for 32-bit integer additions



# 3. GPUs vs FPGAs

Comparing an image processing algorithm







## Implementing a morphological operation

- Basic morphological operation: 3x3 erosion
- First implementation: handwritten code
- Second implementation: HLS compilers
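For reference, a 3x3 grayscale erosion replaces each pixel with the minimum of its 3x3 neighbourhood. The sketch below is a plain software version, not the handwritten or HLS code from the study; the function name and the clamped border handling are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 3x3 grayscale erosion on a row-major image.
// Border pixels reuse the nearest valid pixel (clamped coordinates),
// which is an assumption of this sketch.
std::vector<std::uint8_t> erode3x3(const std::vector<std::uint8_t>& in, int width, int height) {
    assert(width > 0 && height > 0);
    assert(in.size() == static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
    std::vector<std::uint8_t> out(in.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            std::uint8_t m = 255;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    const int yy = std::min(std::max(y + dy, 0), height - 1);
                    const int xx = std::min(std::max(x + dx, 0), width - 1);
                    m = std::min(m, in[static_cast<std::size_t>(yy) * width + xx]);
                }
            }
            out[static_cast<std::size_t>(y) * width + x] = m;
        }
    }
    return out;
}
```

Each output needs nine input reads and eight comparisons in this naive form, which is why the computational intensity is low unless neighbouring reads are reused.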







#### Roofline model of erosion: Handwritten VHDL code



## Implementing a morphological operation with ROCCC (Riverside Optimizing Configurable C Compiler)

- Why ROCCC?
  - Open Source
  - Stream oriented
  - Optimization to decrease memory accesses:
    - Smart Buffers
    - Partial Loop Unrolling







## **Smart Buffers**

The compiler analyses the array accesses, looking for data that can be reused between loop iterations, in order to reduce the number of off-chip memory accesses.
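The idea of reusing data between iterations can be illustrated with a one-dimensional 3-element minimum filter. This is a hand-written sketch of the principle, not the buffer that ROCCC generates; the names `min3_naive`, `min3_buffered` and the read counter are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct FilterResult {
    std::vector<int> out;  // minimum of each 3-element window
    std::size_t reads;     // number of reads from the input array ("off-chip" accesses)
};

// Naive version: every output re-reads its three inputs.
FilterResult min3_naive(const std::vector<int>& in) {
    FilterResult r{{}, 0};
    for (std::size_t i = 0; i + 2 < in.size(); ++i) {
        r.out.push_back(std::min(in[i], std::min(in[i + 1], in[i + 2])));
        r.reads += 3;
    }
    return r;
}

// Buffered version: a 3-element window keeps the two values shared with
// the previous iteration, so each input element is read exactly once.
FilterResult min3_buffered(const std::vector<int>& in) {
    FilterResult r{{}, 0};
    int w0 = 0, w1 = 0, w2 = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        w0 = w1;
        w1 = w2;
        w2 = in[i];  // the only new read in this iteration
        ++r.reads;
        if (i >= 2) r.out.push_back(std::min(w0, std::min(w1, w2)));
    }
    assert(r.reads == in.size());
    return r;
}
```

For long inputs the buffered version performs roughly one third of the reads for the same number of operations, so the computational intensity (ops/byte) rises by about a factor of three.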









## Partial loop unrolling



 In ROCCC, an output stream channel must be defined and the outputs must be multiplexed in time.
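Partial unrolling replicates the loop body a fixed number of times per iteration. The sketch below shows the transformation by hand on an element-wise minimum; the unroll factor of 4 and the function name are assumptions, and the sequential `push_back` calls only mimic the idea of several results sharing one output channel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kUnroll = 4;  // hypothetical unroll factor

// Element-wise minimum of two streams, with the loop body replicated kUnroll times.
std::vector<int> min_unrolled(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    std::vector<int> out;
    out.reserve(a.size());
    std::size_t i = 0;
    for (; i + kUnroll <= a.size(); i += kUnroll) {
        // The four results are independent, so hardware could compute them in parallel.
        const int r0 = std::min(a[i], b[i]);
        const int r1 = std::min(a[i + 1], b[i + 1]);
        const int r2 = std::min(a[i + 2], b[i + 2]);
        const int r3 = std::min(a[i + 3], b[i + 3]);
        // Single output channel: the results are emitted one after another.
        out.push_back(r0);
        out.push_back(r1);
        out.push_back(r2);
        out.push_back(r3);
    }
    // Remainder iterations when the length is not a multiple of kUnroll.
    for (; i < a.size(); ++i) out.push_back(std::min(a[i], b[i]));
    return out;
}
```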







## Impact of the smart buffers and the partial loop unrolling on the Computational Intensity









# Improving performance by increasing the Computational Intensity









#### Roofline model of erosion: Increasing the original Computational Intensity



## **Resource Consumption**



## Improving performance by increasing parallelism









#### Roofline Model: Handwritten VHDL code vs ROCCC



#### **GPU vs FPGA Performance**



## 4. Combining GPUs and FPGAs

Exploiting the best of both technologies







## Pedestrian detection: fastHOG

- Detecting people in images using Histograms of Oriented Gradients (HOG) and Support Vector Machines (SVM).
- Different steps:
  - Some are ideal for the GPU
  - Others are ideal for the FPGA
- An existing GPU version is called fastHOG.









Vrije Universiteit Brussel

#### fastHOG: HOG + SVM









#### Identifying the candidate to accelerate: Histogram + SVM Computation



NVIDIA Visual Profiler output (GeForce GTX 280, Context 1 (CUDA), Compute row), per-kernel share of compute time:

| Share | Count | Kernel                            |
|-------|-------|-----------------------------------|
| 36.8% | [31]  | linearSVMEvaluation()             |
| 30.4% | [31]  | computeBlockHistogramsWithGauss() |
| 12.9% | [31]  | convolutionColumnGPU4to2()        |
| 5.7%  | [31]  | convolutionRowGPU4()              |
| 2.7%  | [31]  | resizeFastBicubic4()              |
| 1.3%  | [31]  | normalizeBlockHistograms()        |
| 0.3%  | [33]  | memset (0)                        |
| 0.1%  | [1]   | uchar4tofloat4()                  |
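The two largest kernels in the profile, `linearSVMEvaluation()` and `computeBlockHistogramsWithGauss()`, together account for 67.2% of the GPU compute time, which is why they are the candidates for acceleration. As a rough guide, Amdahl's law gives the overall gain when only that fraction is accelerated. The sketch below treats the profiler shares as fractions of the total run time, which is a simplifying assumption (the profile covers GPU compute time only and ignores transfers).

```cpp
#include <cassert>

// Amdahl's law: overall speed-up when a fraction f of the run time
// is accelerated by a factor s and the rest is unchanged.
double amdahl_speedup(double f, double s) {
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}

// Shares taken from the profile above.
constexpr double kSvmShare = 0.368;        // linearSVMEvaluation()
constexpr double kHistogramShare = 0.304;  // computeBlockHistogramsWithGauss()
```

Under that assumption, even an arbitrarily fast implementation of both kernels cannot improve the compute phase by more than about 3x (1 / (1 - 0.672)).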


















#### Identifying the candidate to accelerate: Histogram Computation and Normalization











## New dataflow computing HOG on the FPGA









## Adapting the code for FPGAs

- Mathematical types, functions and footprint:
  - Floating point  $\rightarrow$  Fixed point
  - Adapt mod, floor, divisions and other computationally expensive operations
  - Adjust the data bit-width: performance vs accuracy
- Memory use and reuse:
  - Rewrite the code to reduce memory accesses
- Vivado HLS directives:
  - Pipelining, stream interface, partial loop unrolling
  - Impact of the clock definition on the design
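One common way to adapt division, mod and floor for hardware is to restrict divisors to powers of two, so that they become shifts and masks. The sketch below assumes a hypothetical block size of 8 and unsigned operands; it is not taken from the adapted HOG code.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned kCellShift = 3;                     // hypothetical size of 8 = 2^3
constexpr std::uint32_t kCellSize = 1u << kCellShift;

// x / 8 as a right shift: no divider is needed.
std::uint32_t cell_index(std::uint32_t x) { return x >> kCellShift; }

// x % 8 as a bit mask.
std::uint32_t offset_in_cell(std::uint32_t x) { return x & (kCellSize - 1u); }

// floor() of a non-negative fixed-point value with frac_bits fractional bits:
// dropping the fractional bits is enough.
std::uint32_t fixed_floor(std::uint32_t value, unsigned frac_bits) {
    assert(frac_bits < 32);
    return value >> frac_bits;
}
```

These identities hold only for unsigned (or non-negative) values; signed operands need extra care because shifting and truncating division round differently for negative numbers.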







## Adapting the code working @ 125 MHz

#### Floating point

Latency: 160,128,522 clock cycles

Resource consumption:

|                 | BRAM 18K | DSP48 | FF     | LUT    | SLICE |
|-----------------|----------|-------|--------|--------|-------|
| Total           | 17       | 50    | 6048   | 6930   | 0     |
| Available       | 832      | 768   | 301440 | 150720 | 37680 |
| Utilization (%) | 2        | 6     | 2      | 4      | 0     |

#### Fixed point

Latency: 61,055,286 clock cycles

Resource consumption:

|                 | BRAM 18K | DSP48 | FF     | LUT    | SLICE |
|-----------------|----------|-------|--------|--------|-------|
| Total           | 16       | 62    | 6071   | 11821  | 0     |
| Available       | 832      | 768   | 301440 | 150720 | 37680 |
| Utilization (%) | 1        | 8     | 2      | 7      | 0     |
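The latencies on this slide are given in clock cycles; at the 125 MHz clock from the slide title they convert to wall-clock time as sketched below. The helper name is hypothetical, and the assignment of the two cycle counts to the floating-point and fixed-point versions follows the order in which they appear on the slide, which is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Execution time in milliseconds for a latency given in clock cycles.
double cycles_to_ms(std::uint64_t cycles, double clock_hz) {
    assert(clock_hz > 0.0);
    return static_cast<double>(cycles) / clock_hz * 1e3;
}

constexpr double kClockHz = 125e6;                    // 125 MHz, from the slide title
constexpr std::uint64_t kFloatCycles = 160128522ULL;  // assumed: floating-point version
constexpr std::uint64_t kFixedCycles = 61055286ULL;   // assumed: fixed-point version
```

With these values the two latencies correspond to roughly 1.28 s and 0.49 s, a ratio of about 2.6.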







## Adapting the code

#### Floating point

- Operation latency: 12 clock cycles

#### Fixed point

- Operation latency: 5 clock cycles

It is important to avoid division, mod and floor operations because of their high latency and resource consumption.







## Adapting the data bit-width

- Knowledge of the input data range is required.
- The input gradients are composed of two parameters:
  - Magnitude: in the range [0, 17] $\rightarrow$ 5 bits for the integer part
- The fixed-point conversion decreases both the resource consumption and the accuracy.
- However, it allows more blocks to be placed in parallel and the I/O bandwidth to be exploited.
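The bit-width choice follows directly from the data range: values up to 17 need 5 integer bits because 2^4 = 16 < 17 < 2^5 = 32. The sketch below computes that width and quantises a value to unsigned fixed point. The function names and the 11 fractional bits used in the usage note are assumptions; the slides give only the integer part.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Number of integer bits needed to represent unsigned values up to max_value.
unsigned integer_bits(std::uint32_t max_value) {
    unsigned bits = 0;
    while (max_value > 0) {
        ++bits;
        max_value >>= 1;
    }
    return bits;
}

// Convert a value in [0, 32) to unsigned fixed point with frac_bits
// fractional bits, rounding to the nearest representable value.
std::uint32_t to_fixed(double value, unsigned frac_bits) {
    assert(value >= 0.0 && value < 32.0 && frac_bits <= 16);
    return static_cast<std::uint32_t>(std::lround(value * static_cast<double>(1u << frac_bits)));
}

// Convert back to floating point, e.g. to inspect the quantisation error.
double from_fixed(std::uint32_t fixed, unsigned frac_bits) {
    assert(frac_bits <= 16);
    return static_cast<double>(fixed) / static_cast<double>(1u << frac_bits);
}
```

With a hypothetical 16-bit word (5 integer bits and 11 fractional bits), the rounding error is at most 2^-12. Fewer fractional bits save resources at the cost of accuracy, which is the trade-off named on the slide.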







## Adapted HOG code: Main function



#### Impact of the directives: Pipelining the code



Pipelining the full code is not always the best option.







#### Impact of the directives: Partial pipelining







Latency: 7,954,266 clock cycles

Resource consumption:

|                 | BRAM 18K | DSP48 | FF     | LUT    | SLICE |
|-----------------|----------|-------|--------|--------|-------|
| Total           | 16       | 65    | 6078   | 11664  | 0     |
| Available       | 832      | 768   | 301440 | 150720 | 37680 |
| Utilization (%) | 1        | 8     | 2      | 7      | 0     |



## Impact of the directives: Partial Loop Unrolling

Latency: 8,069,910 clock cycles

Resource consumption:

|                 | BRAM 18K | DSP48 | FF     | LUT    | SLICE |
|-----------------|----------|-------|--------|--------|-------|
| Total           | 16       | 65    | 6137   | 11735  | 0     |
| Available       | 832      | 768   | 301440 | 150720 | 37680 |
| Utilization (%) | 1        | 8     | 2      | 7      | 0     |





FullStreamHOG





#### Impact of the clock definition on Vivado HLS

- Once the system clock has been defined (mandatory for synthesis), the compiler focuses all its effort on achieving the expected frequency.
- That means a high resource consumption and an extremely low latency.
- This is especially true when several directives, such as pipelining, are applied.
- The best strategy is to define the operational frequency from the beginning.







## HOG implementation

Pipeline solution including the normalization part



Latency drastically reduced (about 35x):

- Original design: 2.1 s
- Final design: 61 ms







#### Execution on the FPGA: Streaming + Pipelining










## Improving performance by increasing the parallelism up to the maximum resources available on the FPGA









#### Comparing FPGA/GPUs HOG computation

Speed-up of the FPGA(s) over each GPU:

| Iterations | 1xFPGA vs Tesla C2050 | 2xFPGA vs Tesla C2050 | 1xFPGA vs GeForce GTX280 | 2xFPGA vs GeForce GTX280 |
|------------|-----------------------|-----------------------|--------------------------|--------------------------|
| 92x69      | 37%                   | 68%                   | 85%                      | 93%                      |
| 20x15      | 17%                   | 59%                   | 77%                      | 88%                      |
| Average    | x1.61                 | x3.22                 | x6.45                    | x13.69                   |
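The table mixes percentages and multiplication factors. If the percentages are read as relative reductions in execution time (an assumption, since the slide does not define them), they convert to speed-up factors as sketched below; the function name is hypothetical.

```cpp
#include <cassert>

// Speed-up factor equivalent to a relative reduction r of the execution time,
// with 0 <= r < 1. Example: halving the time (r = 0.5) is a 2x speed-up.
double speedup_from_reduction(double r) {
    assert(r >= 0.0 && r < 1.0);
    return 1.0 / (1.0 - r);
}
```

Under this reading, 37% corresponds to about x1.59 and 93% to about x14.3, the same order as the x1.61 and x13.69 in the Average row. The match is only approximate, so the interpretation remains a guess.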









## Impact of the PCIe protocol overhead



[Figure: impact of the PCIe protocol overhead; x-axis: Size of data (bytes)]
















## Performance combining GPU/FPGA

|                      | GPU*  | GPU* + FPGA (16 HOGs) | GPU* + 2xFPGA (16 HOGs) |
|----------------------|-------|-----------------------|-------------------------|
| 92x69                | 6547  | 7260                  | 5198                    |
| 20x15                | 467   | 1846                  | 1653                    |
| Total execution [ms] | 73066 | 108738                | 86208                   |

\* Tesla C2050







## So, when to combine?

- In our case, when the FPGA implementation speeds up the design by more than 60% compared to the GPU.
- And when the amount of data to transfer is large enough to reach the maximum PCIe bandwidth.
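The second condition can be illustrated with a simple transfer model in which every transfer pays a fixed overhead before data moves at the peak rate. This is a generic sketch, not the measured PCIe behaviour from the study; the function name, the overhead and the peak bandwidth used in the usage note are all hypothetical.

```cpp
#include <cassert>

// Effective bandwidth (bytes/s) of one transfer that pays a fixed
// per-transfer overhead and then moves data at the peak rate.
double effective_bandwidth(double bytes, double overhead_s, double peak_bytes_per_s) {
    assert(bytes > 0.0 && overhead_s >= 0.0 && peak_bytes_per_s > 0.0);
    const double transfer_time = overhead_s + bytes / peak_bytes_per_s;
    return bytes / transfer_time;
}
```

With a hypothetical 10 µs overhead and a 4 GB/s peak, a 40 kB transfer reaches only half of the peak, while a 64-byte transfer stays below 1% of it. Small, frequent transfers between GPU and FPGA are therefore dominated by overhead.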







### Exploiting our modular Pico Board: HOG + SVM









## 5. Conclusions







## Conclusions of our HLS experience

- For algorithms with low CI, partial loop unrolling and other optimizations (smart buffers) that increase the CI yield higher performance.
- For algorithms with high CI, the most important factor is the resource consumption, which determines the maximum realizable parallelism.
- In both cases, to exploit the FPGA's features it is recommended to pipeline the stages and to stream the I/O.
- HLS tools allow further and better tuning than handwritten code.
- Still, the code must be rearranged to maximize performance.





