### HLS 4 HPC: Some Missing Links

George A. Constantinides<sup>1</sup>

Imperial College London

January 22, 2013

1Thanks to: Dr Sam Bayliss, Dr David Boland

George A. Constantinides HLS 4 HPC: Some Missing Links

### Outline

- A simple HLS example
- What are the missing links?
  - Part 1: Representing real numbers
    - Benefits
    - Number systems
    - Precision
  - Part 2: Dealing with external memory
    - The central role of SDRAM
    - Using the polyhedral model
- Some open questions

```
#define N 1024
#define L 2
void main()
{
  double x[ N ], y[ N - 1 ];
  double k[L] = \{0.12, 0.2\};
  for( int i = 0; i < N - 1; i++ ) {</pre>
    y[i] = 0.0;
    for( int j = 0; j < L; j++ )</pre>
      v[i] += k[j] * x[i - j + L - 1];
  }
}
```



- Structural concerns: parallelization, memory subsystem design. External memory?
- Numerical concerns: is double the right number representation?

## Part I

Numerics

George A. Constantinides HLS 4 HPC: Some Missing Links

・ロト ・日下・ ・日下

문 🛌 문

## Missing Link 1: Representing the Reals

- Hardware represents numerical data as bit strings.
- A bit string of length *n* can represent at most 2<sup>*n*</sup> distinct values.
- Representations vary in how these strings are mapped onto reals:  $f : \{0,1\}^n \to \mathbb{R}$ .
  - Floating-point (s,m,e),
  - Fixed-point (s,m)/2c/1c,
  - LNS (s,e),
  - RNS (m,m,...), etc.
- We have always cared about using 'enough' precision. Now we should care about using 'just enough' precision.
  - Lower precision  $\Rightarrow$  smaller area units, less bandwidth  $\Rightarrow$  more units, more transfer  $\Rightarrow$  faster.

### Hardware Implications



(from http://www.cise.ufl.edu/~mssz/CompOrg/CDA-arith.html)

・ロト ・回ト ・ヨト ・ヨト

### Numerics Matter!





George A. Constantinides HLS 4 HPC: Some Missing Links

#### The Central Problem

 $\max_{\substack{s,p\\ subject to : \forall i. Pre(i) \rightarrow Out(s, p, i) \in Accept(i).}$ 

(1)

- Here *s* denotes circuit structural parameters, and *p* denotes circuit precisions. Note plural we are in a parallel environment, so specialise!
- *i* denotes inputs, *Pre(i)* denotes precondition predicate, *Out(s, p, i)* denotes output, *Accept(i)* denotes acceptable outputs.
- For simplicity, let's restrict ourselves to fixed *s*.

# The General Setting

Toy Problem

A toy sample problem would be

$$\begin{array}{l} \max_{p \in \mathbb{Z}_{++}} \operatorname{perf}(p) \\ \text{subject to:} \quad \forall (a, b) \in \mathbb{R}^2. m \le a \le M \land m \le b \le M \rightarrow \quad (2) \\ \forall \delta \in \mathbb{R}. |\delta| \le 2^{-p} \rightarrow |\frac{\delta}{a+b}| < \epsilon. \end{array}$$

This corresponds to a fixed-point addition of a and b with a condition on the relative error of the result.

Reformulation of Toy Problem

$$\max_{p \in \mathbb{Z}_{++}} \operatorname{perf}(p)$$
  
subject to:  $\neg \exists (a, b, \delta) \in \mathbb{R}^3 . m \le a \le M \land m \le b \le M \land$ 
$$-2^{-p} \le \delta \le 2^{-p} \land \delta^2 - \epsilon^2 (a+b)^2 \ge 0.$$
(3)

Notice that the feasibility condition defines a semi-algebraic set. Applying a standard quantifier elimination gives the following simplified problem, assuming m < M,  $\epsilon > 0$ .

#### Simplified Problem

# **Objective Function**

- In general, 1/*perf* can be approximated as a (multi-variate) polynomial.
- For our toy example, if perf is monotonically decreasing in p, then  $p^* = \lfloor -\log_2 \epsilon |m| \rfloor$  for m > 0 and M > 0 and  $p^* = \lfloor -\log_2 \epsilon |M| \rfloor$  for M < 0.
- One can imagine that with multivariate *p* (due to parallel implementation) and more complex examples, things get difficult!

We have been focusing on:

- The feasibility problem.
- The algorithm is defined by fixed combinations of the four basic operators \*, +, /, -.
- Accept(i) is defined by f(i) + [I, u].
- Convex relaxations leading to computationally tractable formulations.
- Proof techniques resulting in simple machine checkable proofs.

# Part II

Numerics

George A. Constantinides HLS 4 HPC: Some Missing Links

æ

≣ ।•

・ロト ・日下 ・ 日下



イロト イポト イヨト イヨト



イロト イポト イヨト イヨト



George A. Constantinides HLS 4 HPC: Some Missing Links

Image: Image:

< ∃ >

- ₹ 🖬 🕨

# **SDRAM** Memory Characteristics

- Advantage : SDRAM is **cheap**, high capacity, commodity memory
- Disadvantage : Internal device architecture means high latency (20-30 clock cycles in FPGA)
- Physical device structure imposes timing constraints
  - Explicit 'activation' of a row before data is read from it
  - Explicit 'precharge' of a row before another row is 'activated'
- Significant bandwidth difference improvements possible when reordering external memory transactions
  - Typically  $> 5 \times$  bandwidth difference between optimal and worst-case performance

### Use the Polyhedral Model

#### Affine Loop Nests



イロト イポト イヨト イヨト

 Augmented matrix captures each loop iteration and its associated SDRAM row (r) and burst (u)





### What was the impact on bandwidth efficiency?

**Bus Utilization** 



#### Numerics

- Natural and scalable methods to deal with iterative algorithms where loop conditions depend on rounded variables.
- Techniques to deal with other source of error, *e.g.* overclocking.
- Memory
  - Effective commercial tool integration
  - Optimizing SDRAM refresh for latency guarantees
  - Optimal bank partitioning