------

Efficient SIMD technique with parallel Max-Log-MAP Algorithm for Turbo Decoders

------

Key words: Slice Turbo code, DSP, SIMD, decoder

------

Abstract:
This paper proposes a SIMD technique to implement efficiently the Max-Log-MAP algorithm on a DSP [2]. The Max-Log-MAP algorithm is the key component of a turbo-decoder.It performs the soft decoding of a convolutional code using two finite Viterbi recursions on a trellis: the forward and the backward recursions. The basic computations involved in a trellis stage computation are Addition-Comparison-Selection (ACS) operations. Loo et al. proposed to speed up the decoding throughput of the Max-Log-MAP algorithm on a DPS by using its SIMD functionality [3]. For DSP offering SIMD operations, parallel (vectorized) computations can be achieved simultaneously, on a set of data called a vector. Their SIMD technique performs for a given trellis stage several ACS operations. This method improves the decoding throughput but is not optimal. In fact, because of the structure of the trellis, the state metrics needs to be reordered into a single vector at the end of each trellis stage. This operation requires 2 instructions and reduces the efficiency of the SIMD technique. The use of this technique is efficient when all operations can be vectorized, i.e. when there is no need to pack/unpack data into a single vector to execute a SIMD operation. The new and original method proposed in the present paper aims to overcome this problem by vectorizing all operations involved in the Max-Log-MAP decoding for turbo decoders. It consists in using SIMD instructions to perform several independent trellises in parallel. Hence, each variable in the Max-Log-MAP algorithm is vectorized as a vector of m components: each component of the vector is associated to one of the m trellises decoded. This trellis parallelization is made possible by the use of an adapted two-dimensional turbo code proposed in [4]. This adapted turbo-code is constructed, in each dimension, as the parallel concatenation of m independent Circular Recursive Systematic Convolutional (CRSC) codes, called slices. The m slices are then decoded in parallel with a Max-Log-MAP using vectorized operations. Moreover, an appropriate parallel interleaver is also proposed in [4]. Thanks to its structure, the interleaver maintains the vector organization between the natural and the interleaved order. Thus, there is no need to unpack and repack the vectors between half-iterations. It has also been shown that this adapted turbo code introduces no performance degradation compared to a conventional turbo code. In the present paper, the adapted turbo-code is described with its interleaver properties. After a brief description of the Max-Log-MAP algorithm, the key architectural features for implementing the algorithm with SIMD functionality are given (memory organization and SIMD operations). The proposed trellis parallelization technique has been implemented on a DSP TMS320C6416: four parallel Max-Log-MAP algorithms are performed on 4 independent trellises. Implementation results shows that for a 600 MHz clock frequency, the parallel Max-Log-MAP algorithm achieves a throughput of 2560 Mbits/s. An accurate estimation for the complete turbo decoder with 8 decoding iterations gives a decoding throughput of 160 Mbits/s.

------

Full Paper: Click Here

------

Authors: David Gnaedig, Mathias Lapeyre, Florent Mouchoux, Emmanuel Boutillon

------

Referenceaccepted to GSPx 2004 Embedded Applications Software & Hardware Sept. 27-30, 2004, Santa Clara, CA USA.

------