RWTH Aachen University, Institute for Communication Systems and Data Processing

Speech and Audio Coding – Principles

Speech coding algorithms can be classified into waveform coders, vocoders and hybrid coders. These basic techniques are outlined below.

Waveform Coders

In the encoder, the signal dynamics can be reduced by fixed or adaptive quantization. Better results are obtained if (fixed or adaptive) prediction filtering, matched to the correlation properties of the signal, is employed. Under certain conditions, the prediction gain can be used to reduce the bit rate if the prediction error (residual) signal is quantized instead of the original signal. The prediction filter parameters may be adapted using the reconstructed signal.
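
The idea of quantizing the prediction error instead of the signal itself can be illustrated with a minimal sketch. It assumes a fixed first-order predictor and a uniform quantizer; the coefficient `a`, the step size and the test signal are illustrative choices, not values from any standard. The prediction is formed from the reconstructed signal, so a decoder could track it without access to the original samples.

```python
import math

def uniform_quantize(value, step):
    """Uniform quantizer: round to the nearest multiple of step."""
    return step * round(value / step)

def dpcm_encode_decode(signal, a=0.9, step=0.05):
    """Closed-loop DPCM with a fixed first-order predictor (assumed coefficient a).

    Returns (quantized residual, reconstructed signal).
    """
    residual_q = []
    reconstructed = []
    prev = 0.0  # previous reconstructed sample
    for x in signal:
        prediction = a * prev          # predict from the reconstructed past
        e = x - prediction             # prediction error (residual)
        e_q = uniform_quantize(e, step)
        y = prediction + e_q           # what the decoder would reconstruct
        residual_q.append(e_q)
        reconstructed.append(y)
        prev = y
    return residual_q, reconstructed

def power(seq):
    """Mean power of a sequence."""
    return sum(v * v for v in seq) / len(seq)

# Toy input: a slowly varying, strongly correlated sine wave.
signal = [math.sin(2 * math.pi * 0.01 * n) for n in range(400)]
residual_q, reconstructed = dpcm_encode_decode(signal)

# Ratio of signal power to residual power, in dB.
prediction_gain_db = 10 * math.log10(power(signal) / power(residual_q))
max_error = max(abs(x - y) for x, y in zip(signal, reconstructed))
```

Because the residual has far less power than the input, it can be represented with fewer quantizer levels at the same step size, which is where the bit rate saving comes from. The reconstruction error stays bounded by half the quantizer step.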

A well-known example of a waveform codec is the ADPCM (Adaptive Differential Pulse Code Modulation) scheme, which allows good signal reconstruction with a signal-to-noise ratio (SNR) of 30 ... 35 dB at a typical bit rate of 32 kbit/s. The ITU-T standard G.726 is applied in the DECT (Digital Enhanced Cordless Telecommunications) system. For high-quality ISDN applications, the wideband (7 kHz) standard ITU-T G.722 describes a subband codec in which two subband signals (0-4 and 4-7 kHz), obtained by a quadrature mirror filterbank (QMF), are encoded using ADPCM schemes.

Waveform coders do not explicitly rely on any speech specific signal characteristics. For this reason, waveform coding is also well suited for general audio signals, e.g., music.

Vocoders

In vocoders, it is not the signal samples but the parameters of a source-filter speech model that are quantized and transmitted, i.e., vocoders realize purely parametric speech coding. The respective source-filter synthesis representation closely follows the model of speech production.

 

[Figure: Synthesis section of a simple vocoder.]

The time-varying synthesis filter corresponds to the vocal tract and may include a model of the acoustic tube and the lip radiation. As an approximation, an all-pole model can be used. Using this filter corresponds to the principle of Linear Predictive Coding (LPC). The gain-scaled glottis excitation signal may contain periodic segments coming from an impulse generator, representing voiced sounds, or noise-like segments for unvoiced speech. Instead of switching between voiced and unvoiced excitation, enhanced models use a (spectrally weighted) mixture of both types (e.g., Multi-Band Excitation, MBE).
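
The synthesis structure described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: one frame with constant parameters, a hard voiced/unvoiced switch, and hypothetical second-order filter coefficients that merely form a stable resonance (they are not derived from real speech).

```python
import random

def excitation(num_samples, voiced, pitch_period=80, seed=0):
    """Impulse train for voiced frames, white Gaussian noise for unvoiced frames."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]

def all_pole_synthesis(exc, lpc, gain=1.0):
    """All-pole filter 1/A(z) with A(z) = 1 - sum_k lpc[k-1] * z^-k.

    y[n] = gain * exc[n] + sum_k lpc[k-1] * y[n-k]
    """
    out = []
    for n, u in enumerate(exc):
        y = gain * u
        for k, a_k in enumerate(lpc, start=1):
            if n - k >= 0:
                y += a_k * out[n - k]
        out.append(y)
    return out

# Hypothetical coefficients: poles at radius sqrt(0.8), hence a stable filter.
lpc = [1.2, -0.8]
voiced_frame = all_pole_synthesis(excitation(160, voiced=True), lpc, gain=0.5)
unvoiced_frame = all_pole_synthesis(excitation(160, voiced=False), lpc, gain=0.1)
```

Only the filter coefficients, the gain, the voicing decision and the pitch period would need to be transmitted per frame in such a scheme, which is why the resulting bit rate can be very low.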

Pure vocoders are used in particular for low bit rate applications (below 0.5 bits per sample).
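
To relate bits per sample to the more familiar bit rates, the short calculation below assumes narrowband telephone speech sampled at 8 kHz. The sampling rate is an assumption, and `bit_rate_kbps` is a hypothetical helper name used only for this illustration.

```python
def bit_rate_kbps(bits_per_sample, sampling_rate_hz=8000):
    """Bit rate in kbit/s for a given number of bits per sample."""
    return bits_per_sample * sampling_rate_hz / 1000.0

adpcm_rate = bit_rate_kbps(4)       # 4 bits per sample: the 32 kbit/s ADPCM case
vocoder_limit = bit_rate_kbps(0.5)  # 0.5 bits per sample: 4 kbit/s
```

Under this assumption, "below 0.5 bits per sample" means bit rates below 4 kbit/s, compared with 4 bits per sample for 32 kbit/s ADPCM.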

Hybrid Coders

This intermediate class between waveform coders and vocoders dominates the state-of-the-art speech coding solutions for medium bit rates (0.5 ... 2 bits per sample) and high quality, with applications particularly in digital (wireline and mobile) communication systems. As in vocoders, the parameters of an LPC synthesis filter are quantized and transmitted as side information (parameter channel). Moreover, periodic (voiced) portions of the speech signal are modelled by a second filter, called the LTP (long-term predictive) filter, which is typically realized as a comb filter.

The excitation signal is, however, obtained by quantizing the prediction error signal, in the spirit of waveform coders. Taking the properties of the human ear into account, this quantization may be quite coarse.
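
The comb-filter behaviour of the long-term predictor mentioned above can be illustrated with a minimal single-tap sketch. The lag and gain values are illustrative assumptions; practical coders re-estimate them for every (sub)frame.

```python
def ltp_synthesis(exc, lag, gain):
    """Long-term synthesis (comb) filter 1 / (1 - gain * z^-lag)."""
    out = []
    for n, u in enumerate(exc):
        y = u + (gain * out[n - lag] if n >= lag else 0.0)
        out.append(y)
    return out

def ltp_analysis(signal, lag, gain):
    """Inverse filter 1 - gain * z^-lag, which removes the long-term correlation."""
    return [x - (gain * signal[n - lag] if n >= lag else 0.0)
            for n, x in enumerate(signal)]

# A single pulse becomes a decaying pulse train with period `lag`.
pulse = [1.0] + [0.0] * 199
periodic = ltp_synthesis(pulse, lag=40, gain=0.5)

# The analysis filter undoes the synthesis filter and recovers the single pulse.
recovered = ltp_analysis(periodic, lag=40, gain=0.5)
```

The synthesis filter repeats its input every `lag` samples with decaying amplitude, which models the pitch periodicity of voiced speech; the analysis filter removes exactly this long-term redundancy.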

Many variations of this class of adaptive predictive coders exist. The ETSI-GSM Full Rate codec (GSM 06.10), which is implemented in most GSM digital cellular networks, is an example of the family of RELP (Residual Excited Linear Prediction) coders.

The majority of modern hybrid speech coders are based on the principle of linear-predictive analysis-by-synthesis coding, also known as CELP (Code-Excited Linear Prediction).

[Figure: Basic principle of a CELP encoder. The shaded blocks comprise the CELP decoder in the analysis-by-synthesis loop.]

In a CELP coder, an optimum excitation signal vector is determined by a closed-loop criterion that evaluates the weighted error between the (original) input signal and the (decoded) output speech signal in a minimum mean square error (MMSE) sense. The perceptual weighting filter shapes the spectrum of the reconstruction error, exploiting the masking properties of the human ear. In formant regions of the short-term spectrum of the signal, larger error portions are permitted. Consequently, this type of coder not only exploits source redundancies (i.e., short-term and long-term correlations) by means of the prediction filters, but also profits from irrelevancies due to the signal sink, the human ear.
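
The closed-loop selection can be sketched as an exhaustive search. The sketch makes several simplifying assumptions: the target is taken as already perceptually weighted, the synthesis filter starts from zero state, there is no long-term predictor, and the codebook is filled with random vectors. All names and values are hypothetical.

```python
import random

def synthesize(exc, lpc):
    """Zero-state all-pole synthesis: y[n] = exc[n] + sum_k lpc[k-1] * y[n-k]."""
    out = []
    for n, u in enumerate(exc):
        y = u
        for k, a_k in enumerate(lpc, start=1):
            if n - k >= 0:
                y += a_k * out[n - k]
        out.append(y)
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search_codebook(target, codebook, lpc):
    """Exhaustive analysis-by-synthesis search.

    For each candidate, the optimal gain is g = <t, s> / <s, s> and the
    remaining squared error is <t, t> - <t, s>^2 / <s, s>.
    Returns (best index, best gain, squared error).
    """
    best = None
    for index, vector in enumerate(codebook):
        s = synthesize(vector, lpc)   # decode the candidate inside the encoder
        energy = dot(s, s)
        if energy == 0.0:
            continue                  # skip all-zero candidates
        corr = dot(target, s)
        gain = corr / energy
        error = dot(target, target) - corr * corr / energy
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best

rng = random.Random(1)
lpc = [1.2, -0.8]  # hypothetical stable filter coefficients
codebook = [[rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(64)]

# Build a target that is exactly reachable with entry 17 and a gain of 0.7.
target = [0.7 * v for v in synthesize(codebook[17], lpc)]
index, gain, error = search_codebook(target, codebook, lpc)
```

Every candidate has to be passed through the synthesis filter before it can be compared with the target, so the cost of this search grows linearly with the codebook size.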

The codebook, i.e., the excitation generator, is known in both encoder and decoder. Usually, the excitation is composed of the sum of an adaptive codebook contribution (replacing the LTP filter) and a fixed codebook contribution. Once the optimum excitation sequence has been found, only the indices of the selected entries of the fixed and adaptive codebooks (together with their gain factors) have to be transmitted.
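
How the excitation is assembled from the two contributions can be sketched as follows. The sketch assumes an integer lag and one gain per contribution; the function names, the buffer contents and the gain values are hypothetical.

```python
def adaptive_codebook_vector(past_excitation, lag, length):
    """Past excitation delayed by `lag` samples.

    If lag < length, the vector repeats itself periodically.
    Requires 1 <= lag <= len(past_excitation).
    """
    buf = list(past_excitation)
    out = []
    for _ in range(length):
        sample = buf[-lag]   # sample located `lag` positions back in time
        out.append(sample)
        buf.append(sample)   # make it available for periodic repetition
    return out

def build_excitation(past_excitation, lag, adaptive_gain, fixed_vector, fixed_gain):
    """Sum of the scaled adaptive and fixed codebook contributions."""
    adaptive = adaptive_codebook_vector(past_excitation, lag, len(fixed_vector))
    return [adaptive_gain * a + fixed_gain * c
            for a, c in zip(adaptive, fixed_vector)]

past = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]   # toy excitation history
fixed = [0.0, 1.0, 0.0, 0.0]            # toy fixed codebook entry
adaptive = adaptive_codebook_vector(past, lag=3, length=4)
excitation = build_excitation(past, lag=3, adaptive_gain=0.5,
                              fixed_vector=fixed, fixed_gain=2.0)
```

In this toy case the adaptive contribution is `[1.0, 0.0, 0.0, 1.0]`: the pulse from the history reappears at the lag and is repeated once more because the lag is shorter than the vector length.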

An exhaustive search over all excitation vectors of the (typically very extensive) codebook makes the codec so complex that it often cannot be implemented even on modern signal processors. Therefore, many structured codebooks have been investigated in order to overcome this complexity problem. Variants of analysis-by-synthesis coders include structures such as ACELP (Algebraic CELP), RPE (Regular Pulse Excitation), MPE (Multi-Pulse Excitation), and VSELP (Vector-Sum Excited Linear Prediction).

ACELP coders in particular have been selected for several standards, such as the GSM Enhanced Full Rate codec (ETSI-GSM 06.60), the IS-641 codec for the US TDMA system IS-136, and the ITU-T G.729 general-purpose codec family. Furthermore, the GSM Adaptive Multirate (GSM-AMR) and Adaptive Multirate Wideband codecs are based on ACELP technology.