05 March, 2025

Speech coding in DMR, D-STAR and System Fusion

Figure: Example of a speech signal that transitions from an unvoiced to a voiced sound

Digital VHF and UHF communication using DMR, D-STAR and System Fusion is increasingly popular among radio amateurs. Few radio amateurs, however, have much understanding of the signal processing required to compress a speech signal so that it can be transmitted at the low bit rates available. Here is an attempt at an explanation.

A communications-quality speech signal is usually sampled at 8 kHz, with each sample represented by 8 bits. This corresponds to 8000 x 8 = 64 kbps. In order to send digital speech over the same bandwidth as analog FM on VHF or UHF, the bit rate must be significantly reduced.

There are many types of speech coders, but the one used in the DMR (Digital Mobile Radio), D-STAR and System Fusion systems is a commercial speech coder called AMBE. This vocoder (voice coder) can reduce the bit rate of a speech signal to somewhere between 2 and 9.6 kbps. Amateur radio systems operate at the lower end of this range. This means that the coded speech signal takes up only 3-4% of the bits that the uncoded signal would have needed. To achieve this, one needs to exploit properties of how the speech signal is generated in the interaction between the vocal cords and the oral cavity.
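The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch: the names are chosen for illustration, and the 2.45 kbps value is the DMR voice rate quoted later in this post.

```python
SAMPLE_RATE_HZ = 8000   # communications-quality sampling rate
BITS_PER_SAMPLE = 8     # bits used for each sample

# Bit rate of the uncoded speech signal: 8000 samples/s x 8 bits = 64000 bps
pcm_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE


def fraction_of_pcm(coded_bps):
    """Return the coded bit rate as a fraction of the uncoded bit rate."""
    return coded_bps / pcm_bps


# Lower end, the DMR voice rate, and upper end of the AMBE range
for coded_kbps in (2.0, 2.45, 9.6):
    share = fraction_of_pcm(coded_kbps * 1000)
    print(f"{coded_kbps:5.2f} kbps is {share:.1%} of the uncoded bit rate")
```

Bit rates between roughly 1.9 and 2.6 kbps fall within the 3-4% range mentioned above.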

The first figure shows a section of a speech signal that involves the transition from an unvoiced sound, "s", to a voiced sound, "e" (as in "send"). The unvoiced sound looks like noise, and in this case it is generated by friction as air flows out between the teeth. The voiced part has a characteristic repetition period of about 11 ms. This is the pitch period; the pitch frequency is its inverse, i.e. about 90 Hz. Its source is the oscillation of the vocal cords, which one can feel by touching the Adam's apple during a voiced sound. The rapid oscillation between the pitch excitation pulses is due to resonances in the oral cavity.
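To make the relation between pitch period and pitch frequency concrete, here is a minimal sketch that estimates the period of a synthetic voiced sound by searching for the autocorrelation peak. Autocorrelation is a common textbook method and is used here only as an illustration; it is not claimed to be what AMBE does. The test signal, the 500 Hz resonance and the 60-400 Hz search range are all assumptions made for the example.

```python
import math

FS = 8000  # sampling rate in Hz, as in the text


def estimate_pitch_period(signal, fs, f_min=60.0, f_max=400.0):
    """Return the lag (in samples) with the largest autocorrelation
    inside the assumed pitch range f_min..f_max."""
    lag_min = int(fs / f_max)
    lag_max = int(fs / f_min)
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        value = sum(signal[n] * signal[n - lag] for n in range(lag, len(signal)))
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag


# Synthetic voiced sound: one pulse every 88 samples (11 ms at 8 kHz),
# each pulse exciting a decaying 500 Hz oscillation (a stand-in for a resonance).
PERIOD = 88
LENGTH = 800
signal = [0.0] * LENGTH
for start in range(0, LENGTH, PERIOD):
    for k in range(min(80, LENGTH - start)):
        signal[start + k] += math.exp(-k / 20) * math.sin(2 * math.pi * 500 * k / FS)

lag = estimate_pitch_period(signal, FS)
print(f"period: {lag} samples = {1000 * lag / FS:.1f} ms, pitch: {FS / lag:.1f} Hz")
```

For this synthetic signal the search finds a lag of 88 samples, i.e. 11 ms, or a pitch of about 91 Hz.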

A block diagram of how speech is generated in a speech decoder is shown in the next figure. The filter, an Infinite Impulse Response (IIR) filter usually of order 10, recreates the resonances in the oral cavity.
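As a rough sketch of the synthesis filter, the code below runs a pulse train through an all-pole IIR filter. For brevity it uses order 2, which gives a single resonance, rather than the order 10 used in practice, which can model about five resonances. The function names, the 500 Hz resonance and the pole radius of 0.95 are assumptions chosen for illustration.

```python
import math

FS = 8000  # sampling rate in Hz


def all_pole_filter(excitation, a):
    """All-pole IIR filter: y[n] = x[n] - a[0]*y[n-1] - a[1]*y[n-2] - ..."""
    output = []
    for n, x in enumerate(excitation):
        y = x
        for k, coeff in enumerate(a, start=1):
            if n - k >= 0:
                y -= coeff * output[n - k]
        output.append(y)
    return output


def resonator_coefficients(freq_hz, radius):
    """Coefficients of a second-order resonance (one complex pole pair)."""
    theta = 2 * math.pi * freq_hz / FS
    return [-2 * radius * math.cos(theta), radius * radius]


# Excitation: vocal cord pulses every 88 samples (11 ms), four periods long
excitation = [1.0 if n % 88 == 0 else 0.0 for n in range(352)]

# One resonance at 500 Hz; a radius below 1 keeps the filter stable
a = resonator_coefficients(500.0, 0.95)
speech_like = all_pole_filter(excitation, a)
```

Each pulse makes the filter ring at 500 Hz and die out before the next pulse arrives, which resembles the rapid oscillation between pitch pulses in the first figure.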

AMBE, which stands for Advanced Multi-Band Excitation, is a model in which the input signal to the filter is divided into different frequency bands, which is why several inputs are shown. These inputs make it possible to reproduce both the vocal cord pulses and the effect of friction in various narrow passages in the oral cavity or between the teeth. All the parameters in the speech coder must be updated about 50 times per second to follow the dynamics of the speech signal.
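The multi-band idea can be illustrated with a much simplified sketch. AMBE itself is proprietary, so this is not its algorithm: here every band is built from harmonics of the pitch, with aligned phases in bands marked as voiced (which gives pulses) and random phases in bands marked as unvoiced (which gives a more noise-like waveform). The four 1 kHz bands and all names are assumptions made for the example.

```python
import math
import random

FS = 8000                      # sampling rate in Hz
FRAME_RATE = 50                # parameter updates per second
FRAME_LEN = FS // FRAME_RATE   # 160 samples, i.e. 20 ms per frame


def excitation_frame(f0, band_voiced, rng, band_width=1000.0):
    """Build one frame of excitation from harmonics of the pitch f0.

    band_voiced[i] tells whether band i (band_width Hz wide) is voiced.
    Voiced bands use phase 0 for all harmonics, unvoiced bands use random phases.
    """
    frame = [0.0] * FRAME_LEN
    harmonic = 1
    while harmonic * f0 < FS / 2:
        freq = harmonic * f0
        band = min(int(freq // band_width), len(band_voiced) - 1)
        phase = 0.0 if band_voiced[band] else rng.uniform(0.0, 2 * math.pi)
        for n in range(FRAME_LEN):
            frame[n] += math.cos(2 * math.pi * freq * n / FS + phase)
        harmonic += 1
    return frame


rng = random.Random(1)
fully_voiced = excitation_frame(100.0, [True, True, True, True], rng)
mixed = excitation_frame(100.0, [True, True, False, False], rng)
```

In a decoder, a new pitch value and a new set of voiced/unvoiced decisions would arrive for every 20 ms frame. With all bands voiced and a 100 Hz pitch, the frame contains sharp pulses 80 samples apart; marking the upper bands as unvoiced smears the energy there between the pulses.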

The most important information in the voice usually has the vocal cord pulses as its source. At low bit rates, the part of the excitation that encodes this is therefore given higher priority than the more random components that are also part of the sound. This easily results in a buzzing, metallic sound in digital speech, where the characteristics of individual speakers' voices tend to be lost.

AMBE allows for variable bit rates, so the speech quality of the systems in use may vary, depending on channel quality among other things. As an example, DMR transmits voice at a bit rate of 2.45 kbps. This is very low compared to typical mobile phone bit rates, which can be in the range of 6.5 to 12 kbps, and it explains much of the degradation in sound quality.
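One way to appreciate how little 2.45 kbps is, is to count the bits available each time the parameters are updated. The sketch below assumes the 50 updates per second mentioned earlier, and, for the sake of comparison only, that the mobile phone coders use the same frame rate.

```python
FRAMES_PER_SECOND = 50  # parameter updates per second (20 ms frames)


def bits_per_frame(bit_rate_bps):
    """Number of bits available to describe one frame of speech."""
    return bit_rate_bps / FRAMES_PER_SECOND


for name, rate_bps in (("DMR voice", 2450),
                       ("mobile, low end", 6500),
                       ("mobile, high end", 12000)):
    print(f"{name:17s}: {bits_per_frame(rate_bps):5.0f} bits per 20 ms frame")
```

At 2.45 kbps only 49 bits are left to describe the pitch, the band decisions and the filter for each 20 ms of speech, compared with 130 to 240 bits at mobile phone rates.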

Analyzing the voice signal to find these parameters is much more complicated than the recovery, or synthesis, described here and is therefore not covered.

This has been a brief description of the principles behind low bit rate speech coding. It was first written for the chapter on digital signal processing in the revised textbook for Norwegian radio amateurs due later this year.
