Context

One of the most difficult tasks in speech processing is to locate the boundaries of the phonetic units present in the signal. Phones are strongly co-articulated and there are no clear borders between them, so the link between the linguistic and the acoustic segmentation is hard to define. Whatever coding level is chosen (word, syllable, phone), the acoustic variability of the speech signal hampers all alignment efforts, and its ambiguity challenges every proposed definition.

Overview

We have developed an audio segmentation algorithm that operates directly at the acoustic signal level. If you are interested in the Python or C code, please contact us.

Forward-backward segmentation

This segmentation is provided by the Forward-Backward Divergence algorithm, which is based on a statistical study of the acoustic signal.

Assuming that the speech signal can be described as a string of quasi-stationary units, each unit is characterized by an autoregressive (AR) Gaussian model. The method consists in detecting changes in the AR models. At every instant, two AR models M0 and M1 are identified, and the mutual entropy between their conditional laws quantifies the distance between them. M0 represents the segment grown from the last break in the signal, while M1 is estimated on a short sliding window that also starts after the last break. When the distance between the models changes by more than a given threshold, a new break is declared in the signal. The algorithm detects three kinds of segments: short or impulsive, transient, and quasi-stationary. An example of the segmentation into infra-phonetic units is shown below.
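To make the detection procedure concrete, here is a minimal Python sketch of the forward pass of such a divergence test. It is not the code we distribute (in particular, the published algorithm also runs a backward pass that refines each detected boundary): the AR order p, the window length, the threshold lam and all function names are illustrative assumptions.

    import numpy as np

    def yule_walker(x, p):
        """Fit an AR(p) Gaussian model by Yule-Walker; return the
        coefficients and the innovation variance."""
        n = len(x)
        r = np.array([x[:n - k] @ x[k:] for k in range(p + 1)]) / n
        R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
        a = np.linalg.solve(R + 1e-9 * np.eye(p), r[1:])
        return a, max(r[0] - a @ r[1:], 1e-12)

    def residual(x, t, a, p):
        """One-step prediction error of sample x[t] under AR coefficients a."""
        return x[t] - a @ x[t - p:t][::-1]

    def segment(x, p=2, short=20, lam=40.0):
        """Forward divergence test (sketch): compare the long-term model M0,
        grown from the last break, with a short-window model M1, and declare
        a break when the cumulative log-likelihood ratio of M0 against M1
        falls lam below its running maximum."""
        breaks, start = [], 0
        t = start + short + p
        while t < len(x):
            w = w_max = 0.0
            while t < len(x):
                a0, v0 = yule_walker(x[start:t], p)       # M0: since the last break
                a1, v1 = yule_walker(x[t - short:t], p)   # M1: short sliding window
                e0 = residual(x, t, a0, p)
                e1 = residual(x, t, a1, p)
                # log p0(x[t]) - log p1(x[t]) for the two Gaussian AR models:
                # roughly flat while M0 still fits, drifts downward after a change
                w += 0.5 * np.log(v1 / v0) + e1 * e1 / (2 * v1) - e0 * e0 / (2 * v0)
                w_max = max(w_max, w)
                if w_max - w > lam:                       # sustained drift: new break
                    breaks.append(t)
                    start = t
                    t += short + p
                    break
                t += 1
        return breaks

On a synthetic signal whose statistics change abruptly, the test behaves as expected:

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(0, 1.0, 2000), rng.normal(0, 3.0, 2000)])
    print(segment(x))   # expect a break reported shortly after sample 2000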

Results of the speech segmentation algorithm

The use of an a priori segmentation partially removes the redundancy of long sounds, and segment-level analysis is well suited to locating coarse features. This approach has given interesting results in automatic speech recognition: experiments have shown that segmental durations carry pertinent information.
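As a purely illustrative follow-up, the helper below turns the break indices produced by the sketch above into the kind of duration features mentioned here; the sampling rate and the function name are assumptions for the example, not part of our experiments.

    def segment_durations_ms(breaks, n_samples, fs=16000):
        """Durations, in milliseconds, of the units delimited by the
        detected breaks (fs = 16 kHz is an assumed sampling rate)."""
        bounds = [0, *breaks, n_samples]
        return [1000.0 * (b - a) / fs for a, b in zip(bounds, bounds[1:])]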

Contributors