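Contributions to Statistical Modeling and Efficient Decoding in Automatic Speech Recognition (original title: Beiträge zur statistischen Modellierung und effizienten Dekodierung in der automatischen Spracherkennung)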
This thesis addresses several aspects of automatic speech recognition. After an introduction describing the most important fundamental ideas, methodologies, and algorithms, several new approaches for optimizing the acoustic modeling component of a speech recognition system are outlined and evaluated. The goal is to fine-tune the selected model structure to the quantity and type of the available acoustic training data. In experiments on internationally known speech recognition tasks, the new modeling scheme outperforms conventional systems by approximately 10% in recognition performance. In addition, tree-based clustering of context-dependent model states is extended so that phonetic categories no longer need to be specified. The recognition system clustered with this procedure achieves recognition performance similar to that of the best systems in the official evaluation of the Wall Street Journal large-vocabulary recognition task with 5,000 words. Furthermore, discriminative training procedures for acoustic modeling are discussed and evaluated. Vocabulary-based discriminative training is proposed, and its extension to vocabulary- and language-model-based training is outlined in detail. The experimental results demonstrate that the approach yields better parameter estimates than Maximum-Likelihood training and conventional frame-based discriminative training. Additionally, new hybrid recognition systems with discriminatively trained preprocessing are presented. The hybrid recognition system with context-dependent modeling, built for the experiments on the Resource Management database, achieves one of the best error rates reported for comparable systems.
The following part presents the two most common ways of organizing the decoding procedure and describes and evaluates the author's contributions in this area. Time-synchronous Viterbi decoding with a tree-structured recognition network that uses partial tree copies and language model smearing proves to be a powerful and efficient decoding approach for a bigram language model. The proposed A-Posteriori pruning and A-Posteriori-Lookahead pruning accelerate decoding further while causing only a relatively small additional search error. Moreover, the principle of stack decoding is illustrated, which is of great advantage when using language models with longer context. The developed stack decoder "DUcoder" is introduced. In evaluations, it decodes with a 95,000-word vocabulary and a trigram language model in almost real time, although this still comes at the cost of a substantial search error. Finally, the German large-vocabulary speech recognition system "DuDeutsch" developed by the author is presented. It supports both speaker-independent and speaker-dependent recognition with a vocabulary of up to 95,000 words. For acoustic modeling, the clustering and structure optimization procedures presented in the thesis are applied; decoding is performed with the presented stack decoder. The speaker-dependent models are derived from the speaker-independent ones using adaptation techniques. The proposed discriminative adaptation approach yields an approximately 15% greater error reduction than the common Maximum-Likelihood approach.
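As a rough illustration of the time-synchronous Viterbi decoding with pruning mentioned above, the following minimal Python sketch runs a frame-by-frame Viterbi search over a small hidden Markov model and discards hypotheses that fall too far below the best one in each frame. It is only a generic beam-pruning example under simplifying assumptions: it does not reproduce the tree-structured network, the language model smearing, or the A-Posteriori pruning criteria of the thesis. The function name `viterbi_beam` and all parameter names are hypothetical.

```python
import math


def viterbi_beam(log_init, log_trans, log_emit, beam=10.0):
    """Time-synchronous Viterbi search with simple beam pruning (illustrative).

    log_init[s]     : log probability of starting in state s
    log_trans[p][s] : log probability of moving from state p to state s
    log_emit[t][s]  : log likelihood of frame t in state s
    beam            : hypotheses scoring more than `beam` below the best
                      hypothesis of the current frame are discarded
    Returns (best_state_path, best_log_score).
    """
    n_states = len(log_init)
    # Active hypotheses after the first frame: state -> best log score.
    active = {s: log_init[s] + log_emit[0][s] for s in range(n_states)}
    backpointers = []  # one dict per frame t >= 1: state -> predecessor

    for t in range(1, len(log_emit)):
        # Beam pruning: keep only hypotheses close to the current best.
        best = max(active.values())
        active = {s: v for s, v in active.items() if v >= best - beam}

        new_active = {}
        frame_bp = {}
        # Expand every surviving hypothesis by one frame.
        for prev, score in active.items():
            for s in range(n_states):
                cand = score + log_trans[prev][s] + log_emit[t][s]
                if cand > new_active.get(s, -math.inf):
                    new_active[s] = cand
                    frame_bp[s] = prev
        active = new_active
        backpointers.append(frame_bp)

    # Trace back from the best final state.
    state = max(active, key=active.get)
    best_score = active[state]
    path = [state]
    for frame_bp in reversed(backpointers):
        state = frame_bp[state]
        path.append(state)
    path.reverse()
    return path, best_score
```

A smaller `beam` value expands fewer hypotheses per frame and so speeds up the search, but it can prune the hypothesis that would have become the best path; this is the kind of search error that the abstract weighs against decoding speed.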