Audio-visual speech recognition using MPEG-4 compliant visual features - MaRDI portal

Audio-visual speech recognition using MPEG-4 compliant visual features (Q1424526)

From MaRDI portal





scientific article; zbMATH DE number 2058708
Language Label Description Also known as
English
Audio-visual speech recognition using MPEG-4 compliant visual features
scientific article; zbMATH DE number 2058708

    Statements

    Audio-visual speech recognition using MPEG-4 compliant visual features (English)
    16 March 2004
    Summary: We describe an audio-visual automatic continuous speech recognition system, which significantly improves speech recognition performance over a wide range of acoustic noise levels, as well as under clean audio conditions. The system utilizes facial animation parameters (FAPs) supported by the MPEG-4 standard for the visual representation of speech. We also describe a robust and automatic algorithm we have developed to extract FAPs from visual data, which does not require hand labeling or extensive training procedures. Principal component analysis (PCA) was performed on the FAPs in order to decrease the dimensionality of the visual feature vectors, and the derived projection weights were used as visual features in the audio-visual automatic speech recognition (ASR) experiments. Both single-stream and multistream hidden Markov models (HMMs) were used to model the ASR system, integrate audio and visual information, and perform relatively large-vocabulary (approximately 1000 words) speech recognition experiments. The experiments use clean audio data and audio data corrupted by stationary white Gaussian noise at various SNRs. The proposed system reduces the word error rate (WER) by \(20\%\) to \(23\%\) relative to audio-only speech recognition WERs at various SNRs (0--30\,dB) with additive white Gaussian noise, and by \(19\%\) relative to the audio-only speech recognition WER under clean audio conditions.
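
The WER figures in the summary are relative reductions, that is, the drop in WER expressed as a fraction of the audio-only WER rather than a difference in percentage points. A minimal sketch of that arithmetic follows; the function name `relative_wer_reduction` and the example WER values are illustrative assumptions and are not taken from the article.

```python
def relative_wer_reduction(wer_audio_only: float, wer_audio_visual: float) -> float:
    """Return the WER reduction as a fraction of the audio-only WER."""
    if wer_audio_only <= 0:
        raise ValueError("audio-only WER must be positive")
    return (wer_audio_only - wer_audio_visual) / wer_audio_only


# Hypothetical WERs in percent (assumed for illustration, not reported values).
baseline_wer = 50.0       # audio-only system
audio_visual_wer = 40.0   # audio-visual system

reduction = relative_wer_reduction(baseline_wer, audio_visual_wer)
print(f"relative WER reduction: {reduction:.0%}")  # prints "relative WER reduction: 20%"
```

With these assumed values the absolute improvement is 10 percentage points, while the relative reduction is 20%, which is the kind of figure the summary reports.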

    Identifiers