The Sound of Silence: How Traditional and Deep Learning Based Voice Activity Detection Influences Speech Quality Monitoring
|Title:||The Sound of Silence: How Traditional and Deep Learning Based Voice Activity Detection Influences Speech Quality Monitoring||Authors:||Jaiswal, Rahul; Hines, Andrew||Permanent link:||http://hdl.handle.net/10197/11367||Date:||7-Dec-2018||Online since:||2020-05-05T14:51:49Z||Abstract:||Real-time speech quality assessment is important for VoIP applications such as Google Hangouts, Microsoft Skype, and Apple Face-Time. Conventionally, subjective listening tests are used to quantify speech quality but are impractical for real-time monitoring scenarios. Objective speech quality assessment metrics can predict human judgement of perceived speech quality. Originally designed for narrow-band telephony applications, ITU-T P.563 is a single-ended or non-intrusive speech quality assessment that predicts speech quality without access to a reference signal. This paper investigates the suitability of P.563 in Voice over Internet Protocol (VoIP) scenarios and specifically the influence of silences on the predicted speech quality. The performance of P.563 was evaluated using TCD-VoIP dataset, containing speech with degradations commonly experienced with VoIP. The predictive capability of P.563 was established by comparing with subjective listening test results. The effect of pre-processing the signal to remove silences using Voice Activity Detection (VAD) was evaluated for five acoustic feature-based VAD algorithms: energy, energy and spectral centroid, Mahalanobis distance, weighted energy, weighted spectral centroid and four Deep learning model-based VAD algorithms: Deep Neural Network, Boosted Deep Neural Network, Long Short-Term Memory and Adaptive context attention model. Analysis shows P.563 prediction accuracy improves for different speech conditions of VoIP when the silences were removed by a VAD. The improvements varied with input content highlighting a potential to switch the VAD used based on the input to create a content aware speech quality monitoring system.||Funding Details:||European Commission - European Regional Development Fund
Science Foundation Ireland
|Type of material:||Conference Publication||Copyright (published version):||2018 the Authors||Keywords:||Speech quality; Voice activity detection||Other versions:||http://ceur-ws.org/Vol-2259/||Language:||en||Status of Item:||Peer reviewed||Is part of:||Brennan, R. B., Beel, J., Byrne, R., Debattista, J. and Crotti Junior, A. (eds.). Proceedings for the 26th AIAI Irish Conference on Artificial Intelligence and Cognitive Science Trinity College Dublin||Conference Details:||The 26th Irish Conference on Artificial Intelligence and Cognitive Science, Trinity College Dublin, Ireland, 6-7 December 2018|
|Appears in Collections:||Computer Science Research Collection|
Show full item record
This item is available under the Attribution-NonCommercial-NoDerivs 3.0 Ireland. No item may be reproduced for commercial purposes. For other possible restrictions on use please refer to the publisher's URL where this is made available, or to notes contained in the item itself. Other terms may apply.