  • Publication
    Data-driven Quality of Experience for Digital Audio Archives
    (University College Dublin. School of Computer Science, 2022)
    The digitization of sound archives began as a way to safeguard records that naturally deteriorate due to the irreversible chemical processes of the sound carriers. The digitization process has improved the usability and accessibility of audio archives and provided the possibility of using digital restoration. Assessing the quality of digitization, restoration, and audio archive consumption is essential for evaluating sound archive practices. The state of the art in the digitization, restoration, and consumption of audio archives has neglected quality assessment approaches that are automatic and take into account the user's perspective. This thesis aims to understand and define the quality of experience (QoE) of sound archives and proposes data-driven objective metrics that can predict the QoE of music audio archives in the absence of human listeners. The author proposes a paradigm shift to deal with the problem of quality assessment in sound archives by focusing on quality metrics for musical signals based on deep learning, which are developed and evaluated using annotations obtained with listening tests. The adaptation of the QoE framework for audio archive evaluation is proposed to consider the user's perspective and define QoE in sound archives. The author, in a case study of audio archive consumption, proposes a curated and annotated dataset of real-world music recordings from vinyl collections and three objective quality metrics. The thesis shows that annotating a dataset of real-world music recordings requires a different approach to preparing the stimuli and proposes a technique based on stratified random sampling from clusters. The three proposed quality metrics are based on learning feature representations with three different tasks: degradation classification, deep convolutional embedded clustering (DCEC), and self-supervised learning (SSL).
The first two tasks use an architecture based on framewise convolutional neural networks, while the SSL task is based on pre-training and fine-tuning wav2vec 2.0 on musical signals. This thesis demonstrates that degradation classification, DCEC, and wav2vec 2.0 learn musical representations that are useful for predicting the quality of vinyl collections. More specifically, the proposed metrics outperform two baselines when fine-tuned on small annotated sets. The author also proposes a new correlation-based feature representation for classifying audio carriers, which outperforms raw feature representations in terms of speed and feature dimensionality. Classifying audio carriers can serve as a preliminary step for the quality metrics above when predicting the quality of multiple collections. The significance of the proposed work is that audio archive metadata can be enriched with quality labels produced by the proposed metrics. Overall, the thesis encourages scholars and stakeholders to adopt a paradigm shift when evaluating the quality of sound archives, i.e. moving from a manual, system-centric approach to a more automatic, user-centric approach.
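The stimulus-selection technique the abstract mentions, stratified random sampling from clusters, can be sketched as follows. This is a minimal illustration, not the thesis's exact pipeline: the number of recordings, clusters, and stimuli per cluster are illustrative assumptions, and the cluster labels stand in for the output of a clustering step over audio features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in cluster labels: 200 recordings grouped into 5 clusters
# (in practice these would come from clustering audio features).
labels = rng.integers(0, 5, size=200)

# Stratified random sampling: draw the same number of stimuli at
# random from every cluster, so the listening-test set spans the
# collection's diversity rather than its majority style.
per_cluster = 4
selected = np.concatenate([
    rng.choice(np.flatnonzero(labels == c), size=per_cluster, replace=False)
    for c in range(5)
])
print(selected.size)  # 20 stimuli, 4 per cluster
```

Sampling within clusters rather than uniformly over the whole collection prevents the annotated set from being dominated by whichever musical style is most common in the archive.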
  • Publication
    Audio Impairment Recognition using a Correlation-Based Feature Representation
    Audio impairment recognition consists of detecting noise in audio files and categorising the impairment type. Recently, significant performance improvements have been obtained thanks to advanced deep learning models. However, feature robustness is still an unresolved issue, and it is one of the main reasons why powerful deep learning architectures are needed. In the presence of a variety of musical styles, handcrafted features are less efficient at capturing audio degradation characteristics: they are prone to failure when recognising audio impairments and may mistakenly capture musical concepts rather than impairment types. In this paper, we propose a new representation of handcrafted features that is based on the correlation of feature pairs. We experimentally compare the proposed correlation-based feature representation with a typical raw feature representation used in machine learning and show superior performance in terms of compact feature dimensionality and computational speed in the test stage, whilst achieving comparable accuracy.
      Scopus© Citations: 2
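The core idea of a correlation-based feature representation can be sketched briefly. This is a hypothetical illustration under stated assumptions, not the paper's exact method: the 12 framewise descriptors and 500 frames are invented stand-ins, and the representation is simply the upper triangle of the Pearson correlation matrix over feature pairs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for framewise handcrafted features: 12 descriptors
# (e.g. spectral and temporal statistics) over 500 analysis frames.
n_features, n_frames = 12, 500
frames = rng.normal(size=(n_features, n_frames))

# Raw representation: all framewise values flattened
# (12 * 500 = 6000 dimensions, grows with clip length).
raw = frames.ravel()

# Correlation-based representation: Pearson correlation of every
# feature pair, keeping only the upper triangle of the symmetric
# matrix -> 12 * 11 / 2 = 66 dimensions, independent of clip length.
corr = np.corrcoef(frames)
iu = np.triu_indices(n_features, k=1)
compact = corr[iu]

print(raw.size, compact.size)  # 6000 vs 66
```

The compact vector's fixed, length-independent dimensionality is what yields the reported speed-up at test time.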
  • Publication
    Adapting the Quality of Experience Framework for Audio Archive Evaluation
    The perceived quality of historical audio material that undergoes digitisation and restoration is typically evaluated by individual judgements or with inappropriate objective quality models. This paper presents a Quality of Experience (QoE) framework for predicting the perceived audio quality of sound archives. The approach consists of adapting concepts used in QoE evaluation to digital audio archives. The limitations of current objective quality models employed in audio archives are identified, and the reasons why a QoE-based framework can overcome these limitations are discussed. This paper shows that applying a QoE framework to audio archives is feasible and helps to identify the stages, stakeholders and models for a QoE-centric approach.
      Scopus© Citations: 6
  • Publication
    Fusion confusion: Exploring ambisonic spatial localisation for audio-visual immersion using the McGurk effect
    Virtual Reality (VR) is attracting the attention of application developers for purposes beyond entertainment, including serious games, health, education and training. By including 3D audio, the overall VR quality of experience (QoE) can be enhanced through greater immersion. A better understanding of the perception of spatial audio localisation in audio-visual immersion is needed, especially in streaming applications where bandwidth is limited and compression is required. This paper explores the impact of audio-visual fusion on speech due to mismatches between a perceived talker location and the corresponding sound, using a phenomenon known as the McGurk effect and binaurally rendered Ambisonic spatial audio. The illusion of the McGurk effect occurs when the sound of one syllable, paired with video of a second syllable, gives the perception of a third syllable. For instance, the sound of /ba/ dubbed onto video of /ga/ leads to the illusion of hearing /da/. Several studies have investigated factors involved in the McGurk effect, but little has been done to understand the effect of spatial audio on this illusion. 3D spatial audio generated with Ambisonics has been shown to provide satisfactory QoE with respect to the localisation of sound sources, which makes it suitable for VR applications, but it has not been assessed for audio-visual talker scenarios. In order to test the perception of the McGurk effect at different directions of arrival (DOA) of sound, we rendered Ambisonic signals at azimuths of 0°, 30°, 60°, and 90° to both the left and right of the video source. The results show that audio-visual fusion significantly affects the perception of speech, yet spatial audio does not significantly impact the illusion. This finding suggests that precise localisation of speech audio might not be critical for speech intelligibility; a more significant factor was the intelligibility of the speech itself.
      Scopus© Citations: 5
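Placing a mono source at the azimuths used in the experiment can be sketched with first-order Ambisonic (B-format) encoding. This is a simplified illustration, not the paper's rendering chain: it assumes the traditional B-format convention with the W channel scaled by 1/√2 (channel conventions such as FuMa vs. AmbiX differ in ordering and normalisation), and the binaural decoding stage used in the study is omitted.

```python
import numpy as np

def foa_encode(signal, azimuth_deg, elevation_deg=0.0):
    """Encode a mono signal into first-order Ambisonics B-format
    (W, X, Y, Z) for a source at the given azimuth and elevation."""
    az = np.radians(azimuth_deg)
    el = np.radians(elevation_deg)
    w = signal / np.sqrt(2.0)           # omnidirectional component
    x = signal * np.cos(az) * np.cos(el)  # front-back
    y = signal * np.sin(az) * np.cos(el)  # left-right
    z = signal * np.sin(el)               # up-down
    return np.stack([w, x, y, z])

# Encode a 440 Hz test tone at the azimuths used in the experiment:
# 0, 30, 60, and 90 degrees to either side of the video source.
sr = 48000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
for az in (0, 30, 60, 90, -30, -60, -90):
    bformat = foa_encode(tone, az)
    print(az, bformat.shape)
```

At 0° the Y (left-right) channel vanishes and at ±90° the X (front-back) channel vanishes, which is how the encoding steers the perceived direction of arrival.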