- Title
- Speech emotion recognition using deep neural networks
- Creator
- Bakhshi, Ali
- Relation
- University of Newcastle Research Higher Degree Thesis
- Resource Type
- thesis
- Date
- 2021
- Description
- Research Doctorate - Doctor of Philosophy (PhD)
- Description
Emotion recognition is an interdisciplinary research area spanning psychology, social science, signal processing, and image processing. From a machine learning point of view, emotion recognition is a challenging task due to the different modalities used to express emotions. In this Ph.D. thesis, various speech emotion recognition frameworks are proposed, most of which are designed around deep neural networks (DNNs) trained end to end. A combination of speech and physiological signals is used in a multimodal model to recognise real emotions through these modalities. As a first step, given the importance of deep neural networks in different applications, evolutionary algorithms are used to find the best architecture and hyperparameters for DNNs designed for image classification tasks. Speech signals are the main modality used for emotion recognition in this thesis, as speech is the simplest means of communication between humans and a rich source of emotional information. Hence, the first speech emotion recognition architecture is designed around a hierarchical classifier that uses cepstral coefficients derived from evolutionary filterbanks as the emotional features. The optimised classifiers outperformed classifiers based on conventional Mel Frequency Cepstral Coefficients in terms of overall emotion classification accuracy. Next, an end-to-end speech emotion recognition model is proposed that trains a moderately deep model from scratch on a relatively small training set. Using almost one third of the RECOLA dataset, the proposed deep model predicted the arousal and valence states comparably to models trained on the whole RECOLA dataset. A combination of the audio and physiological signals available in the RECOLA dataset is then used in an end-to-end deep multimodal system to predict labels for different emotional dimensions.
The results achieved with the multimodal model show improved predictions compared to the unimodal models, especially for the valence state. As a real-life application of emotion recognition, speech signals extracted from surveillance cameras are used to detect violence in real situations, and two DNN frameworks are proposed for this task, one based on raw speech signals and one on Mel-spectrograms of speech signals. Considering the lack of pre-trained deep models for speech signals, two speech-to-image transforms, the CyTex and PhaSion transforms, are proposed; these are the main contributions of this thesis. The images generated by the CyTex and PhaSion transforms can be used as inputs to pre-trained image-based DNN models that have shown promising performance in various applications. Both transforms are reversible, computationally efficient, and lossless, which ensures that no emotion-related features of the speech signal are discarded during the speech-to-image transformation. Using CyTex and PhaSion images with pre-trained DNN models, promising emotion classification results are achieved on two popular emotion datasets, EmoDB and IEMOCAP.
- Subject
- speech emotion recognition; multimodal emotion recognition; deep neural networks; evolutionary algorithms
- Identifier
- http://hdl.handle.net/1959.13/1430839
- Identifier
- uon:38885
- Rights
- Copyright 2021 Ali Bakhshi
- Language
- eng
- Full Text
| File | Description | Size | Format |
|---|---|---|---|
| ATTACHMENT01 | Thesis | 3 MB | Adobe Acrobat PDF |
| ATTACHMENT02 | Abstract | 206 KB | Adobe Acrobat PDF |