Authors
Reda Elbarougy, Bagus Tris Atmaja, Masato Akagi
Publisher
Research Institute of Signal Processing, Japan
Journal
Journal of Signal Processing (ISSN:13426230)
Volume, Issue, Pages, Publication Date
Vol. 24, No. 6, pp. 229-235, 2020-11-01 (Released: 2020-11-01)
Number of References
23

Speech and visual information are the dominant modalities through which humans perceive emotion. A method for recognizing human emotion from these modalities is proposed that combines feature selection with long short-term memory (LSTM) neural networks. A feature selection method based on support vector regression is used to select the relevant features from among the thousands of features expanded from speech and video features via bag-of-X-words representations. The LSTM neural networks are then trained on the selected features and optimized separately for each emotion dimension. Instead of utterance-level emotion recognition, time-frame-based processing is performed to enable continuous emotion recognition using a database labeled at each time frame. Experimental results reveal that the system with feature selection predicts emotion dimensions for a single language more effectively than the baseline system without feature selection. Performance is measured by the concordance correlation coefficient averaged over the valence, arousal, and liking dimensions.
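The sketch below illustrates the shape of such a pipeline, assuming scikit-learn, NumPy, and Keras. The toy data, feature counts, sequence length, layer sizes, and the use of recursive feature elimination around a linear support vector regressor are illustrative assumptions, not the paper's actual configuration; the `ccc` function implements the concordance correlation coefficient used as the evaluation metric.

```python
# Minimal sketch of an SVR-based feature selection + LSTM pipeline for
# continuous (frame-level) emotion regression. Illustrative only; sizes
# and hyperparameters are assumptions, not the paper's configuration.
import numpy as np
from sklearn.svm import LinearSVR
from sklearn.feature_selection import RFE
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense

def ccc(y_true, y_pred):
    """Concordance correlation coefficient (Lin, 1989)."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)

# Toy data: n_frames time frames, n_feat bag-of-X-words features, and one
# continuous label per frame (e.g., the arousal dimension).
rng = np.random.default_rng(0)
n_frames, n_feat, n_keep, seq_len = 2000, 500, 64, 50
X = rng.normal(size=(n_frames, n_feat))
y = rng.normal(size=n_frames)

# 1) SVR-based feature selection: recursive feature elimination ranked by
#    the weights of a linear support vector regressor.
selector = RFE(LinearSVR(max_iter=5000), n_features_to_select=n_keep, step=0.2)
X_sel = selector.fit_transform(X, y)

# 2) Group consecutive frames into fixed-length sequences for the LSTM.
n_seq = n_frames // seq_len
X_seq = X_sel[: n_seq * seq_len].reshape(n_seq, seq_len, n_keep)
y_seq = y[: n_seq * seq_len].reshape(n_seq, seq_len, 1)

# 3) One LSTM regressor per emotion dimension; return_sequences=True keeps
#    a prediction at every time frame, enabling continuous recognition.
model = Sequential([
    Input(shape=(seq_len, n_keep)),
    LSTM(64, return_sequences=True),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_seq, y_seq, epochs=2, batch_size=8, verbose=0)

# Evaluate with the concordance correlation coefficient per frame.
y_hat = model.predict(X_seq, verbose=0).ravel()
print("CCC:", ccc(y_seq.ravel(), y_hat))
```

In a full system, one such model would be trained and optimized independently for each of the valence, arousal, and liking dimensions, and the reported score would be the CCC averaged over the three.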