Authors
Reda Elbarougy, Bagus Tris Atmaja, Masato Akagi
Publisher
Research Institute of Signal Processing, Japan
Journal
Journal of Signal Processing (ISSN: 1342-6230)
Volume, issue, pages / publication date
vol.24, no.6, pp.229-235, 2020-11-01 (Released: 2020-11-01)
Number of references
23

Speech and visual information are the dominant modalities through which humans perceive emotion. A method of recognizing human emotion from these modalities is proposed that uses feature selection and long short-term memory (LSTM) neural networks. A feature selection method based on support vector regression selects the relevant features from among thousands of features extended from speech and video features via bag-of-X-words. The LSTM networks are then trained on the selected features and optimized separately for each emotion dimension. Instead of utterance-level emotion recognition, time-frame-based processing is performed to enable continuous emotion recognition using a database labeled for each time frame. Experimental results reveal that a system with feature selection predicts emotion dimensions for a single language more effectively than the baseline system without feature selection. Performance is measured by the concordance correlation coefficient averaged over the valence, arousal, and liking dimensions.
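The evaluation metric named in the abstract can be illustrated with a minimal sketch. It assumes the standard definition of the concordance correlation coefficient (CCC), 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2), with population statistics. The function names `ccc` and `mean_ccc` and the dictionary layout are hypothetical and are not taken from the paper, whose exact evaluation code is not given here.

```python
from statistics import fmean


def ccc(pred, gold):
    """Concordance correlation coefficient of two equal-length sequences.

    Assumes the standard definition with population (biased) variance
    and covariance; this is an illustration, not the authors' code.
    """
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_p, mean_g = fmean(pred), fmean(gold)
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_g = fmean([(g - mean_g) ** 2 for g in gold])
    cov = fmean([(p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)])
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for two identical constant sequences")
    return 2 * cov / denom


def mean_ccc(pred_by_dim, gold_by_dim):
    """Average the per-dimension CCC over all emotion dimensions.

    Both arguments map a dimension name (e.g. "valence") to a sequence
    of frame-level values; the key names are assumptions.
    """
    return fmean([ccc(pred_by_dim[d], gold_by_dim[d]) for d in gold_by_dim])
```

Unlike the Pearson correlation, the CCC also penalizes a constant offset between prediction and label: `ccc([1, 2, 3], [2, 3, 4])` is 4/7 even though the two sequences are perfectly correlated.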
Authors
Masato Akagi, Taro Ienaga
Publisher
Acoustical Society of Japan
Journal
Journal of the Acoustical Society of Japan (E) (ISSN: 0388-2861)
Volume, issue, pages / publication date
vol.18, no.2, pp.73-80, 1997 (Released: 2011-02-17)
Number of references
7
Number of citations
2 9

Speaker individualities in fundamental frequency (F0) contours are investigated through analyses of several speakers' uttered speech and psychoacoustic experiments. The analyses are performed to extract significant physical characteristics of F0 by using Fujisaki and Hirose's analysis method and the F-ratio of each physical characteristic. The experiments are performed to clarify the relationship between these physical characteristics and the perception of speakers' speech. The stimuli used in the experiments are re-synthesized with manipulated F0 contours and spectral envelopes averaged over all speakers by using the Log Magnitude Approximation analysis-synthesis system. The analysis and experimental results indicate that (1) there is speaker individuality in the F0 contours, (2) some specific parameters related to the dynamics of F0 contours carry many speaker individuality features, and speaker individuality can be controlled by manipulating these parameters, and (3) although there are speaker individuality features in the time-averaged F0, they contribute less to speaker identification than the dynamics of the F0 contours.
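As a rough illustration of the F-ratio mentioned in the abstract, the sketch below assumes the definition commonly used in speaker recognition: the variance of the per-speaker means divided by the average within-speaker variance. Whether the paper uses exactly this form is not stated here, and the function name `f_ratio` and the input layout are hypothetical.

```python
from statistics import fmean, pvariance


def f_ratio(values_by_speaker):
    """F-ratio of one physical characteristic across speakers.

    values_by_speaker maps a speaker ID to that speaker's measurements
    of the characteristic (hypothetical layout). Assumed definition:
    variance of speaker means / mean of within-speaker variances.
    """
    groups = [list(v) for v in values_by_speaker.values()]
    if len(groups) < 2 or any(len(g) < 2 for g in groups):
        raise ValueError("need at least two speakers with two values each")
    between = pvariance([fmean(g) for g in groups])  # spread across speakers
    within = fmean([pvariance(g) for g in groups])   # spread inside a speaker
    if within == 0:
        raise ValueError("within-speaker variance is zero; F-ratio undefined")
    return between / within
```

Under this definition, a large value means the characteristic varies more between speakers than within a speaker, which is what makes it a candidate carrier of speaker individuality.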
Authors
Jianwu Dang, Aijun Li, Donna Erickson, Atsuo Suemitsu, Masato Akagi, Kyoko Sakuraba, Nobuaki Minematsu, Keikichi Hirose
Publisher
Acoustical Society of Japan
Journal
Acoustical Science and Technology (ISSN: 1346-3969)
Volume, issue, pages / publication date
vol.31, no.6, pp.394-402, 2010-11-01 (Released: 2010-11-01)
Number of references
16
Number of citations
1 11

In this study, we conducted a comparative experiment on emotion perception among different cultures. Emotional components were perceived by subjects from Japan, the United States and China, none of whom had experience living abroad. An emotional speech database without linguistic information was used in this study and evaluated using three- and/or six-emotional dimensions. Principal component analysis (PCA) indicates that the common factors could explain about 60% of the variance of the data among the three cultures by using a three-emotion description and about 50% of the variance between Japanese and Chinese cultures by using a six-emotion description. The effects of the emotion categories on perception results were investigated. The emotions of anger, joy and sadness (group 1) have consistent structures in PCA-based spaces when switching from three-emotion categories to six-emotion categories. Disgust, surprise, and fear (group 2) appeared as paired counterparts of anger, joy and sadness, respectively. When investigating the subspaces constructed by these two groups, the similarity between the two emotion groups was found to be fairly high in the two-dimensional space. The similarity is lower in three- or higher-dimensional spaces, but the difference is not significant. The results from this study suggest that a wide range of human emotions might fall into a small subspace of basic emotions.
Authors
Rieko Kubo, Masato Akagi, Reiko Akahane-Yamada
Publisher
Acoustical Society of Japan
Journal
Acoustical Science and Technology (ISSN: 1346-3969)
Volume, issue, pages / publication date
vol.36, no.5, pp.397-407, 2015 (Released: 2015-09-01)
Number of references
15

This study investigated the differences in first-language-based (L1-based) phonetic processing for second language (L2) phonemes among different age groups of adults. A speech-in-speech masking paradigm was utilized to examine the contribution of the L1-based processing. A phoneme identification task in one language was conducted in the presence or absence of a masker in the same or a different language. The degree of interference (i.e., the decrease in identification performance) was postulated to increase as the similarity of underlying processes for the target and masker increases. Experiment 1 was conducted to test the effectiveness of the paradigm. As expected, the interference increased as the similarity of underlying processes for the target and masker increased. Experiment 2 examined the perception of English /r/–/l/ and other phonetic contrasts by Japanese listeners in various adult age groups, to examine whether the degree of interference differs depending on the putative degrees of L1-based processing and on age. The results demonstrated such differences and showed that the L1-based processing can be estimated from the decrease in the identification performance. They also suggested that the perception of /r/–/l/ in the initial singleton and initial cluster positions was highly L1-based in older adults.