| Graduate Student | 呂健維 Lu, Chien-Wei |
|---|---|
| Thesis Title | 基於臉部及語音特徵之輕量化深度學習情感辨識系統 Lightweight Deep Learning Emotion Recognition System Based on Facial and Speech Features |
| Advisor | 呂成凱 Lu, Cheng-Kai |
| Oral Examination Committee | 呂成凱 Lu, Cheng-Kai; 林承鴻 Lin, Cheng-Hung; 連中岳 Lien, Chung-Yueh |
| Date of Oral Defense | 2024/07/15 |
| Degree | Master |
| Department | 電機工程學系 Department of Electrical Engineering |
| Year of Publication | 2024 |
| Academic Year of Graduation | 112 |
| Language | Chinese |
| Number of Pages | 86 |
| Keywords (Chinese) | 深度學習、雙模態情感識別、輕量化模型、卷積神經網路、陪伴型機器人 |
| Keywords (English) | Deep Learning, Bimodal Emotion Recognition, Lightweight Models, Convolutional Neural Networks, Companion Robots |
| Research Method | Experimental design |
| DOI URL | http://doi.org/10.6345/NTNU202401361 |
| Thesis Type | Academic thesis |
In response to the shortage of caregivers for the elderly caused by the aging population in recent years, this study proposes a lightweight emotion recognition model that integrates facial expressions and speech and can be deployed on a companion robot (Zenbo Junior II). Most recent human emotion recognition techniques are built on Convolutional Neural Networks (CNNs) and have achieved excellent results; however, these advanced techniques do not take computational cost into account, which makes them impractical on devices with limited computing power, such as companion robots. Therefore, this study adopts the lightweight GhostNet as the facial emotion recognition model and a lightweight One-Dimensional Convolutional Neural Network (1D-CNN) as the speech emotion recognition model, and integrates the predictions of the two modalities using the geometric mean. The proposed model achieves accuracies of 97.56% and 82.33% on the RAVDESS and CREMA-D datasets, respectively. While maintaining high accuracy, the number of parameters is compressed to 0.92M and the floating-point operations are reduced to 0.77G, tens of times fewer than the known state-of-the-art techniques. Finally, the model was deployed on Zenbo Junior II; by comparing the computational intensity of the model with that of the hardware, it was found that the model runs smoothly on this device, and the inference times of the facial and speech emotion recognition models are only 1500 ms and 12 ms, respectively.
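The geometric-mean fusion mentioned above can be illustrated with a short sketch. The snippet below is a minimal illustration, assuming both classifiers emit softmax probability vectors over the same set of emotion classes; the function name, variable names, and the six-class dummy values are hypothetical and are not taken from the thesis.

```python
import numpy as np

def geometric_mean_fusion(face_probs: np.ndarray, speech_probs: np.ndarray) -> np.ndarray:
    """Fuse two per-class probability vectors with an element-wise geometric mean.

    Illustrative sketch only: assumes the facial model (e.g., GhostNet) and the
    speech model (e.g., 1D-CNN) output probabilities over identical emotion classes.
    """
    fused = np.sqrt(face_probs * speech_probs)  # element-wise geometric mean of the two modalities
    return fused / fused.sum()                  # renormalize so the result is again a distribution

# Hypothetical example with six emotion classes and dummy model outputs.
face_probs = np.array([0.70, 0.10, 0.05, 0.05, 0.05, 0.05])
speech_probs = np.array([0.40, 0.30, 0.10, 0.10, 0.05, 0.05])
fused = geometric_mean_fusion(face_probs, speech_probs)
print(fused, fused.argmax())  # fused distribution and index of the predicted emotion
```

Compared with an arithmetic mean, the geometric mean penalizes class scores on which the two modalities disagree strongly, so a class must be supported by both the facial and the speech prediction to dominate the fused result.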