Adaptive Contextual Feature Fusion: Leveraging Human-Robot Interaction with Speech Emotion Recognition
Published in the IEEE 21st India Council International Conference (INDICON), 2024
Authors: Sougatamoy Biswas, Romala Mishra, Pratik Kumar Sahoo, Anup Nandy
Speech Emotion Recognition (SER) is essential in Human-Robot Interaction (HRI) because it enables robots to detect and respond to human emotions. However, existing SER systems struggle to capture the full range of emotional expression due to the complex interplay of speech features. This research introduces an Adaptive Contextual Feature Fusion (ACFF) technique that dynamically fuses a hybrid feature set, including the Mel-scaled spectrogram, Mel-Frequency Cepstral Coefficients (MFCCs), Zero-Crossing Rate (ZCR), and Root Mean Square Energy (RMSE), which together capture the spectral and temporal characteristics essential for accurate emotion recognition. A Convolutional Neural Network with Long Short-Term Memory (CNN-LSTM) architecture then learns spatial and temporal dependencies from the adaptively fused features. Evaluated on the publicly available RAVDESS emotional speech dataset, the proposed CNN-LSTM with ACFF and hybrid features achieves 75.45% accuracy, outperforming other state-of-the-art methods.
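To make the pipeline concrete, the sketch below shows how the four named features can be extracted frame-wise with librosa. The abstract does not detail the ACFF weighting mechanism, so plain frame-wise concatenation stands in for the fusion step here; the function name and parameters (extract_hybrid_features, n_mfcc=40, n_mels=128) are illustrative assumptions rather than the paper's configuration.

import numpy as np
import librosa

def extract_hybrid_features(path, sr=22050, n_mfcc=40, n_mels=128):
    # Load audio and compute the four frame-level features named in the
    # abstract; default hop/frame lengths keep frame counts aligned.
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))  # (n_mels, T)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)          # (n_mfcc, T)
    zcr = librosa.feature.zero_crossing_rate(y)                     # (1, T)
    rmse = librosa.feature.rms(y=y)                                 # (1, T)
    # The paper's adaptive fusion is not described in the abstract, so
    # simple frame-wise stacking stands in for ACFF in this sketch.
    return np.vstack([mel, mfcc, zcr, rmse]).T                      # (T, D)

A CNN-LSTM over the resulting (frames x features) matrix could then look like the following minimal Keras sketch; the layer sizes and dropout rate are assumptions, not the paper's reported architecture, while the eight-way output matches the eight emotion classes defined in RAVDESS.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_lstm(n_frames, n_features, n_classes=8):
    # 1D convolutions learn local spectral patterns along the time axis;
    # the LSTM then models longer-range temporal dependencies.
    model = models.Sequential([
        layers.Input(shape=(n_frames, n_features)),
        layers.Conv1D(64, 5, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 5, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.LSTM(128),
        layers.Dropout(0.3),
        # RAVDESS labels eight emotions (neutral, calm, happy, sad,
        # angry, fearful, disgust, surprised).
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model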