Learning multi-scale features for speech emotion recognition with connection attention mechanism
Speech emotion recognition (SER) has become a crucial topic in the field of human–computer interaction. Feature representation plays an important role in SER, but it still poses many challenges, such as the difficulty of predicting which features are most effective for SER and the cultural differences in emotion expression. Most previous studies use a single type of feature for the recognition task or perform early fusion of several features. However, a single type of feature cannot adequately reflect the emotions in speech signals, and because different features carry different information, direct fusion cannot combine their respective advantages. To overcome these challenges, this paper proposes a parallel network for multi-scale SER based on a connection attention mechanism (AMSNet). AMSNet fuses fine-grained frame-level handcrafted features with coarse-grained utterance-level deep features. It also adopts different feature extraction modules for the temporal and spatial characteristics of speech signals, which enriches the features and improves feature characterization. The network consists of a frame-level representation learning module (FRLM) based on the temporal structure and an utterance-level representation learning module (URLM) based on the global structure. An improved attention-based long short-term memory (LSTM) network is introduced into FRLM to focus on the frames that contribute most to the final emotion recognition result, while URLM uses a convolutional neural network with a squeeze-and-excitation block (SCNN) to extract deep features. The connection attention mechanism then fuses these features by applying a different weight to each of them. Extensive experiments on the IEMOCAP and EmoDB datasets demonstrate the effectiveness and superior performance of AMSNet. The code will be publicly available at https://codeocean.com/capsule/8636967/tree/v1.
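The abstract describes the architecture only at a high level: a frame-level branch (FRLM) that runs an attention-based LSTM over handcrafted frame features, an utterance-level branch (URLM) built from a CNN with a squeeze-and-excitation block (SCNN), and a connection attention mechanism that weights the two resulting representations before classification. The record gives no layer sizes or fusion formula, so the PyTorch sketch below only illustrates that general structure; the module names, dimensions, and softmax-based weighting are assumptions, not the authors' implementation (their code is at the CodeOcean link above).

```python
import torch
import torch.nn as nn

class FrameBranch(nn.Module):
    """Frame-level branch (FRLM-like): BiLSTM over handcrafted frame features,
    followed by additive attention pooling over time. Sizes are illustrative."""
    def __init__(self, feat_dim=78, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)

    def forward(self, x):                        # x: (batch, frames, feat_dim)
        h, _ = self.lstm(x)                      # (batch, frames, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)   # per-frame attention weights
        return (w * h).sum(dim=1)                # attention-pooled (batch, 2*hidden)

class UtteranceBranch(nn.Module):
    """Utterance-level branch (URLM-like): small CNN over a spectrogram with a
    squeeze-and-excitation block, then global average pooling."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_dim, 3, padding=1), nn.ReLU())
        self.se = nn.Sequential(                 # squeeze-and-excitation gate
            nn.Linear(out_dim, out_dim // 16), nn.ReLU(),
            nn.Linear(out_dim // 16, out_dim), nn.Sigmoid())

    def forward(self, spec):                     # spec: (batch, 1, freq, frames)
        z = self.conv(spec)                      # (batch, out_dim, freq, frames)
        s = z.mean(dim=(2, 3))                   # squeeze: global average pool
        z = z * self.se(s)[:, :, None, None]     # excite: channel re-weighting
        return z.mean(dim=(2, 3))                # (batch, out_dim)

class AMSNetSketch(nn.Module):
    """Connection-attention fusion of the two branches, then classification."""
    def __init__(self, n_classes=4, frame_dim=256, utt_dim=256, fused=256):
        super().__init__()
        self.frame_branch = FrameBranch()
        self.utt_branch = UtteranceBranch(utt_dim)
        self.proj_f = nn.Linear(frame_dim, fused)
        self.proj_u = nn.Linear(utt_dim, fused)
        self.score = nn.Linear(fused, 1)         # scores each projected feature
        self.cls = nn.Linear(fused, n_classes)

    def forward(self, frame_feats, spec):
        f = torch.tanh(self.proj_f(self.frame_branch(frame_feats)))
        u = torch.tanh(self.proj_u(self.utt_branch(spec)))
        # assumed form of the connection attention: softmax over the two scores
        a = torch.softmax(torch.cat([self.score(f), self.score(u)], dim=1), dim=1)
        fused = a[:, :1] * f + a[:, 1:] * u      # weighted sum of both features
        return self.cls(fused)                   # emotion class logits
```

For example, `AMSNetSketch()(torch.randn(2, 300, 78), torch.randn(2, 1, 64, 300))` returns a `(2, 4)` logit tensor; the point of the sketch is simply that the two feature streams are weighted adaptively rather than concatenated directly.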
Detailed description

Author: Chen, Zengzhao [author]
Format: E-article
Language: English
Published: 2023 (transfer abstract)
Subjects: Connection attention mechanism; Features fusion; Utterance-level features; Frame-level features; Speech emotion recognition
Parent work: Contained in: Do denture processing techniques affect the mechanical properties of denture teeth? - Clements, Jody L. ELSEVIER, 2017, an international journal, Amsterdam [u.a.]
Parent work: volume:214 ; year:2023 ; day:15 ; month:03 ; pages:0
Links:
DOI / URN: 10.1016/j.eswa.2022.118943
Catalog ID: ELV05978976X
LEADER | 01000caa a22002652 4500 | ||
001 | ELV05978976X | ||
003 | DE-627 | ||
005 | 20230626053951.0 | ||
007 | cr uuu---uuuuu | ||
008 | 221219s2023 xx |||||o 00| ||eng c | ||
024 | 7 | |a 10.1016/j.eswa.2022.118943 |2 doi | |
028 | 5 | 2 | |a /cbs_pica/cbs_olc/import_discovery/elsevier/einzuspielen/GBV00000000001986.pica |
035 | |a (DE-627)ELV05978976X | ||
035 | |a (ELSEVIER)S0957-4174(22)01961-3 | ||
040 | |a DE-627 |b ger |c DE-627 |e rakwb | ||
041 | |a eng | ||
082 | 0 | 4 | |a 610 |q VZ |
084 | |a 44.96 |2 bkl | ||
100 | 1 | |a Chen, Zengzhao |e verfasserin |4 aut | |
245 | 1 | 0 | |a Learning multi-scale features for speech emotion recognition with connection attention mechanism |
264 | 1 | |c 2023transfer abstract | |
336 | |a nicht spezifiziert |b zzz |2 rdacontent | ||
337 | |a nicht spezifiziert |b z |2 rdamedia | ||
338 | |a nicht spezifiziert |b zu |2 rdacarrier | ||
520 | |a Speech emotion recognition (SER) has become a crucial topic in the field of human–computer interaction. Feature representation plays an important role in SER, but it still poses many challenges, such as the difficulty of predicting which features are most effective for SER and the cultural differences in emotion expression. Most previous studies use a single type of feature for the recognition task or perform early fusion of several features. However, a single type of feature cannot adequately reflect the emotions in speech signals, and because different features carry different information, direct fusion cannot combine their respective advantages. To overcome these challenges, this paper proposes a parallel network for multi-scale SER based on a connection attention mechanism (AMSNet). AMSNet fuses fine-grained frame-level handcrafted features with coarse-grained utterance-level deep features. It also adopts different feature extraction modules for the temporal and spatial characteristics of speech signals, which enriches the features and improves feature characterization. The network consists of a frame-level representation learning module (FRLM) based on the temporal structure and an utterance-level representation learning module (URLM) based on the global structure. An improved attention-based long short-term memory (LSTM) network is introduced into FRLM to focus on the frames that contribute most to the final emotion recognition result, while URLM uses a convolutional neural network with a squeeze-and-excitation block (SCNN) to extract deep features. The connection attention mechanism then fuses these features by applying a different weight to each of them. Extensive experiments on the IEMOCAP and EmoDB datasets demonstrate the effectiveness and superior performance of AMSNet. The code will be publicly available at https://codeocean.com/capsule/8636967/tree/v1. | ||
650 | 7 | |a Connection attention mechanism |2 Elsevier | |
650 | 7 | |a Features fusion |2 Elsevier | |
650 | 7 | |a Utterance-level features |2 Elsevier | |
650 | 7 | |a Frame-level features |2 Elsevier | |
650 | 7 | |a Speech emotion recognition |2 Elsevier | |
700 | 1 | |a Li, Jiawen |4 oth | |
700 | 1 | |a Liu, Hai |4 oth | |
700 | 1 | |a Wang, Xuyang |4 oth | |
700 | 1 | |a Wang, Hu |4 oth | |
700 | 1 | |a Zheng, Qiuyu |4 oth | |
773 | 0 | 8 | |i Enthalten in |n Elsevier Science |a Clements, Jody L. ELSEVIER |t Do denture processing techniques affect the mechanical properties of denture teeth? |d 2017 |d an international journal |g Amsterdam [u.a.] |w (DE-627)ELV000222070 |
773 | 1 | 8 | |g volume:214 |g year:2023 |g day:15 |g month:03 |g pages:0 |
856 | 4 | 0 | |u https://doi.org/10.1016/j.eswa.2022.118943 |3 Volltext |
912 | |a GBV_USEFLAG_U | ||
912 | |a GBV_ELV | ||
912 | |a SYSFLAG_U | ||
912 | |a SSG-OLC-PHA | ||
936 | b | k | |a 44.96 |j Zahnmedizin |q VZ |
951 | |a AR | ||
952 | |d 214 |j 2023 |b 15 |c 0315 |h 0 |
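The 520 abstract above (and the "Frame-level features" subject heading) refers to fine-grained handcrafted frame-level features, but the record does not say which acoustic descriptors are extracted. Purely as a hedged illustration, the sketch below builds a per-frame feature matrix with librosa, using MFCCs, RMS energy, and zero-crossing rate as placeholder choices; the feature set, frame length, and normalisation actually used by AMSNet may differ, and librosa itself is only an assumed tool here.

```python
import numpy as np
import librosa

def frame_level_features(wav_path, sr=16000, n_mfcc=13, hop=512):
    """Return a (frames, dims) matrix of per-frame handcrafted features.

    The concrete feature set used by AMSNet is not listed in this record;
    MFCCs, RMS energy, and zero-crossing rate are placeholder choices.
    """
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop)  # (n_mfcc, T)
    rms = librosa.feature.rms(y=y, hop_length=hop)                          # (1, T)
    zcr = librosa.feature.zero_crossing_rate(y, hop_length=hop)             # (1, T)
    feats = np.vstack([mfcc, rms, zcr]).T                                   # (T, n_mfcc + 2)
    # per-utterance z-normalisation, an assumed (but common) preprocessing step
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)
```

A matrix built this way (e.g. `frame_level_features('utterance.wav')`, where the path is hypothetical) is the kind of input the frame-level branch in the earlier sketch would consume, while the utterance-level branch works from a spectrogram of the same signal.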