Token replacement-based data augmentation methods for hate speech detection
Abstract Hate speech detection mostly involves the use of text data. This data, usually sourced from various social media platforms, have been known to be plagued with numerous issues that result in a reduction of its quality and hence, the quality of the trained models. Some of these issues are the...
Ausführliche Beschreibung
Autor*in: |
Madukwe, Kosisochukwu Judith [verfasserIn] |
---|
Format: |
E-Artikel |
---|---|
Sprache: |
Englisch |
Erschienen: |
2022 |
---|
Schlagwörter: |
---|
Anmerkung: |
© The Author(s) 2022 |
---|
Übergeordnetes Werk: |
Enthalten in: World wide web - Dordrecht [u.a.] : Springer Science + Business Media B.V, 1998, 25(2022), 3 vom: 01. März, Seite 1129-1150 |
---|---|
Übergeordnetes Werk: |
volume:25 ; year:2022 ; number:3 ; day:01 ; month:03 ; pages:1129-1150 |
Links: |
---|
DOI / URN: |
10.1007/s11280-022-01025-2 |
---|
Katalog-ID: |
SPR046971947 |
---|
LEADER | 01000caa a22002652 4500 | ||
---|---|---|---|
001 | SPR046971947 | ||
003 | DE-627 | ||
005 | 20230507180103.0 | ||
007 | cr uuu---uuuuu | ||
008 | 220512s2022 xx |||||o 00| ||eng c | ||
024 | 7 | |a 10.1007/s11280-022-01025-2 |2 doi | |
035 | |a (DE-627)SPR046971947 | ||
035 | |a (SPR)s11280-022-01025-2-e | ||
040 | |a DE-627 |b ger |c DE-627 |e rakwb | ||
041 | |a eng | ||
100 | 1 | |a Madukwe, Kosisochukwu Judith |e verfasserin |0 (orcid)0000-0003-1641-6767 |4 aut | |
245 | 1 | 0 | |a Token replacement-based data augmentation methods for hate speech detection |
264 | 1 | |c 2022 | |
336 | |a Text |b txt |2 rdacontent | ||
337 | |a Computermedien |b c |2 rdamedia | ||
338 | |a Online-Ressource |b cr |2 rdacarrier | ||
500 | |a © The Author(s) 2022 | ||
520 | |a Abstract Hate speech detection mostly involves the use of text data. This data, usually sourced from various social media platforms, have been known to be plagued with numerous issues that result in a reduction of its quality and hence, the quality of the trained models. Some of these issues are the lack of diversity and the diminutive class of interest in the dataset which results in overfitted models that do not generalize well on other or newly collected data. The different ways of handling these issues include augmenting the data with diverse samples, engineering non-redundant features or designing robust classification models. In this study, the focus is on the data augmentation aspect. Data augmentation is a popular method for improving the quality of existing datasets by generating synthetic samples that mimic the distribution of the original samples. There is a lack of extensive studies on how hate speech texts respond to varying textual data augmentation techniques and methods. Specifically, we provide further insight into the token replacement method of textual data augmentation by performing empirical studies that investigate which embedding method(s) is a robust source of synonym for replacement process, what effective method(s) can be used to select words to be replaced, and how to confirm if the label within each class is preserved. Our proposed methods, validated on two commonly used hate speech datasets affected by a known lack of diversity and diminutive class of interest issues, significantly improve classification performance and provides insights into token replacement methods. | ||
650 | 4 | |a Hate speech data |7 (dpeaa)DE-He213 | |
650 | 4 | |a Data augmentation |7 (dpeaa)DE-He213 | |
650 | 4 | |a Token substitution |7 (dpeaa)DE-He213 | |
650 | 4 | |a Word replacement |7 (dpeaa)DE-He213 | |
650 | 4 | |a Data generation |7 (dpeaa)DE-He213 | |
650 | 4 | |a Text data |7 (dpeaa)DE-He213 | |
700 | 1 | |a Gao, Xiaoying |4 aut | |
700 | 1 | |a Xue, Bing |4 aut | |
773 | 0 | 8 | |i Enthalten in |t World wide web |d Dordrecht [u.a.] : Springer Science + Business Media B.V, 1998 |g 25(2022), 3 vom: 01. März, Seite 1129-1150 |w (DE-627)319530604 |w (DE-600)2025142-7 |x 1573-1413 |7 nnns |
773 | 1 | 8 | |g volume:25 |g year:2022 |g number:3 |g day:01 |g month:03 |g pages:1129-1150 |
856 | 4 | 0 | |u https://dx.doi.org/10.1007/s11280-022-01025-2 |z kostenfrei |3 Volltext |
912 | |a GBV_USEFLAG_A | ||
912 | |a SYSFLAG_A | ||
912 | |a GBV_SPRINGER | ||
912 | |a GBV_ILN_11 | ||
912 | |a GBV_ILN_20 | ||
912 | |a GBV_ILN_22 | ||
912 | |a GBV_ILN_23 | ||
912 | |a GBV_ILN_24 | ||
912 | |a GBV_ILN_31 | ||
912 | |a GBV_ILN_32 | ||
912 | |a GBV_ILN_39 | ||
912 | |a GBV_ILN_40 | ||
912 | |a GBV_ILN_60 | ||
912 | |a GBV_ILN_62 | ||
912 | |a GBV_ILN_63 | ||
912 | |a GBV_ILN_69 | ||
912 | |a GBV_ILN_70 | ||
912 | |a GBV_ILN_73 | ||
912 | |a GBV_ILN_74 | ||
912 | |a GBV_ILN_90 | ||
912 | |a GBV_ILN_95 | ||
912 | |a GBV_ILN_100 | ||
912 | |a GBV_ILN_101 | ||
912 | |a GBV_ILN_105 | ||
912 | |a GBV_ILN_110 | ||
912 | |a GBV_ILN_120 | ||
912 | |a GBV_ILN_138 | ||
912 | |a GBV_ILN_150 | ||
912 | |a GBV_ILN_151 | ||
912 | |a GBV_ILN_152 | ||
912 | |a GBV_ILN_161 | ||
912 | |a GBV_ILN_170 | ||
912 | |a GBV_ILN_171 | ||
912 | |a GBV_ILN_187 | ||
912 | |a GBV_ILN_213 | ||
912 | |a GBV_ILN_224 | ||
912 | |a GBV_ILN_230 | ||
912 | |a GBV_ILN_250 | ||
912 | |a GBV_ILN_281 | ||
912 | |a GBV_ILN_285 | ||
912 | |a GBV_ILN_293 | ||
912 | |a GBV_ILN_370 | ||
912 | |a GBV_ILN_602 | ||
912 | |a GBV_ILN_636 | ||
912 | |a GBV_ILN_702 | ||
912 | |a GBV_ILN_2001 | ||
912 | |a GBV_ILN_2003 | ||
912 | |a GBV_ILN_2004 | ||
912 | |a GBV_ILN_2005 | ||
912 | |a GBV_ILN_2006 | ||
912 | |a GBV_ILN_2007 | ||
912 | |a GBV_ILN_2008 | ||
912 | |a GBV_ILN_2009 | ||
912 | |a GBV_ILN_2010 | ||
912 | |a GBV_ILN_2011 | ||
912 | |a GBV_ILN_2014 | ||
912 | |a GBV_ILN_2015 | ||
912 | |a GBV_ILN_2020 | ||
912 | |a GBV_ILN_2021 | ||
912 | |a GBV_ILN_2025 | ||
912 | |a GBV_ILN_2026 | ||
912 | |a GBV_ILN_2027 | ||
912 | |a GBV_ILN_2031 | ||
912 | |a GBV_ILN_2034 | ||
912 | |a GBV_ILN_2037 | ||
912 | |a GBV_ILN_2038 | ||
912 | |a GBV_ILN_2039 | ||
912 | |a GBV_ILN_2044 | ||
912 | |a GBV_ILN_2048 | ||
912 | |a GBV_ILN_2049 | ||
912 | |a GBV_ILN_2050 | ||
912 | |a GBV_ILN_2055 | ||
912 | |a GBV_ILN_2056 | ||
912 | |a GBV_ILN_2057 | ||
912 | |a GBV_ILN_2059 | ||
912 | |a GBV_ILN_2061 | ||
912 | |a GBV_ILN_2064 | ||
912 | |a GBV_ILN_2065 | ||
912 | |a GBV_ILN_2068 | ||
912 | |a GBV_ILN_2088 | ||
912 | |a GBV_ILN_2093 | ||
912 | |a GBV_ILN_2106 | ||
912 | |a GBV_ILN_2107 | ||
912 | |a GBV_ILN_2108 | ||
912 | |a GBV_ILN_2110 | ||
912 | |a GBV_ILN_2111 | ||
912 | |a GBV_ILN_2112 | ||
912 | |a GBV_ILN_2113 | ||
912 | |a GBV_ILN_2118 | ||
912 | |a GBV_ILN_2122 | ||
912 | |a GBV_ILN_2129 | ||
912 | |a GBV_ILN_2143 | ||
912 | |a GBV_ILN_2144 | ||
912 | |a GBV_ILN_2147 | ||
912 | |a GBV_ILN_2148 | ||
912 | |a GBV_ILN_2152 | ||
912 | |a GBV_ILN_2153 | ||
912 | |a GBV_ILN_2188 | ||
912 | |a GBV_ILN_2190 | ||
912 | |a GBV_ILN_2232 | ||
912 | |a GBV_ILN_2336 | ||
912 | |a GBV_ILN_2446 | ||
912 | |a GBV_ILN_2470 | ||
912 | |a GBV_ILN_2472 | ||
912 | |a GBV_ILN_2507 | ||
912 | |a GBV_ILN_2522 | ||
912 | |a GBV_ILN_2548 | ||
912 | |a GBV_ILN_4035 | ||
912 | |a GBV_ILN_4037 | ||
912 | |a GBV_ILN_4046 | ||
912 | |a GBV_ILN_4112 | ||
912 | |a GBV_ILN_4125 | ||
912 | |a GBV_ILN_4126 | ||
912 | |a GBV_ILN_4242 | ||
912 | |a GBV_ILN_4246 | ||
912 | |a GBV_ILN_4249 | ||
912 | |a GBV_ILN_4251 | ||
912 | |a GBV_ILN_4305 | ||
912 | |a GBV_ILN_4306 | ||
912 | |a GBV_ILN_4307 | ||
912 | |a GBV_ILN_4313 | ||
912 | |a GBV_ILN_4322 | ||
912 | |a GBV_ILN_4323 | ||
912 | |a GBV_ILN_4324 | ||
912 | |a GBV_ILN_4325 | ||
912 | |a GBV_ILN_4326 | ||
912 | |a GBV_ILN_4328 | ||
912 | |a GBV_ILN_4333 | ||
912 | |a GBV_ILN_4334 | ||
912 | |a GBV_ILN_4335 | ||
912 | |a GBV_ILN_4336 | ||
912 | |a GBV_ILN_4338 | ||
912 | |a GBV_ILN_4393 | ||
912 | |a GBV_ILN_4700 | ||
951 | |a AR | ||
952 | |d 25 |j 2022 |e 3 |b 01 |c 03 |h 1129-1150 |
author_variant |
k j m kj kjm x g xg b x bx |
---|---|
matchkey_str |
article:15731413:2022----::oerpaeetaedtagettomtosoh |
hierarchy_sort_str |
2022 |
publishDate |
2022 |
allfields |
10.1007/s11280-022-01025-2 doi (DE-627)SPR046971947 (SPR)s11280-022-01025-2-e DE-627 ger DE-627 rakwb eng Madukwe, Kosisochukwu Judith verfasserin (orcid)0000-0003-1641-6767 aut Token replacement-based data augmentation methods for hate speech detection 2022 Text txt rdacontent Computermedien c rdamedia Online-Ressource cr rdacarrier © The Author(s) 2022 Abstract Hate speech detection mostly involves the use of text data. This data, usually sourced from various social media platforms, have been known to be plagued with numerous issues that result in a reduction of its quality and hence, the quality of the trained models. Some of these issues are the lack of diversity and the diminutive class of interest in the dataset which results in overfitted models that do not generalize well on other or newly collected data. The different ways of handling these issues include augmenting the data with diverse samples, engineering non-redundant features or designing robust classification models. In this study, the focus is on the data augmentation aspect. Data augmentation is a popular method for improving the quality of existing datasets by generating synthetic samples that mimic the distribution of the original samples. There is a lack of extensive studies on how hate speech texts respond to varying textual data augmentation techniques and methods. Specifically, we provide further insight into the token replacement method of textual data augmentation by performing empirical studies that investigate which embedding method(s) is a robust source of synonym for replacement process, what effective method(s) can be used to select words to be replaced, and how to confirm if the label within each class is preserved. Our proposed methods, validated on two commonly used hate speech datasets affected by a known lack of diversity and diminutive class of interest issues, significantly improve classification performance and provides insights into token replacement methods. Hate speech data (dpeaa)DE-He213 Data augmentation (dpeaa)DE-He213 Token substitution (dpeaa)DE-He213 Word replacement (dpeaa)DE-He213 Data generation (dpeaa)DE-He213 Text data (dpeaa)DE-He213 Gao, Xiaoying aut Xue, Bing aut Enthalten in World wide web Dordrecht [u.a.] : Springer Science + Business Media B.V, 1998 25(2022), 3 vom: 01. März, Seite 1129-1150 (DE-627)319530604 (DE-600)2025142-7 1573-1413 nnns volume:25 year:2022 number:3 day:01 month:03 pages:1129-1150 https://dx.doi.org/10.1007/s11280-022-01025-2 kostenfrei Volltext GBV_USEFLAG_A SYSFLAG_A GBV_SPRINGER GBV_ILN_11 GBV_ILN_20 GBV_ILN_22 GBV_ILN_23 GBV_ILN_24 GBV_ILN_31 GBV_ILN_32 GBV_ILN_39 GBV_ILN_40 GBV_ILN_60 GBV_ILN_62 GBV_ILN_63 GBV_ILN_69 GBV_ILN_70 GBV_ILN_73 GBV_ILN_74 GBV_ILN_90 GBV_ILN_95 GBV_ILN_100 GBV_ILN_101 GBV_ILN_105 GBV_ILN_110 GBV_ILN_120 GBV_ILN_138 GBV_ILN_150 GBV_ILN_151 GBV_ILN_152 GBV_ILN_161 GBV_ILN_170 GBV_ILN_171 GBV_ILN_187 GBV_ILN_213 GBV_ILN_224 GBV_ILN_230 GBV_ILN_250 GBV_ILN_281 GBV_ILN_285 GBV_ILN_293 GBV_ILN_370 GBV_ILN_602 GBV_ILN_636 GBV_ILN_702 GBV_ILN_2001 GBV_ILN_2003 GBV_ILN_2004 GBV_ILN_2005 GBV_ILN_2006 GBV_ILN_2007 GBV_ILN_2008 GBV_ILN_2009 GBV_ILN_2010 GBV_ILN_2011 GBV_ILN_2014 GBV_ILN_2015 GBV_ILN_2020 GBV_ILN_2021 GBV_ILN_2025 GBV_ILN_2026 GBV_ILN_2027 GBV_ILN_2031 GBV_ILN_2034 GBV_ILN_2037 GBV_ILN_2038 GBV_ILN_2039 GBV_ILN_2044 GBV_ILN_2048 GBV_ILN_2049 GBV_ILN_2050 GBV_ILN_2055 GBV_ILN_2056 GBV_ILN_2057 GBV_ILN_2059 GBV_ILN_2061 GBV_ILN_2064 GBV_ILN_2065 GBV_ILN_2068 GBV_ILN_2088 GBV_ILN_2093 GBV_ILN_2106 GBV_ILN_2107 GBV_ILN_2108 GBV_ILN_2110 GBV_ILN_2111 GBV_ILN_2112 GBV_ILN_2113 GBV_ILN_2118 GBV_ILN_2122 GBV_ILN_2129 GBV_ILN_2143 GBV_ILN_2144 GBV_ILN_2147 GBV_ILN_2148 GBV_ILN_2152 GBV_ILN_2153 GBV_ILN_2188 GBV_ILN_2190 GBV_ILN_2232 GBV_ILN_2336 GBV_ILN_2446 GBV_ILN_2470 GBV_ILN_2472 GBV_ILN_2507 GBV_ILN_2522 GBV_ILN_2548 GBV_ILN_4035 GBV_ILN_4037 GBV_ILN_4046 GBV_ILN_4112 GBV_ILN_4125 GBV_ILN_4126 GBV_ILN_4242 GBV_ILN_4246 GBV_ILN_4249 GBV_ILN_4251 GBV_ILN_4305 GBV_ILN_4306 GBV_ILN_4307 GBV_ILN_4313 GBV_ILN_4322 GBV_ILN_4323 GBV_ILN_4324 GBV_ILN_4325 GBV_ILN_4326 GBV_ILN_4328 GBV_ILN_4333 GBV_ILN_4334 GBV_ILN_4335 GBV_ILN_4336 GBV_ILN_4338 GBV_ILN_4393 GBV_ILN_4700 AR 25 2022 3 01 03 1129-1150 |
spelling |
10.1007/s11280-022-01025-2 doi (DE-627)SPR046971947 (SPR)s11280-022-01025-2-e DE-627 ger DE-627 rakwb eng Madukwe, Kosisochukwu Judith verfasserin (orcid)0000-0003-1641-6767 aut Token replacement-based data augmentation methods for hate speech detection 2022 Text txt rdacontent Computermedien c rdamedia Online-Ressource cr rdacarrier © The Author(s) 2022 Abstract Hate speech detection mostly involves the use of text data. This data, usually sourced from various social media platforms, have been known to be plagued with numerous issues that result in a reduction of its quality and hence, the quality of the trained models. Some of these issues are the lack of diversity and the diminutive class of interest in the dataset which results in overfitted models that do not generalize well on other or newly collected data. The different ways of handling these issues include augmenting the data with diverse samples, engineering non-redundant features or designing robust classification models. In this study, the focus is on the data augmentation aspect. Data augmentation is a popular method for improving the quality of existing datasets by generating synthetic samples that mimic the distribution of the original samples. There is a lack of extensive studies on how hate speech texts respond to varying textual data augmentation techniques and methods. Specifically, we provide further insight into the token replacement method of textual data augmentation by performing empirical studies that investigate which embedding method(s) is a robust source of synonym for replacement process, what effective method(s) can be used to select words to be replaced, and how to confirm if the label within each class is preserved. Our proposed methods, validated on two commonly used hate speech datasets affected by a known lack of diversity and diminutive class of interest issues, significantly improve classification performance and provides insights into token replacement methods. Hate speech data (dpeaa)DE-He213 Data augmentation (dpeaa)DE-He213 Token substitution (dpeaa)DE-He213 Word replacement (dpeaa)DE-He213 Data generation (dpeaa)DE-He213 Text data (dpeaa)DE-He213 Gao, Xiaoying aut Xue, Bing aut Enthalten in World wide web Dordrecht [u.a.] : Springer Science + Business Media B.V, 1998 25(2022), 3 vom: 01. März, Seite 1129-1150 (DE-627)319530604 (DE-600)2025142-7 1573-1413 nnns volume:25 year:2022 number:3 day:01 month:03 pages:1129-1150 https://dx.doi.org/10.1007/s11280-022-01025-2 kostenfrei Volltext GBV_USEFLAG_A SYSFLAG_A GBV_SPRINGER GBV_ILN_11 GBV_ILN_20 GBV_ILN_22 GBV_ILN_23 GBV_ILN_24 GBV_ILN_31 GBV_ILN_32 GBV_ILN_39 GBV_ILN_40 GBV_ILN_60 GBV_ILN_62 GBV_ILN_63 GBV_ILN_69 GBV_ILN_70 GBV_ILN_73 GBV_ILN_74 GBV_ILN_90 GBV_ILN_95 GBV_ILN_100 GBV_ILN_101 GBV_ILN_105 GBV_ILN_110 GBV_ILN_120 GBV_ILN_138 GBV_ILN_150 GBV_ILN_151 GBV_ILN_152 GBV_ILN_161 GBV_ILN_170 GBV_ILN_171 GBV_ILN_187 GBV_ILN_213 GBV_ILN_224 GBV_ILN_230 GBV_ILN_250 GBV_ILN_281 GBV_ILN_285 GBV_ILN_293 GBV_ILN_370 GBV_ILN_602 GBV_ILN_636 GBV_ILN_702 GBV_ILN_2001 GBV_ILN_2003 GBV_ILN_2004 GBV_ILN_2005 GBV_ILN_2006 GBV_ILN_2007 GBV_ILN_2008 GBV_ILN_2009 GBV_ILN_2010 GBV_ILN_2011 GBV_ILN_2014 GBV_ILN_2015 GBV_ILN_2020 GBV_ILN_2021 GBV_ILN_2025 GBV_ILN_2026 GBV_ILN_2027 GBV_ILN_2031 GBV_ILN_2034 GBV_ILN_2037 GBV_ILN_2038 GBV_ILN_2039 GBV_ILN_2044 GBV_ILN_2048 GBV_ILN_2049 GBV_ILN_2050 GBV_ILN_2055 GBV_ILN_2056 GBV_ILN_2057 GBV_ILN_2059 GBV_ILN_2061 GBV_ILN_2064 GBV_ILN_2065 GBV_ILN_2068 GBV_ILN_2088 GBV_ILN_2093 GBV_ILN_2106 GBV_ILN_2107 GBV_ILN_2108 GBV_ILN_2110 GBV_ILN_2111 GBV_ILN_2112 GBV_ILN_2113 GBV_ILN_2118 GBV_ILN_2122 GBV_ILN_2129 GBV_ILN_2143 GBV_ILN_2144 GBV_ILN_2147 GBV_ILN_2148 GBV_ILN_2152 GBV_ILN_2153 GBV_ILN_2188 GBV_ILN_2190 GBV_ILN_2232 GBV_ILN_2336 GBV_ILN_2446 GBV_ILN_2470 GBV_ILN_2472 GBV_ILN_2507 GBV_ILN_2522 GBV_ILN_2548 GBV_ILN_4035 GBV_ILN_4037 GBV_ILN_4046 GBV_ILN_4112 GBV_ILN_4125 GBV_ILN_4126 GBV_ILN_4242 GBV_ILN_4246 GBV_ILN_4249 GBV_ILN_4251 GBV_ILN_4305 GBV_ILN_4306 GBV_ILN_4307 GBV_ILN_4313 GBV_ILN_4322 GBV_ILN_4323 GBV_ILN_4324 GBV_ILN_4325 GBV_ILN_4326 GBV_ILN_4328 GBV_ILN_4333 GBV_ILN_4334 GBV_ILN_4335 GBV_ILN_4336 GBV_ILN_4338 GBV_ILN_4393 GBV_ILN_4700 AR 25 2022 3 01 03 1129-1150 |
allfields_unstemmed |
10.1007/s11280-022-01025-2 doi (DE-627)SPR046971947 (SPR)s11280-022-01025-2-e DE-627 ger DE-627 rakwb eng Madukwe, Kosisochukwu Judith verfasserin (orcid)0000-0003-1641-6767 aut Token replacement-based data augmentation methods for hate speech detection 2022 Text txt rdacontent Computermedien c rdamedia Online-Ressource cr rdacarrier © The Author(s) 2022 Abstract Hate speech detection mostly involves the use of text data. This data, usually sourced from various social media platforms, have been known to be plagued with numerous issues that result in a reduction of its quality and hence, the quality of the trained models. Some of these issues are the lack of diversity and the diminutive class of interest in the dataset which results in overfitted models that do not generalize well on other or newly collected data. The different ways of handling these issues include augmenting the data with diverse samples, engineering non-redundant features or designing robust classification models. In this study, the focus is on the data augmentation aspect. Data augmentation is a popular method for improving the quality of existing datasets by generating synthetic samples that mimic the distribution of the original samples. There is a lack of extensive studies on how hate speech texts respond to varying textual data augmentation techniques and methods. Specifically, we provide further insight into the token replacement method of textual data augmentation by performing empirical studies that investigate which embedding method(s) is a robust source of synonym for replacement process, what effective method(s) can be used to select words to be replaced, and how to confirm if the label within each class is preserved. Our proposed methods, validated on two commonly used hate speech datasets affected by a known lack of diversity and diminutive class of interest issues, significantly improve classification performance and provides insights into token replacement methods. Hate speech data (dpeaa)DE-He213 Data augmentation (dpeaa)DE-He213 Token substitution (dpeaa)DE-He213 Word replacement (dpeaa)DE-He213 Data generation (dpeaa)DE-He213 Text data (dpeaa)DE-He213 Gao, Xiaoying aut Xue, Bing aut Enthalten in World wide web Dordrecht [u.a.] : Springer Science + Business Media B.V, 1998 25(2022), 3 vom: 01. März, Seite 1129-1150 (DE-627)319530604 (DE-600)2025142-7 1573-1413 nnns volume:25 year:2022 number:3 day:01 month:03 pages:1129-1150 https://dx.doi.org/10.1007/s11280-022-01025-2 kostenfrei Volltext GBV_USEFLAG_A SYSFLAG_A GBV_SPRINGER GBV_ILN_11 GBV_ILN_20 GBV_ILN_22 GBV_ILN_23 GBV_ILN_24 GBV_ILN_31 GBV_ILN_32 GBV_ILN_39 GBV_ILN_40 GBV_ILN_60 GBV_ILN_62 GBV_ILN_63 GBV_ILN_69 GBV_ILN_70 GBV_ILN_73 GBV_ILN_74 GBV_ILN_90 GBV_ILN_95 GBV_ILN_100 GBV_ILN_101 GBV_ILN_105 GBV_ILN_110 GBV_ILN_120 GBV_ILN_138 GBV_ILN_150 GBV_ILN_151 GBV_ILN_152 GBV_ILN_161 GBV_ILN_170 GBV_ILN_171 GBV_ILN_187 GBV_ILN_213 GBV_ILN_224 GBV_ILN_230 GBV_ILN_250 GBV_ILN_281 GBV_ILN_285 GBV_ILN_293 GBV_ILN_370 GBV_ILN_602 GBV_ILN_636 GBV_ILN_702 GBV_ILN_2001 GBV_ILN_2003 GBV_ILN_2004 GBV_ILN_2005 GBV_ILN_2006 GBV_ILN_2007 GBV_ILN_2008 GBV_ILN_2009 GBV_ILN_2010 GBV_ILN_2011 GBV_ILN_2014 GBV_ILN_2015 GBV_ILN_2020 GBV_ILN_2021 GBV_ILN_2025 GBV_ILN_2026 GBV_ILN_2027 GBV_ILN_2031 GBV_ILN_2034 GBV_ILN_2037 GBV_ILN_2038 GBV_ILN_2039 GBV_ILN_2044 GBV_ILN_2048 GBV_ILN_2049 GBV_ILN_2050 GBV_ILN_2055 GBV_ILN_2056 GBV_ILN_2057 GBV_ILN_2059 GBV_ILN_2061 GBV_ILN_2064 GBV_ILN_2065 GBV_ILN_2068 GBV_ILN_2088 GBV_ILN_2093 GBV_ILN_2106 GBV_ILN_2107 GBV_ILN_2108 GBV_ILN_2110 GBV_ILN_2111 GBV_ILN_2112 GBV_ILN_2113 GBV_ILN_2118 GBV_ILN_2122 GBV_ILN_2129 GBV_ILN_2143 GBV_ILN_2144 GBV_ILN_2147 GBV_ILN_2148 GBV_ILN_2152 GBV_ILN_2153 GBV_ILN_2188 GBV_ILN_2190 GBV_ILN_2232 GBV_ILN_2336 GBV_ILN_2446 GBV_ILN_2470 GBV_ILN_2472 GBV_ILN_2507 GBV_ILN_2522 GBV_ILN_2548 GBV_ILN_4035 GBV_ILN_4037 GBV_ILN_4046 GBV_ILN_4112 GBV_ILN_4125 GBV_ILN_4126 GBV_ILN_4242 GBV_ILN_4246 GBV_ILN_4249 GBV_ILN_4251 GBV_ILN_4305 GBV_ILN_4306 GBV_ILN_4307 GBV_ILN_4313 GBV_ILN_4322 GBV_ILN_4323 GBV_ILN_4324 GBV_ILN_4325 GBV_ILN_4326 GBV_ILN_4328 GBV_ILN_4333 GBV_ILN_4334 GBV_ILN_4335 GBV_ILN_4336 GBV_ILN_4338 GBV_ILN_4393 GBV_ILN_4700 AR 25 2022 3 01 03 1129-1150 |
allfieldsGer |
10.1007/s11280-022-01025-2 doi (DE-627)SPR046971947 (SPR)s11280-022-01025-2-e DE-627 ger DE-627 rakwb eng Madukwe, Kosisochukwu Judith verfasserin (orcid)0000-0003-1641-6767 aut Token replacement-based data augmentation methods for hate speech detection 2022 Text txt rdacontent Computermedien c rdamedia Online-Ressource cr rdacarrier © The Author(s) 2022 Abstract Hate speech detection mostly involves the use of text data. This data, usually sourced from various social media platforms, have been known to be plagued with numerous issues that result in a reduction of its quality and hence, the quality of the trained models. Some of these issues are the lack of diversity and the diminutive class of interest in the dataset which results in overfitted models that do not generalize well on other or newly collected data. The different ways of handling these issues include augmenting the data with diverse samples, engineering non-redundant features or designing robust classification models. In this study, the focus is on the data augmentation aspect. Data augmentation is a popular method for improving the quality of existing datasets by generating synthetic samples that mimic the distribution of the original samples. There is a lack of extensive studies on how hate speech texts respond to varying textual data augmentation techniques and methods. Specifically, we provide further insight into the token replacement method of textual data augmentation by performing empirical studies that investigate which embedding method(s) is a robust source of synonym for replacement process, what effective method(s) can be used to select words to be replaced, and how to confirm if the label within each class is preserved. Our proposed methods, validated on two commonly used hate speech datasets affected by a known lack of diversity and diminutive class of interest issues, significantly improve classification performance and provides insights into token replacement methods. Hate speech data (dpeaa)DE-He213 Data augmentation (dpeaa)DE-He213 Token substitution (dpeaa)DE-He213 Word replacement (dpeaa)DE-He213 Data generation (dpeaa)DE-He213 Text data (dpeaa)DE-He213 Gao, Xiaoying aut Xue, Bing aut Enthalten in World wide web Dordrecht [u.a.] : Springer Science + Business Media B.V, 1998 25(2022), 3 vom: 01. März, Seite 1129-1150 (DE-627)319530604 (DE-600)2025142-7 1573-1413 nnns volume:25 year:2022 number:3 day:01 month:03 pages:1129-1150 https://dx.doi.org/10.1007/s11280-022-01025-2 kostenfrei Volltext GBV_USEFLAG_A SYSFLAG_A GBV_SPRINGER GBV_ILN_11 GBV_ILN_20 GBV_ILN_22 GBV_ILN_23 GBV_ILN_24 GBV_ILN_31 GBV_ILN_32 GBV_ILN_39 GBV_ILN_40 GBV_ILN_60 GBV_ILN_62 GBV_ILN_63 GBV_ILN_69 GBV_ILN_70 GBV_ILN_73 GBV_ILN_74 GBV_ILN_90 GBV_ILN_95 GBV_ILN_100 GBV_ILN_101 GBV_ILN_105 GBV_ILN_110 GBV_ILN_120 GBV_ILN_138 GBV_ILN_150 GBV_ILN_151 GBV_ILN_152 GBV_ILN_161 GBV_ILN_170 GBV_ILN_171 GBV_ILN_187 GBV_ILN_213 GBV_ILN_224 GBV_ILN_230 GBV_ILN_250 GBV_ILN_281 GBV_ILN_285 GBV_ILN_293 GBV_ILN_370 GBV_ILN_602 GBV_ILN_636 GBV_ILN_702 GBV_ILN_2001 GBV_ILN_2003 GBV_ILN_2004 GBV_ILN_2005 GBV_ILN_2006 GBV_ILN_2007 GBV_ILN_2008 GBV_ILN_2009 GBV_ILN_2010 GBV_ILN_2011 GBV_ILN_2014 GBV_ILN_2015 GBV_ILN_2020 GBV_ILN_2021 GBV_ILN_2025 GBV_ILN_2026 GBV_ILN_2027 GBV_ILN_2031 GBV_ILN_2034 GBV_ILN_2037 GBV_ILN_2038 GBV_ILN_2039 GBV_ILN_2044 GBV_ILN_2048 GBV_ILN_2049 GBV_ILN_2050 GBV_ILN_2055 GBV_ILN_2056 GBV_ILN_2057 GBV_ILN_2059 GBV_ILN_2061 GBV_ILN_2064 GBV_ILN_2065 GBV_ILN_2068 GBV_ILN_2088 GBV_ILN_2093 GBV_ILN_2106 GBV_ILN_2107 GBV_ILN_2108 GBV_ILN_2110 GBV_ILN_2111 GBV_ILN_2112 GBV_ILN_2113 GBV_ILN_2118 GBV_ILN_2122 GBV_ILN_2129 GBV_ILN_2143 GBV_ILN_2144 GBV_ILN_2147 GBV_ILN_2148 GBV_ILN_2152 GBV_ILN_2153 GBV_ILN_2188 GBV_ILN_2190 GBV_ILN_2232 GBV_ILN_2336 GBV_ILN_2446 GBV_ILN_2470 GBV_ILN_2472 GBV_ILN_2507 GBV_ILN_2522 GBV_ILN_2548 GBV_ILN_4035 GBV_ILN_4037 GBV_ILN_4046 GBV_ILN_4112 GBV_ILN_4125 GBV_ILN_4126 GBV_ILN_4242 GBV_ILN_4246 GBV_ILN_4249 GBV_ILN_4251 GBV_ILN_4305 GBV_ILN_4306 GBV_ILN_4307 GBV_ILN_4313 GBV_ILN_4322 GBV_ILN_4323 GBV_ILN_4324 GBV_ILN_4325 GBV_ILN_4326 GBV_ILN_4328 GBV_ILN_4333 GBV_ILN_4334 GBV_ILN_4335 GBV_ILN_4336 GBV_ILN_4338 GBV_ILN_4393 GBV_ILN_4700 AR 25 2022 3 01 03 1129-1150 |
allfieldsSound |
10.1007/s11280-022-01025-2 doi (DE-627)SPR046971947 (SPR)s11280-022-01025-2-e DE-627 ger DE-627 rakwb eng Madukwe, Kosisochukwu Judith verfasserin (orcid)0000-0003-1641-6767 aut Token replacement-based data augmentation methods for hate speech detection 2022 Text txt rdacontent Computermedien c rdamedia Online-Ressource cr rdacarrier © The Author(s) 2022 Abstract Hate speech detection mostly involves the use of text data. This data, usually sourced from various social media platforms, have been known to be plagued with numerous issues that result in a reduction of its quality and hence, the quality of the trained models. Some of these issues are the lack of diversity and the diminutive class of interest in the dataset which results in overfitted models that do not generalize well on other or newly collected data. The different ways of handling these issues include augmenting the data with diverse samples, engineering non-redundant features or designing robust classification models. In this study, the focus is on the data augmentation aspect. Data augmentation is a popular method for improving the quality of existing datasets by generating synthetic samples that mimic the distribution of the original samples. There is a lack of extensive studies on how hate speech texts respond to varying textual data augmentation techniques and methods. Specifically, we provide further insight into the token replacement method of textual data augmentation by performing empirical studies that investigate which embedding method(s) is a robust source of synonym for replacement process, what effective method(s) can be used to select words to be replaced, and how to confirm if the label within each class is preserved. Our proposed methods, validated on two commonly used hate speech datasets affected by a known lack of diversity and diminutive class of interest issues, significantly improve classification performance and provides insights into token replacement methods. Hate speech data (dpeaa)DE-He213 Data augmentation (dpeaa)DE-He213 Token substitution (dpeaa)DE-He213 Word replacement (dpeaa)DE-He213 Data generation (dpeaa)DE-He213 Text data (dpeaa)DE-He213 Gao, Xiaoying aut Xue, Bing aut Enthalten in World wide web Dordrecht [u.a.] : Springer Science + Business Media B.V, 1998 25(2022), 3 vom: 01. März, Seite 1129-1150 (DE-627)319530604 (DE-600)2025142-7 1573-1413 nnns volume:25 year:2022 number:3 day:01 month:03 pages:1129-1150 https://dx.doi.org/10.1007/s11280-022-01025-2 kostenfrei Volltext GBV_USEFLAG_A SYSFLAG_A GBV_SPRINGER GBV_ILN_11 GBV_ILN_20 GBV_ILN_22 GBV_ILN_23 GBV_ILN_24 GBV_ILN_31 GBV_ILN_32 GBV_ILN_39 GBV_ILN_40 GBV_ILN_60 GBV_ILN_62 GBV_ILN_63 GBV_ILN_69 GBV_ILN_70 GBV_ILN_73 GBV_ILN_74 GBV_ILN_90 GBV_ILN_95 GBV_ILN_100 GBV_ILN_101 GBV_ILN_105 GBV_ILN_110 GBV_ILN_120 GBV_ILN_138 GBV_ILN_150 GBV_ILN_151 GBV_ILN_152 GBV_ILN_161 GBV_ILN_170 GBV_ILN_171 GBV_ILN_187 GBV_ILN_213 GBV_ILN_224 GBV_ILN_230 GBV_ILN_250 GBV_ILN_281 GBV_ILN_285 GBV_ILN_293 GBV_ILN_370 GBV_ILN_602 GBV_ILN_636 GBV_ILN_702 GBV_ILN_2001 GBV_ILN_2003 GBV_ILN_2004 GBV_ILN_2005 GBV_ILN_2006 GBV_ILN_2007 GBV_ILN_2008 GBV_ILN_2009 GBV_ILN_2010 GBV_ILN_2011 GBV_ILN_2014 GBV_ILN_2015 GBV_ILN_2020 GBV_ILN_2021 GBV_ILN_2025 GBV_ILN_2026 GBV_ILN_2027 GBV_ILN_2031 GBV_ILN_2034 GBV_ILN_2037 GBV_ILN_2038 GBV_ILN_2039 GBV_ILN_2044 GBV_ILN_2048 GBV_ILN_2049 GBV_ILN_2050 GBV_ILN_2055 GBV_ILN_2056 GBV_ILN_2057 GBV_ILN_2059 GBV_ILN_2061 GBV_ILN_2064 GBV_ILN_2065 GBV_ILN_2068 GBV_ILN_2088 GBV_ILN_2093 GBV_ILN_2106 GBV_ILN_2107 GBV_ILN_2108 GBV_ILN_2110 GBV_ILN_2111 GBV_ILN_2112 GBV_ILN_2113 GBV_ILN_2118 GBV_ILN_2122 GBV_ILN_2129 GBV_ILN_2143 GBV_ILN_2144 GBV_ILN_2147 GBV_ILN_2148 GBV_ILN_2152 GBV_ILN_2153 GBV_ILN_2188 GBV_ILN_2190 GBV_ILN_2232 GBV_ILN_2336 GBV_ILN_2446 GBV_ILN_2470 GBV_ILN_2472 GBV_ILN_2507 GBV_ILN_2522 GBV_ILN_2548 GBV_ILN_4035 GBV_ILN_4037 GBV_ILN_4046 GBV_ILN_4112 GBV_ILN_4125 GBV_ILN_4126 GBV_ILN_4242 GBV_ILN_4246 GBV_ILN_4249 GBV_ILN_4251 GBV_ILN_4305 GBV_ILN_4306 GBV_ILN_4307 GBV_ILN_4313 GBV_ILN_4322 GBV_ILN_4323 GBV_ILN_4324 GBV_ILN_4325 GBV_ILN_4326 GBV_ILN_4328 GBV_ILN_4333 GBV_ILN_4334 GBV_ILN_4335 GBV_ILN_4336 GBV_ILN_4338 GBV_ILN_4393 GBV_ILN_4700 AR 25 2022 3 01 03 1129-1150 |
language |
English |
source |
Enthalten in World wide web 25(2022), 3 vom: 01. März, Seite 1129-1150 volume:25 year:2022 number:3 day:01 month:03 pages:1129-1150 |
sourceStr |
Enthalten in World wide web 25(2022), 3 vom: 01. März, Seite 1129-1150 volume:25 year:2022 number:3 day:01 month:03 pages:1129-1150 |
format_phy_str_mv |
Article |
institution |
findex.gbv.de |
topic_facet |
Hate speech data Data augmentation Token substitution Word replacement Data generation Text data |
isfreeaccess_bool |
true |
container_title |
World wide web |
authorswithroles_txt_mv |
Madukwe, Kosisochukwu Judith @@aut@@ Gao, Xiaoying @@aut@@ Xue, Bing @@aut@@ |
publishDateDaySort_date |
2022-03-01T00:00:00Z |
hierarchy_top_id |
319530604 |
id |
SPR046971947 |
language_de |
englisch |
fullrecord |
<?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>01000caa a22002652 4500</leader><controlfield tag="001">SPR046971947</controlfield><controlfield tag="003">DE-627</controlfield><controlfield tag="005">20230507180103.0</controlfield><controlfield tag="007">cr uuu---uuuuu</controlfield><controlfield tag="008">220512s2022 xx |||||o 00| ||eng c</controlfield><datafield tag="024" ind1="7" ind2=" "><subfield code="a">10.1007/s11280-022-01025-2</subfield><subfield code="2">doi</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-627)SPR046971947</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(SPR)s11280-022-01025-2-e</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-627</subfield><subfield code="b">ger</subfield><subfield code="c">DE-627</subfield><subfield code="e">rakwb</subfield></datafield><datafield tag="041" ind1=" " ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Madukwe, Kosisochukwu Judith</subfield><subfield code="e">verfasserin</subfield><subfield code="0">(orcid)0000-0003-1641-6767</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Token replacement-based data augmentation methods for hate speech detection</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="c">2022</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="a">Text</subfield><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="a">Computermedien</subfield><subfield code="b">c</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="a">Online-Ressource</subfield><subfield code="b">cr</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="500" ind1=" " ind2=" "><subfield code="a">© The Author(s) 2022</subfield></datafield><datafield tag="520" ind1=" " ind2=" "><subfield code="a">Abstract Hate speech detection mostly involves the use of text data. This data, usually sourced from various social media platforms, have been known to be plagued with numerous issues that result in a reduction of its quality and hence, the quality of the trained models. Some of these issues are the lack of diversity and the diminutive class of interest in the dataset which results in overfitted models that do not generalize well on other or newly collected data. The different ways of handling these issues include augmenting the data with diverse samples, engineering non-redundant features or designing robust classification models. In this study, the focus is on the data augmentation aspect. Data augmentation is a popular method for improving the quality of existing datasets by generating synthetic samples that mimic the distribution of the original samples. There is a lack of extensive studies on how hate speech texts respond to varying textual data augmentation techniques and methods. Specifically, we provide further insight into the token replacement method of textual data augmentation by performing empirical studies that investigate which embedding method(s) is a robust source of synonym for replacement process, what effective method(s) can be used to select words to be replaced, and how to confirm if the label within each class is preserved. Our proposed methods, validated on two commonly used hate speech datasets affected by a known lack of diversity and diminutive class of interest issues, significantly improve classification performance and provides insights into token replacement methods.</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Hate speech data</subfield><subfield code="7">(dpeaa)DE-He213</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Data augmentation</subfield><subfield code="7">(dpeaa)DE-He213</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Token substitution</subfield><subfield code="7">(dpeaa)DE-He213</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Word replacement</subfield><subfield code="7">(dpeaa)DE-He213</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Data generation</subfield><subfield code="7">(dpeaa)DE-He213</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Text data</subfield><subfield code="7">(dpeaa)DE-He213</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Gao, Xiaoying</subfield><subfield code="4">aut</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Xue, Bing</subfield><subfield code="4">aut</subfield></datafield><datafield tag="773" ind1="0" ind2="8"><subfield code="i">Enthalten in</subfield><subfield code="t">World wide web</subfield><subfield code="d">Dordrecht [u.a.] : Springer Science + Business Media B.V, 1998</subfield><subfield code="g">25(2022), 3 vom: 01. März, Seite 1129-1150</subfield><subfield code="w">(DE-627)319530604</subfield><subfield code="w">(DE-600)2025142-7</subfield><subfield code="x">1573-1413</subfield><subfield code="7">nnns</subfield></datafield><datafield tag="773" ind1="1" ind2="8"><subfield code="g">volume:25</subfield><subfield code="g">year:2022</subfield><subfield code="g">number:3</subfield><subfield code="g">day:01</subfield><subfield code="g">month:03</subfield><subfield code="g">pages:1129-1150</subfield></datafield><datafield tag="856" ind1="4" ind2="0"><subfield code="u">https://dx.doi.org/10.1007/s11280-022-01025-2</subfield><subfield code="z">kostenfrei</subfield><subfield code="3">Volltext</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_USEFLAG_A</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">SYSFLAG_A</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_SPRINGER</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_11</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_20</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_22</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_23</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_24</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_31</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_32</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_39</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_40</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_60</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_62</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_63</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_69</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_70</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_73</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_74</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_90</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_95</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_100</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_101</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_105</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_110</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_120</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_138</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_150</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_151</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_152</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_161</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_170</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_171</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_187</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_213</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_224</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_230</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_250</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_281</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_285</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_293</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_370</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_602</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_636</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_702</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2001</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2003</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2004</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2005</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2006</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2007</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2008</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2009</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2010</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2011</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2014</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2015</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2020</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2021</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2025</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2026</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2027</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2031</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2034</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2037</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2038</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2039</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2044</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2048</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2049</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2050</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2055</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2056</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2057</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2059</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2061</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2064</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2065</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2068</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2088</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2093</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2106</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2107</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2108</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2110</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2111</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2112</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2113</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2118</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2122</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2129</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2143</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2144</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2147</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2148</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2152</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2153</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2188</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2190</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2232</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2336</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2446</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2470</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2472</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2507</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2522</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2548</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4035</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4037</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4046</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4112</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4125</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4126</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4242</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4246</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4249</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4251</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4305</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4306</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4307</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4313</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4322</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4323</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4324</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4325</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4326</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4328</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4333</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4334</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4335</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4336</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4338</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4393</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4700</subfield></datafield><datafield tag="951" ind1=" " ind2=" "><subfield code="a">AR</subfield></datafield><datafield tag="952" ind1=" " ind2=" "><subfield code="d">25</subfield><subfield code="j">2022</subfield><subfield code="e">3</subfield><subfield code="b">01</subfield><subfield code="c">03</subfield><subfield code="h">1129-1150</subfield></datafield></record></collection>
|
author |
Madukwe, Kosisochukwu Judith |
spellingShingle |
Madukwe, Kosisochukwu Judith misc Hate speech data misc Data augmentation misc Token substitution misc Word replacement misc Data generation misc Text data Token replacement-based data augmentation methods for hate speech detection |
authorStr |
Madukwe, Kosisochukwu Judith |
ppnlink_with_tag_str_mv |
@@773@@(DE-627)319530604 |
format |
electronic Article |
delete_txt_mv |
keep |
author_role |
aut aut aut |
collection |
springer |
remote_str |
true |
illustrated |
Not Illustrated |
issn |
1573-1413 |
topic_title |
Token replacement-based data augmentation methods for hate speech detection Hate speech data (dpeaa)DE-He213 Data augmentation (dpeaa)DE-He213 Token substitution (dpeaa)DE-He213 Word replacement (dpeaa)DE-He213 Data generation (dpeaa)DE-He213 Text data (dpeaa)DE-He213 |
topic |
misc Hate speech data misc Data augmentation misc Token substitution misc Word replacement misc Data generation misc Text data |
topic_unstemmed |
misc Hate speech data misc Data augmentation misc Token substitution misc Word replacement misc Data generation misc Text data |
topic_browse |
misc Hate speech data misc Data augmentation misc Token substitution misc Word replacement misc Data generation misc Text data |
format_facet |
Elektronische Aufsätze Aufsätze Elektronische Ressource |
format_main_str_mv |
Text Zeitschrift/Artikel |
carriertype_str_mv |
cr |
hierarchy_parent_title |
World wide web |
hierarchy_parent_id |
319530604 |
hierarchy_top_title |
World wide web |
isfreeaccess_txt |
true |
familylinks_str_mv |
(DE-627)319530604 (DE-600)2025142-7 |
title |
Token replacement-based data augmentation methods for hate speech detection |
ctrlnum |
(DE-627)SPR046971947 (SPR)s11280-022-01025-2-e |
title_full |
Token replacement-based data augmentation methods for hate speech detection |
author_sort |
Madukwe, Kosisochukwu Judith |
journal |
World wide web |
journalStr |
World wide web |
lang_code |
eng |
isOA_bool |
true |
recordtype |
marc |
publishDateSort |
2022 |
contenttype_str_mv |
txt |
container_start_page |
1129 |
author_browse |
Madukwe, Kosisochukwu Judith Gao, Xiaoying Xue, Bing |
container_volume |
25 |
format_se |
Elektronische Aufsätze |
author-letter |
Madukwe, Kosisochukwu Judith |
doi_str_mv |
10.1007/s11280-022-01025-2 |
normlink |
(ORCID)0000-0003-1641-6767 |
normlink_prefix_str_mv |
(orcid)0000-0003-1641-6767 |
title_sort |
token replacement-based data augmentation methods for hate speech detection |
title_auth |
Token replacement-based data augmentation methods for hate speech detection |
abstract |
Abstract Hate speech detection mostly involves the use of text data. This data, usually sourced from various social media platforms, have been known to be plagued with numerous issues that result in a reduction of its quality and hence, the quality of the trained models. Some of these issues are the lack of diversity and the diminutive class of interest in the dataset which results in overfitted models that do not generalize well on other or newly collected data. The different ways of handling these issues include augmenting the data with diverse samples, engineering non-redundant features or designing robust classification models. In this study, the focus is on the data augmentation aspect. Data augmentation is a popular method for improving the quality of existing datasets by generating synthetic samples that mimic the distribution of the original samples. There is a lack of extensive studies on how hate speech texts respond to varying textual data augmentation techniques and methods. Specifically, we provide further insight into the token replacement method of textual data augmentation by performing empirical studies that investigate which embedding method(s) is a robust source of synonym for replacement process, what effective method(s) can be used to select words to be replaced, and how to confirm if the label within each class is preserved. Our proposed methods, validated on two commonly used hate speech datasets affected by a known lack of diversity and diminutive class of interest issues, significantly improve classification performance and provides insights into token replacement methods. © The Author(s) 2022 |
abstractGer |
Abstract Hate speech detection mostly involves the use of text data. This data, usually sourced from various social media platforms, have been known to be plagued with numerous issues that result in a reduction of its quality and hence, the quality of the trained models. Some of these issues are the lack of diversity and the diminutive class of interest in the dataset which results in overfitted models that do not generalize well on other or newly collected data. The different ways of handling these issues include augmenting the data with diverse samples, engineering non-redundant features or designing robust classification models. In this study, the focus is on the data augmentation aspect. Data augmentation is a popular method for improving the quality of existing datasets by generating synthetic samples that mimic the distribution of the original samples. There is a lack of extensive studies on how hate speech texts respond to varying textual data augmentation techniques and methods. Specifically, we provide further insight into the token replacement method of textual data augmentation by performing empirical studies that investigate which embedding method(s) is a robust source of synonym for replacement process, what effective method(s) can be used to select words to be replaced, and how to confirm if the label within each class is preserved. Our proposed methods, validated on two commonly used hate speech datasets affected by a known lack of diversity and diminutive class of interest issues, significantly improve classification performance and provides insights into token replacement methods. © The Author(s) 2022 |
abstract_unstemmed |
Abstract Hate speech detection mostly involves the use of text data. This data, usually sourced from various social media platforms, have been known to be plagued with numerous issues that result in a reduction of its quality and hence, the quality of the trained models. Some of these issues are the lack of diversity and the diminutive class of interest in the dataset which results in overfitted models that do not generalize well on other or newly collected data. The different ways of handling these issues include augmenting the data with diverse samples, engineering non-redundant features or designing robust classification models. In this study, the focus is on the data augmentation aspect. Data augmentation is a popular method for improving the quality of existing datasets by generating synthetic samples that mimic the distribution of the original samples. There is a lack of extensive studies on how hate speech texts respond to varying textual data augmentation techniques and methods. Specifically, we provide further insight into the token replacement method of textual data augmentation by performing empirical studies that investigate which embedding method(s) is a robust source of synonym for replacement process, what effective method(s) can be used to select words to be replaced, and how to confirm if the label within each class is preserved. Our proposed methods, validated on two commonly used hate speech datasets affected by a known lack of diversity and diminutive class of interest issues, significantly improve classification performance and provides insights into token replacement methods. © The Author(s) 2022 |
collection_details |
GBV_USEFLAG_A SYSFLAG_A GBV_SPRINGER GBV_ILN_11 GBV_ILN_20 GBV_ILN_22 GBV_ILN_23 GBV_ILN_24 GBV_ILN_31 GBV_ILN_32 GBV_ILN_39 GBV_ILN_40 GBV_ILN_60 GBV_ILN_62 GBV_ILN_63 GBV_ILN_69 GBV_ILN_70 GBV_ILN_73 GBV_ILN_74 GBV_ILN_90 GBV_ILN_95 GBV_ILN_100 GBV_ILN_101 GBV_ILN_105 GBV_ILN_110 GBV_ILN_120 GBV_ILN_138 GBV_ILN_150 GBV_ILN_151 GBV_ILN_152 GBV_ILN_161 GBV_ILN_170 GBV_ILN_171 GBV_ILN_187 GBV_ILN_213 GBV_ILN_224 GBV_ILN_230 GBV_ILN_250 GBV_ILN_281 GBV_ILN_285 GBV_ILN_293 GBV_ILN_370 GBV_ILN_602 GBV_ILN_636 GBV_ILN_702 GBV_ILN_2001 GBV_ILN_2003 GBV_ILN_2004 GBV_ILN_2005 GBV_ILN_2006 GBV_ILN_2007 GBV_ILN_2008 GBV_ILN_2009 GBV_ILN_2010 GBV_ILN_2011 GBV_ILN_2014 GBV_ILN_2015 GBV_ILN_2020 GBV_ILN_2021 GBV_ILN_2025 GBV_ILN_2026 GBV_ILN_2027 GBV_ILN_2031 GBV_ILN_2034 GBV_ILN_2037 GBV_ILN_2038 GBV_ILN_2039 GBV_ILN_2044 GBV_ILN_2048 GBV_ILN_2049 GBV_ILN_2050 GBV_ILN_2055 GBV_ILN_2056 GBV_ILN_2057 GBV_ILN_2059 GBV_ILN_2061 GBV_ILN_2064 GBV_ILN_2065 GBV_ILN_2068 GBV_ILN_2088 GBV_ILN_2093 GBV_ILN_2106 GBV_ILN_2107 GBV_ILN_2108 GBV_ILN_2110 GBV_ILN_2111 GBV_ILN_2112 GBV_ILN_2113 GBV_ILN_2118 GBV_ILN_2122 GBV_ILN_2129 GBV_ILN_2143 GBV_ILN_2144 GBV_ILN_2147 GBV_ILN_2148 GBV_ILN_2152 GBV_ILN_2153 GBV_ILN_2188 GBV_ILN_2190 GBV_ILN_2232 GBV_ILN_2336 GBV_ILN_2446 GBV_ILN_2470 GBV_ILN_2472 GBV_ILN_2507 GBV_ILN_2522 GBV_ILN_2548 GBV_ILN_4035 GBV_ILN_4037 GBV_ILN_4046 GBV_ILN_4112 GBV_ILN_4125 GBV_ILN_4126 GBV_ILN_4242 GBV_ILN_4246 GBV_ILN_4249 GBV_ILN_4251 GBV_ILN_4305 GBV_ILN_4306 GBV_ILN_4307 GBV_ILN_4313 GBV_ILN_4322 GBV_ILN_4323 GBV_ILN_4324 GBV_ILN_4325 GBV_ILN_4326 GBV_ILN_4328 GBV_ILN_4333 GBV_ILN_4334 GBV_ILN_4335 GBV_ILN_4336 GBV_ILN_4338 GBV_ILN_4393 GBV_ILN_4700 |
container_issue |
3 |
title_short |
Token replacement-based data augmentation methods for hate speech detection |
url |
https://dx.doi.org/10.1007/s11280-022-01025-2 |
remote_bool |
true |
author2 |
Gao, Xiaoying Xue, Bing |
author2Str |
Gao, Xiaoying Xue, Bing |
ppnlink |
319530604 |
mediatype_str_mv |
c |
isOA_txt |
true |
hochschulschrift_bool |
false |
doi_str |
10.1007/s11280-022-01025-2 |
up_date |
2024-07-04T01:16:39.361Z |
_version_ |
1803609229872857088 |
fullrecord_marcxml |
<?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>01000caa a22002652 4500</leader><controlfield tag="001">SPR046971947</controlfield><controlfield tag="003">DE-627</controlfield><controlfield tag="005">20230507180103.0</controlfield><controlfield tag="007">cr uuu---uuuuu</controlfield><controlfield tag="008">220512s2022 xx |||||o 00| ||eng c</controlfield><datafield tag="024" ind1="7" ind2=" "><subfield code="a">10.1007/s11280-022-01025-2</subfield><subfield code="2">doi</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-627)SPR046971947</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(SPR)s11280-022-01025-2-e</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-627</subfield><subfield code="b">ger</subfield><subfield code="c">DE-627</subfield><subfield code="e">rakwb</subfield></datafield><datafield tag="041" ind1=" " ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Madukwe, Kosisochukwu Judith</subfield><subfield code="e">verfasserin</subfield><subfield code="0">(orcid)0000-0003-1641-6767</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Token replacement-based data augmentation methods for hate speech detection</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="c">2022</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="a">Text</subfield><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="a">Computermedien</subfield><subfield code="b">c</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="a">Online-Ressource</subfield><subfield code="b">cr</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="500" ind1=" " ind2=" "><subfield code="a">© The Author(s) 2022</subfield></datafield><datafield tag="520" ind1=" " ind2=" "><subfield code="a">Abstract Hate speech detection mostly involves the use of text data. This data, usually sourced from various social media platforms, have been known to be plagued with numerous issues that result in a reduction of its quality and hence, the quality of the trained models. Some of these issues are the lack of diversity and the diminutive class of interest in the dataset which results in overfitted models that do not generalize well on other or newly collected data. The different ways of handling these issues include augmenting the data with diverse samples, engineering non-redundant features or designing robust classification models. In this study, the focus is on the data augmentation aspect. Data augmentation is a popular method for improving the quality of existing datasets by generating synthetic samples that mimic the distribution of the original samples. There is a lack of extensive studies on how hate speech texts respond to varying textual data augmentation techniques and methods. Specifically, we provide further insight into the token replacement method of textual data augmentation by performing empirical studies that investigate which embedding method(s) is a robust source of synonym for replacement process, what effective method(s) can be used to select words to be replaced, and how to confirm if the label within each class is preserved. Our proposed methods, validated on two commonly used hate speech datasets affected by a known lack of diversity and diminutive class of interest issues, significantly improve classification performance and provides insights into token replacement methods.</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Hate speech data</subfield><subfield code="7">(dpeaa)DE-He213</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Data augmentation</subfield><subfield code="7">(dpeaa)DE-He213</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Token substitution</subfield><subfield code="7">(dpeaa)DE-He213</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Word replacement</subfield><subfield code="7">(dpeaa)DE-He213</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Data generation</subfield><subfield code="7">(dpeaa)DE-He213</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Text data</subfield><subfield code="7">(dpeaa)DE-He213</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Gao, Xiaoying</subfield><subfield code="4">aut</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Xue, Bing</subfield><subfield code="4">aut</subfield></datafield><datafield tag="773" ind1="0" ind2="8"><subfield code="i">Enthalten in</subfield><subfield code="t">World wide web</subfield><subfield code="d">Dordrecht [u.a.] : Springer Science + Business Media B.V, 1998</subfield><subfield code="g">25(2022), 3 vom: 01. März, Seite 1129-1150</subfield><subfield code="w">(DE-627)319530604</subfield><subfield code="w">(DE-600)2025142-7</subfield><subfield code="x">1573-1413</subfield><subfield code="7">nnns</subfield></datafield><datafield tag="773" ind1="1" ind2="8"><subfield code="g">volume:25</subfield><subfield code="g">year:2022</subfield><subfield code="g">number:3</subfield><subfield code="g">day:01</subfield><subfield code="g">month:03</subfield><subfield code="g">pages:1129-1150</subfield></datafield><datafield tag="856" ind1="4" ind2="0"><subfield code="u">https://dx.doi.org/10.1007/s11280-022-01025-2</subfield><subfield code="z">kostenfrei</subfield><subfield code="3">Volltext</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_USEFLAG_A</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">SYSFLAG_A</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_SPRINGER</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_11</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_20</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_22</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_23</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_24</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_31</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_32</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_39</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_40</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_60</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_62</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_63</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_69</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_70</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_73</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_74</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_90</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_95</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_100</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_101</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_105</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_110</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_120</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_138</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_150</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_151</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_152</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_161</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_170</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_171</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_187</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_213</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_224</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_230</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_250</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_281</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_285</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_293</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_370</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_602</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_636</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_702</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2001</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2003</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2004</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2005</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2006</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2007</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2008</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2009</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2010</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2011</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2014</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2015</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2020</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2021</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2025</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2026</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2027</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2031</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2034</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2037</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2038</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2039</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2044</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2048</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2049</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2050</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2055</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2056</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2057</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2059</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2061</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2064</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2065</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2068</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2088</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2093</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2106</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2107</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2108</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2110</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2111</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2112</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2113</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2118</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2122</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2129</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2143</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2144</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2147</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2148</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2152</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2153</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2188</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2190</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2232</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2336</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2446</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2470</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2472</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2507</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2522</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2548</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4035</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4037</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4046</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4112</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4125</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4126</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4242</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4246</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4249</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4251</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4305</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4306</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4307</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4313</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4322</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4323</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4324</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4325</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4326</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4328</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4333</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4334</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4335</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4336</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4338</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4393</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4700</subfield></datafield><datafield tag="951" ind1=" " ind2=" "><subfield code="a">AR</subfield></datafield><datafield tag="952" ind1=" " ind2=" "><subfield code="d">25</subfield><subfield code="j">2022</subfield><subfield code="e">3</subfield><subfield code="b">01</subfield><subfield code="c">03</subfield><subfield code="h">1129-1150</subfield></datafield></record></collection>
|
score |
7.399761 |