Selecting frames for automatic speech recognition based on acoustic landmarks
Most mainstream Mel-frequency cepstral coefficient (MFCC) based Automatic Speech Recognition (ASR) systems consider all feature frames equally important. However, the acoustic landmark theory disagrees with this idea. Acoustic landmark theory exploits the quantal non-linear articulatory-acoustic rel...
Detailed description
Author: |
He, Di [author] |
---|
Format: |
Article |
---|
Published: |
2017 |
---|
Rights information: |
Usage rights: © Acoustical Society of America |
---|
Classification: |
|
---|
Parent work: |
Contained in: The journal of the Acoustical Society of America - Melville, NY : AIP, 1929, 141(2017), 5, page 3468-3468 |
---|---|
Parent work: |
volume:141 ; year:2017 ; number:5 ; pages:3468-3468 |
Links: |
---|
DOI / URN: |
10.1121/1.4987204 |
---|
Catalog ID: |
OLC1994830964 |
---|
LEADER | 01000caa a2200265 4500 | ||
---|---|---|---|
001 | OLC1994830964 | ||
003 | DE-627 | ||
005 | 20220223210545.0 | ||
007 | tu | ||
008 | 170721s2017 xx ||||| 00| ||und c | ||
024 | 7 | |a 10.1121/1.4987204 |2 doi | |
028 | 5 | 2 | |a PQ20170901 |
035 | |a (DE-627)OLC1994830964 | ||
035 | |a (DE-599)GBVOLC1994830964 | ||
035 | |a (PRQ)scitation_primary_10_1121_1_49872040 | ||
035 | |a (KEY)0112299120170000141000503468selectingframesforautomaticspeechrecognitionbasedo | ||
040 | |a DE-627 |b ger |c DE-627 |e rakwb | ||
082 | 0 | 4 | |a 530 |q DE-600 |
084 | |a LING |2 fid | ||
084 | |a EQ 1000: |q AVZ |2 rvk | ||
084 | |a 33.12 |2 bkl | ||
084 | |a 50.36 |2 bkl | ||
100 | 1 | |a He, Di |e verfasserin |4 aut | |
245 | 1 | 0 | |a Selecting frames for automatic speech recognition based on acoustic landmarks |
264 | 1 | |c 2017 | |
336 | |a Text |b txt |2 rdacontent | ||
337 | |a ohne Hilfsmittel zu benutzen |b n |2 rdamedia | ||
338 | |a Band |b nc |2 rdacarrier | ||
520 | |a Most mainstream Mel-frequency cepstral coefficient (MFCC) based Automatic Speech Recognition (ASR) systems consider all feature frames equally important. However, the acoustic landmark theory disagrees with this idea. Acoustic landmark theory exploits the quantal non-linear articulatory-acoustic relationships from human speech perception experiments and provides a theoretical basis of extracting acoustic features in the vicinity of landmark regions where an abrupt change occurs in the spectrum of speech signals. In this work, we conducted experiments, using the TIMIT corpus, on both GMM and DNN based ASR systems and found that frames containing landmarks are more informative than others during the recognition process. We proved that altering the level of emphasis on landmark and non-landmark frames, through re-weighting or removing frame acoustic likelihoods accordingly, can change the phone error rate (PER) of the ASR system in a way dramatically different from making similar changes to random frames. Furthermore, by leveraging the landmark as a heuristic, one of our hybrid DNN frame dropping strategies achieved a PER increment of 0.44% when only scoring less than half, 41.2% to be precise, of the frames. This hybrid strategy out-performs other non-heuristic-based methods and demonstrated the potential of landmarks for computational reduction for ASR. | ||
540 | |a Nutzungsrecht: © Acoustical Society of America | ||
700 | 1 | |a Lim, Boon Pang P |4 oth | |
700 | 1 | |a Yang, Xuesong |4 oth | |
700 | 1 | |a Hasegawa-Johnson, Mark |4 oth | |
700 | 1 | |a Chen, Deming |4 oth | |
773 | 0 | 8 | |i Enthalten in |t The journal of the Acoustical Society of America |d Melville, NY : AIP, 1929 |g 141(2017), 5, Seite 3468-3468 |w (DE-627)129550264 |w (DE-600)219231-7 |w (DE-576)015003663 |x 0001-4966 |7 nnns |
773 | 1 | 8 | |g volume:141 |g year:2017 |g number:5 |g pages:3468-3468 |
856 | 4 | 1 | |u http://dx.doi.org/10.1121/1.4987204 |3 Volltext |
856 | 4 | 2 | |u http://dx.doi.org/10.1121/1.4987204 |
912 | |a GBV_USEFLAG_A | ||
912 | |a SYSFLAG_A | ||
912 | |a GBV_OLC | ||
912 | |a FID-LING | ||
912 | |a SSG-OLC-PHY | ||
912 | |a SSG-OLC-MUS | ||
912 | |a GBV_ILN_59 | ||
912 | |a GBV_ILN_60 | ||
912 | |a GBV_ILN_70 | ||
912 | |a GBV_ILN_120 | ||
912 | |a GBV_ILN_170 | ||
912 | |a GBV_ILN_201 | ||
912 | |a GBV_ILN_2006 | ||
912 | |a GBV_ILN_2011 | ||
912 | |a GBV_ILN_2027 | ||
912 | |a GBV_ILN_2045 | ||
912 | |a GBV_ILN_2192 | ||
912 | |a GBV_ILN_2256 | ||
912 | |a GBV_ILN_4219 | ||
912 | |a GBV_ILN_4315 | ||
912 | |a GBV_ILN_4319 | ||
912 | |a GBV_ILN_4700 | ||
936 | r | v | |a EQ 1000: |
936 | b | k | |a 33.12 |q AVZ |
936 | b | k | |a 50.36 |q AVZ |
951 | |a AR | ||
952 | |d 141 |j 2017 |e 5 |h 3468-3468 |
author_variant |
d h dh |
---|---|
matchkey_str |
article:00014966:2017----::eetnfaefruoaisecrcgiinaeo |
hierarchy_sort_str |
2017 |
bklnumber |
33.12 50.36 |
publishDate |
2017 |
source |
Contained in The journal of the Acoustical Society of America 141(2017), 5, page 3468-3468 volume:141 year:2017 number:5 pages:3468-3468 |
sourceStr |
Contained in The journal of the Acoustical Society of America 141(2017), 5, page 3468-3468 volume:141 year:2017 number:5 pages:3468-3468 |
format_phy_str_mv |
Article |
institution |
findex.gbv.de |
dewey-raw |
530 |
isfreeaccess_bool |
false |
container_title |
The journal of the Acoustical Society of America |
authorswithroles_txt_mv |
He, Di @@aut@@ Lim, Boon Pang P @@oth@@ Yang, Xuesong @@oth@@ Hasegawa-Johnson, Mark @@oth@@ Chen, Deming @@oth@@ |
publishDateDaySort_date |
2017-01-01T00:00:00Z |
hierarchy_top_id |
129550264 |
dewey-sort |
3530 |
id |
OLC1994830964 |
fullrecord |
<?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>01000caa a2200265 4500</leader><controlfield tag="001">OLC1994830964</controlfield><controlfield tag="003">DE-627</controlfield><controlfield tag="005">20220223210545.0</controlfield><controlfield tag="007">tu</controlfield><controlfield tag="008">170721s2017 xx ||||| 00| ||und c</controlfield><datafield tag="024" ind1="7" ind2=" "><subfield code="a">10.1121/1.4987204</subfield><subfield code="2">doi</subfield></datafield><datafield tag="028" ind1="5" ind2="2"><subfield code="a">PQ20170901</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-627)OLC1994830964</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)GBVOLC1994830964</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(PRQ)scitation_primary_10_1121_1_49872040</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(KEY)0112299120170000141000503468selectingframesforautomaticspeechrecognitionbasedo</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-627</subfield><subfield code="b">ger</subfield><subfield code="c">DE-627</subfield><subfield code="e">rakwb</subfield></datafield><datafield tag="082" ind1="0" ind2="4"><subfield code="a">530</subfield><subfield code="q">DE-600</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">LING</subfield><subfield code="2">fid</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">EQ 1000:</subfield><subfield code="q">AVZ</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">33.12</subfield><subfield code="2">bkl</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">50.36</subfield><subfield code="2">bkl</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">He, 
Di</subfield><subfield code="e">verfasserin</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Selecting frames for automatic speech recognition based on acoustic landmarks</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="c">2017</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="a">Text</subfield><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="a">ohne Hilfsmittel zu benutzen</subfield><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="a">Band</subfield><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="520" ind1=" " ind2=" "><subfield code="a">Most mainstream Mel-frequency cepstral coefficient (MFCC) based Automatic Speech Recognition (ASR) systems consider all feature frames equally important. However, the acoustic landmark theory disagrees with this idea. Acoustic landmark theory exploits the quantal non-linear articulatory-acoustic relationships from human speech perception experiments and provides a theoretical basis of extracting acoustic features in the vicinity of landmark regions where an abrupt change occurs in the spectrum of speech signals. In this work, we conducted experiments, using the TIMIT corpus, on both GMM and DNN based ASR systems and found that frames containing landmarks are more informative than others during the recognition process. We proved that altering the level of emphasis on landmark and non-landmark frames, through re-weighting or removing frame acoustic likelihoods accordingly, can change the phone error rate (PER) of the ASR system in a way dramatically different from making similar changes to random frames. 
Furthermore, by leveraging the landmark as a heuristic, one of our hybrid DNN frame dropping strategies achieved a PER increment of 0.44% when only scoring less than half, 41.2% to be precise, of the frames. This hybrid strategy out-performs other non-heuristic-based methods and demonstrated the potential of landmarks for computational reduction for ASR.</subfield></datafield><datafield tag="540" ind1=" " ind2=" "><subfield code="a">Nutzungsrecht: © Acoustical Society of America</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Lim, Boon Pang P</subfield><subfield code="4">oth</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Yang, Xuesong</subfield><subfield code="4">oth</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Hasegawa-Johnson, Mark</subfield><subfield code="4">oth</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Chen, Deming</subfield><subfield code="4">oth</subfield></datafield><datafield tag="773" ind1="0" ind2="8"><subfield code="i">Enthalten in</subfield><subfield code="t">The journal of the Acoustical Society of America</subfield><subfield code="d">Melville, NY : AIP, 1929</subfield><subfield code="g">141(2017), 5, Seite 3468-3468</subfield><subfield code="w">(DE-627)129550264</subfield><subfield code="w">(DE-600)219231-7</subfield><subfield code="w">(DE-576)015003663</subfield><subfield code="x">0001-4966</subfield><subfield code="7">nnns</subfield></datafield><datafield tag="773" ind1="1" ind2="8"><subfield code="g">volume:141</subfield><subfield code="g">year:2017</subfield><subfield code="g">number:5</subfield><subfield code="g">pages:3468-3468</subfield></datafield><datafield tag="856" ind1="4" ind2="1"><subfield code="u">http://dx.doi.org/10.1121/1.4987204</subfield><subfield code="3">Volltext</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield 
code="u">http://dx.doi.org/10.1121/1.4987204</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_USEFLAG_A</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">SYSFLAG_A</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_OLC</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">FID-LING</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">SSG-OLC-PHY</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">SSG-OLC-MUS</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_59</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_60</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_70</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_120</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_170</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_201</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2006</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2011</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2027</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2045</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2192</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_2256</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4219</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4315</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield 
code="a">GBV_ILN_4319</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_ILN_4700</subfield></datafield><datafield tag="936" ind1="r" ind2="v"><subfield code="a">EQ 1000:</subfield></datafield><datafield tag="936" ind1="b" ind2="k"><subfield code="a">33.12</subfield><subfield code="q">AVZ</subfield></datafield><datafield tag="936" ind1="b" ind2="k"><subfield code="a">50.36</subfield><subfield code="q">AVZ</subfield></datafield><datafield tag="951" ind1=" " ind2=" "><subfield code="a">AR</subfield></datafield><datafield tag="952" ind1=" " ind2=" "><subfield code="d">141</subfield><subfield code="j">2017</subfield><subfield code="e">5</subfield><subfield code="h">3468-3468</subfield></datafield></record></collection>
|
author |
He, Di |
spellingShingle |
He, Di ddc 530 fid LING rvk EQ 1000: bkl 33.12 bkl 50.36 Selecting frames for automatic speech recognition based on acoustic landmarks |
authorStr |
He, Di |
ppnlink_with_tag_str_mv |
@@773@@(DE-627)129550264 |
format |
Article |
dewey-ones |
530 - Physics |
delete_txt_mv |
keep |
author_role |
aut |
collection |
OLC |
remote_str |
false |
illustrated |
Not Illustrated |
issn |
0001-4966 |
topic_title |
530 DE-600 LING fid EQ 1000: AVZ rvk 33.12 bkl 50.36 bkl Selecting frames for automatic speech recognition based on acoustic landmarks |
topic |
ddc 530 fid LING rvk EQ 1000: bkl 33.12 bkl 50.36 |
topic_unstemmed |
ddc 530 fid LING rvk EQ 1000: bkl 33.12 bkl 50.36 |
topic_browse |
ddc 530 fid LING rvk EQ 1000: bkl 33.12 bkl 50.36 |
format_facet |
Articles Printed articles |
format_main_str_mv |
Text Journal/Article |
carriertype_str_mv |
nc |
author2_variant |
b p p l bpp bppl x y xy m h j mhj d c dc |
hierarchy_parent_title |
The journal of the Acoustical Society of America |
hierarchy_parent_id |
129550264 |
dewey-tens |
530 - Physics |
hierarchy_top_title |
The journal of the Acoustical Society of America |
isfreeaccess_txt |
false |
familylinks_str_mv |
(DE-627)129550264 (DE-600)219231-7 (DE-576)015003663 |
title |
Selecting frames for automatic speech recognition based on acoustic landmarks |
ctrlnum |
(DE-627)OLC1994830964 (DE-599)GBVOLC1994830964 (PRQ)scitation_primary_10_1121_1_49872040 (KEY)0112299120170000141000503468selectingframesforautomaticspeechrecognitionbasedo |
title_full |
Selecting frames for automatic speech recognition based on acoustic landmarks |
author_sort |
He, Di |
journal |
The journal of the Acoustical Society of America |
journalStr |
The journal of the Acoustical Society of America |
isOA_bool |
false |
dewey-hundreds |
500 - Science |
recordtype |
marc |
publishDateSort |
2017 |
contenttype_str_mv |
txt |
container_start_page |
3468 |
author_browse |
He, Di |
container_volume |
141 |
class |
530 DE-600 LING fid EQ 1000: AVZ rvk 33.12 bkl 50.36 bkl |
format_se |
Articles |
author-letter |
He, Di |
doi_str_mv |
10.1121/1.4987204 |
dewey-full |
530 |
title_sort |
selecting frames for automatic speech recognition based on acoustic landmarks |
title_auth |
Selecting frames for automatic speech recognition based on acoustic landmarks |
abstract |
Most mainstream Automatic Speech Recognition (ASR) systems based on Mel-frequency cepstral coefficients (MFCCs) treat all feature frames as equally important. Acoustic landmark theory disagrees with this idea. It exploits the quantal, non-linear articulatory-acoustic relationships observed in human speech perception experiments and provides a theoretical basis for extracting acoustic features in the vicinity of landmark regions, where the spectrum of the speech signal changes abruptly. In this work, we conducted experiments on the TIMIT corpus with both GMM- and DNN-based ASR systems and found that frames containing landmarks are more informative than other frames during recognition. We showed that altering the level of emphasis on landmark and non-landmark frames, by re-weighting or removing the corresponding frame acoustic likelihoods, changes the phone error rate (PER) of the ASR system in a way dramatically different from making similar changes to random frames. Furthermore, using landmarks as a heuristic, one of our hybrid DNN frame-dropping strategies achieved a PER increment of 0.44% while scoring less than half (41.2%, to be precise) of the frames. This hybrid strategy outperforms other non-heuristic-based methods and demonstrates the potential of landmarks for reducing computation in ASR. |
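The abstract describes changing the emphasis on landmark and non-landmark frames by re-weighting or removing frame acoustic likelihoods. The record gives no implementation details, so the following is only a minimal sketch of one way such re-weighting could look. The function names, the default weights, and the choice to model a dropped frame with a weight of 0 (all states equally likely for that frame) are assumptions for illustration, not the authors' method.

```python
from typing import List, Sequence


def reweight_frames(
    log_likes: Sequence[Sequence[float]],
    is_landmark: Sequence[bool],
    landmark_weight: float = 1.0,
    other_weight: float = 0.5,
) -> List[List[float]]:
    """Scale each frame's per-state log-likelihoods by a frame-level weight.

    Hypothetical helper: landmark frames get `landmark_weight`, all other
    frames get `other_weight`. A weight of 0.0 flattens the frame so that it
    no longer favours any state, which is one way to model "removing" it.
    """
    if len(log_likes) != len(is_landmark):
        raise ValueError("need exactly one landmark flag per frame")
    weighted = []
    for frame, flag in zip(log_likes, is_landmark):
        weight = landmark_weight if flag else other_weight
        weighted.append([weight * value for value in frame])
    return weighted


def scored_fraction(is_scored: Sequence[bool]) -> float:
    """Fraction of frames whose acoustic likelihoods are actually computed."""
    if not is_scored:
        raise ValueError("empty utterance")
    return sum(is_scored) / len(is_scored)
```

In this sketch, `scored_fraction` corresponds to the kind of figure the abstract reports as 41.2% of frames scored; the PER itself would still have to be measured by decoding with the modified likelihoods.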
collection_details |
GBV_USEFLAG_A SYSFLAG_A GBV_OLC FID-LING SSG-OLC-PHY SSG-OLC-MUS GBV_ILN_59 GBV_ILN_60 GBV_ILN_70 GBV_ILN_120 GBV_ILN_170 GBV_ILN_201 GBV_ILN_2006 GBV_ILN_2011 GBV_ILN_2027 GBV_ILN_2045 GBV_ILN_2192 GBV_ILN_2256 GBV_ILN_4219 GBV_ILN_4315 GBV_ILN_4319 GBV_ILN_4700 |
container_issue |
5 |
title_short |
Selecting frames for automatic speech recognition based on acoustic landmarks |
url |
http://dx.doi.org/10.1121/1.4987204 |
remote_bool |
false |
author2 |
Lim, Boon Pang P Yang, Xuesong Hasegawa-Johnson, Mark Chen, Deming |
ppnlink |
129550264 |
mediatype_str_mv |
n |
isOA_txt |
false |
hochschulschrift_bool |
false |
author2_role |
oth oth oth oth |
doi_str |
10.1121/1.4987204 |
up_date |
2024-07-03T19:24:17.028Z |
_version_ |
1803587060528840704 |