Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering
Abstract: Current Medical Image Visual Question Answering (Med-VQA) models often exploit language bias instead of learning multimodal features from both vision and language, which, combined with sparse data, often leads to poor performance. In this paper, we propose a new pre-trained multilevel fusion network based on Vision-conditioned reasoning and Bilinear attentions for Med-VQA (VB-MVQA). To augment vision data, we first incorporate Contrastive Language-Image Pre-training (CLIP) and attention mechanisms to extract medical image features effectively. The proposed VB-MVQA model then applies multiple stacked attention layers and a Bilinear Attention Network (BAN) to fuse the extracted image features with the question features extracted by a Bidirectional Long Short-Term Memory (Bi-LSTM) network. On this basis, the proposed VB-MVQA model introduces vision-conditioned reasoning to guide importance selection over the fused multimodal features and to further enhance image semantic information, thereby eliminating language bias. Extensive experiments on three public benchmark datasets (VQA-RAD, SLAKE, and VQA-Med-2019) show that the proposed model outperforms state-of-the-art models by average improvements of 11.08%, 5.28%, and 8.30%, respectively; the proposed method achieves significantly higher accuracy than the baseline models on open-ended questions and is more robust on language-biased Med-VQA datasets.
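To make the fusion step described in the abstract concrete, the following is a minimal, illustrative PyTorch sketch of a BAN-style bilinear attention that combines image-region features (such as a CLIP visual encoder might produce) with question-token features (such as a Bi-LSTM might produce). All names, dimensions, and the pooling choice are assumptions made for illustration only; this is not the authors' VB-MVQA implementation.

# Illustrative sketch only (PyTorch assumed); shapes and names are hypothetical.
import torch
import torch.nn as nn

class BilinearAttentionFusion(nn.Module):
    """Fuses image-region features (e.g. CLIP patches) with question-token
    features (e.g. Bi-LSTM hidden states) via a bilinear attention map,
    in the spirit of a Bilinear Attention Network (BAN)."""
    def __init__(self, img_dim, q_dim, joint_dim):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)  # project image regions
        self.q_proj = nn.Linear(q_dim, joint_dim)      # project question tokens
        self.out = nn.Linear(joint_dim, joint_dim)

    def forward(self, img_feats, q_feats):
        # img_feats: (B, R, img_dim) regions; q_feats: (B, T, q_dim) tokens
        v = self.img_proj(img_feats)                                      # (B, R, J)
        q = self.q_proj(q_feats)                                          # (B, T, J)
        # Bilinear attention over all (region, token) pairs
        att = torch.softmax(torch.einsum("brj,btj->brt", v, q), dim=-1)   # (B, R, T)
        # Attend question tokens per region, gate by the region feature, pool regions
        fused = torch.einsum("brt,btj->brj", att, q) * v                  # (B, R, J)
        return self.out(fused.mean(dim=1))                                # (B, J)

# Hypothetical usage: 49 CLIP patch features (512-d), 20 Bi-LSTM states (1024-d)
fusion = BilinearAttentionFusion(img_dim=512, q_dim=1024, joint_dim=768)
joint = fusion(torch.randn(2, 49, 512), torch.randn(2, 20, 1024))
print(joint.shape)  # torch.Size([2, 768])

In the pipeline the abstract describes, a joint embedding of this kind would then feed the vision-conditioned reasoning module and the answer classifier; the sketch stops at the fused representation.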
Detailed description
Author: Cai, Linqin [author]
Format: Article
Language: English
Published: 2023
Keywords: Medical image visual question answering
Note: © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Parent work: Contained in: The journal of supercomputing - Springer US, 1987, 79(2023), issue 12, dated 29 March, pages 13696-13723
Parent work: volume:79 ; year:2023 ; number:12 ; day:29 ; month:03 ; pages:13696-13723
Links:
DOI / URN: 10.1007/s11227-023-05195-2
Catalogue ID: OLC2143981376
LEADER  01000naa a22002652 4500
001     OLC2143981376
003     DE-627
005     20240118092728.0
007     tu
008     240118s2023 xx ||||| 00| ||eng c
024 7_  |a 10.1007/s11227-023-05195-2 |2 doi
035 __  |a (DE-627)OLC2143981376
035 __  |a (DE-He213)s11227-023-05195-2-p
040 __  |a DE-627 |b ger |c DE-627 |e rakwb
041 __  |a eng
082 04  |a 004 |a 620 |q VZ
100 1_  |a Cai, Linqin |e verfasserin |4 aut
245 10  |a Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering
264 _1  |c 2023
336 __  |a Text |b txt |2 rdacontent
337 __  |a ohne Hilfsmittel zu benutzen |b n |2 rdamedia
338 __  |a Band |b nc |2 rdacarrier
500 __  |a © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
520 __  |a Abstract: Current Medical Image Visual Question Answering (Med-VQA) models often exploit language bias instead of learning multimodal features from both vision and language, which, combined with sparse data, often leads to poor performance. In this paper, we propose a new pre-trained multilevel fusion network based on Vision-conditioned reasoning and Bilinear attentions for Med-VQA (VB-MVQA). To augment vision data, we first incorporate Contrastive Language-Image Pre-training (CLIP) and attention mechanisms to extract medical image features effectively. The proposed VB-MVQA model then applies multiple stacked attention layers and a Bilinear Attention Network (BAN) to fuse the extracted image features with the question features extracted by a Bidirectional Long Short-Term Memory (Bi-LSTM) network. On this basis, the proposed VB-MVQA model introduces vision-conditioned reasoning to guide importance selection over the fused multimodal features and to further enhance image semantic information, thereby eliminating language bias. Extensive experiments on three public benchmark datasets (VQA-RAD, SLAKE, and VQA-Med-2019) show that the proposed model outperforms state-of-the-art models by average improvements of 11.08%, 5.28%, and 8.30%, respectively; the proposed method achieves significantly higher accuracy than the baseline models on open-ended questions and is more robust on language-biased Med-VQA datasets.
650 _4  |a Medical image visual question answering
650 _4  |a Vision conditional reasoning
650 _4  |a Contrastive language-image pre-training
650 _4  |a Transfer learning
700 1_  |a Fang, Haodu |4 aut
700 1_  |a Li, Zhiqing |4 aut
773 08  |i Enthalten in |t The journal of supercomputing |d Springer US, 1987 |g 79(2023), 12 vom: 29. März, Seite 13696-13723 |w (DE-627)13046466X |w (DE-600)740510-8 |w (DE-576)018667775 |x 0920-8542 |7 nnns
773 18  |g volume:79 |g year:2023 |g number:12 |g day:29 |g month:03 |g pages:13696-13723
856 41  |u https://doi.org/10.1007/s11227-023-05195-2 |z lizenzpflichtig |3 Volltext
912 __  |a GBV_USEFLAG_A
912 __  |a SYSFLAG_A
912 __  |a GBV_OLC
912 __  |a SSG-OLC-TEC
912 __  |a SSG-OLC-MAT
951 __  |a AR
952 __  |d 79 |j 2023 |e 12 |b 29 |c 03 |h 13696-13723
author_variant |
l c lc h f hf z l zl |
matchkey_str |
article:09208542:2023----::rtandutlvlueewrbsdniinodtoeraoignblnaatninfre |
hierarchy_sort_str |
2023 |
publishDate |
2023 |
allfields |
10.1007/s11227-023-05195-2 doi (DE-627)OLC2143981376 (DE-He213)s11227-023-05195-2-p DE-627 ger DE-627 rakwb eng 004 620 VZ Cai, Linqin verfasserin aut Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering 2023 Text txt rdacontent ohne Hilfsmittel zu benutzen n rdamedia Band nc rdacarrier © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law. Abstract Current Medical Image Visual Question Answering (Med-VQA) models often tend to exploit language bias instead of learning the multimodal features from both vision and language, which often suffers from the sparse data and bad performance. In this paper, we propose a new pre-trained multilevel fusion network based on Vision-conditioned reasoning and Bilinear attentions for Med-VQA (VB-MVQA). To augment vision data, we firstly incorporate Contrastive Language-Image Pre-training (CLIP) and attention mechanisms for effectively extracting medical image features. And then, the proposed VB-MVQA model applies multiple stacked attention layers and Bilinear Attention Network (BAN) to fuse the extracted image features and the question features extracted by Bidirectional Long Short-Term Memory(Bi-LSTM). On this basis, the proposed VB-MVQA model introduces vision-conditioned reasoning to guide the importance selection over multimodal fused features and further enhance the image semantic information for eliminating the language bias. Extensive experiments on three public benchmark datasets (VQA-RAD, SLAKE, and VQA-Med-2019) show that the proposed model outperforms state-of-the-art models by an average improvement of 11.08%, 5.28%, and 8.30%, and our proposed method achieves more significant accuracy than the baseline models for open-ended questions and more powerful for language-bias Med-VQA datasets. Medical image visual question answering Vision conditional reasoning Contrastive language-image pre-training Transfer learning Fang, Haodu aut Li, Zhiqing aut Enthalten in The journal of supercomputing Springer US, 1987 79(2023), 12 vom: 29. März, Seite 13696-13723 (DE-627)13046466X (DE-600)740510-8 (DE-576)018667775 0920-8542 nnns volume:79 year:2023 number:12 day:29 month:03 pages:13696-13723 https://doi.org/10.1007/s11227-023-05195-2 lizenzpflichtig Volltext GBV_USEFLAG_A SYSFLAG_A GBV_OLC SSG-OLC-TEC SSG-OLC-MAT AR 79 2023 12 29 03 13696-13723 |
language |
English |
source |
Enthalten in The journal of supercomputing 79(2023), 12 vom: 29. März, Seite 13696-13723 volume:79 year:2023 number:12 day:29 month:03 pages:13696-13723 |
sourceStr |
Enthalten in The journal of supercomputing 79(2023), 12 vom: 29. März, Seite 13696-13723 volume:79 year:2023 number:12 day:29 month:03 pages:13696-13723 |
format_phy_str_mv |
Article |
institution |
findex.gbv.de |
topic_facet |
Medical image visual question answering Vision conditional reasoning Contrastive language-image pre-training Transfer learning |
dewey-raw |
004 |
isfreeaccess_bool |
false |
container_title |
The journal of supercomputing |
authorswithroles_txt_mv |
Cai, Linqin @@aut@@ Fang, Haodu @@aut@@ Li, Zhiqing @@aut@@ |
publishDateDaySort_date |
2023-03-29T00:00:00Z |
hierarchy_top_id |
13046466X |
dewey-sort |
14 |
id |
OLC2143981376 |
language_de |
englisch |
fullrecord |
<?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>01000naa a22002652 4500</leader><controlfield tag="001">OLC2143981376</controlfield><controlfield tag="003">DE-627</controlfield><controlfield tag="005">20240118092728.0</controlfield><controlfield tag="007">tu</controlfield><controlfield tag="008">240118s2023 xx ||||| 00| ||eng c</controlfield><datafield tag="024" ind1="7" ind2=" "><subfield code="a">10.1007/s11227-023-05195-2</subfield><subfield code="2">doi</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-627)OLC2143981376</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-He213)s11227-023-05195-2-p</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-627</subfield><subfield code="b">ger</subfield><subfield code="c">DE-627</subfield><subfield code="e">rakwb</subfield></datafield><datafield tag="041" ind1=" " ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="082" ind1="0" ind2="4"><subfield code="a">004</subfield><subfield code="a">620</subfield><subfield code="q">VZ</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Cai, Linqin</subfield><subfield code="e">verfasserin</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="c">2023</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="a">Text</subfield><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="a">ohne Hilfsmittel zu benutzen</subfield><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="a">Band</subfield><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="500" ind1=" " ind2=" "><subfield code="a">© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.</subfield></datafield><datafield tag="520" ind1=" " ind2=" "><subfield code="a">Abstract Current Medical Image Visual Question Answering (Med-VQA) models often tend to exploit language bias instead of learning the multimodal features from both vision and language, which often suffers from the sparse data and bad performance. In this paper, we propose a new pre-trained multilevel fusion network based on Vision-conditioned reasoning and Bilinear attentions for Med-VQA (VB-MVQA). To augment vision data, we firstly incorporate Contrastive Language-Image Pre-training (CLIP) and attention mechanisms for effectively extracting medical image features. And then, the proposed VB-MVQA model applies multiple stacked attention layers and Bilinear Attention Network (BAN) to fuse the extracted image features and the question features extracted by Bidirectional Long Short-Term Memory(Bi-LSTM). 
On this basis, the proposed VB-MVQA model introduces vision-conditioned reasoning to guide the importance selection over multimodal fused features and further enhance the image semantic information for eliminating the language bias. Extensive experiments on three public benchmark datasets (VQA-RAD, SLAKE, and VQA-Med-2019) show that the proposed model outperforms state-of-the-art models by an average improvement of 11.08%, 5.28%, and 8.30%, and our proposed method achieves more significant accuracy than the baseline models for open-ended questions and more powerful for language-bias Med-VQA datasets.</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Medical image visual question answering</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Vision conditional reasoning</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Contrastive language-image pre-training</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Transfer learning</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Fang, Haodu</subfield><subfield code="4">aut</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Li, Zhiqing</subfield><subfield code="4">aut</subfield></datafield><datafield tag="773" ind1="0" ind2="8"><subfield code="i">Enthalten in</subfield><subfield code="t">The journal of supercomputing</subfield><subfield code="d">Springer US, 1987</subfield><subfield code="g">79(2023), 12 vom: 29. März, Seite 13696-13723</subfield><subfield code="w">(DE-627)13046466X</subfield><subfield code="w">(DE-600)740510-8</subfield><subfield code="w">(DE-576)018667775</subfield><subfield code="x">0920-8542</subfield><subfield code="7">nnns</subfield></datafield><datafield tag="773" ind1="1" ind2="8"><subfield code="g">volume:79</subfield><subfield code="g">year:2023</subfield><subfield code="g">number:12</subfield><subfield code="g">day:29</subfield><subfield code="g">month:03</subfield><subfield code="g">pages:13696-13723</subfield></datafield><datafield tag="856" ind1="4" ind2="1"><subfield code="u">https://doi.org/10.1007/s11227-023-05195-2</subfield><subfield code="z">lizenzpflichtig</subfield><subfield code="3">Volltext</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_USEFLAG_A</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">SYSFLAG_A</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_OLC</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">SSG-OLC-TEC</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">SSG-OLC-MAT</subfield></datafield><datafield tag="951" ind1=" " ind2=" "><subfield code="a">AR</subfield></datafield><datafield tag="952" ind1=" " ind2=" "><subfield code="d">79</subfield><subfield code="j">2023</subfield><subfield code="e">12</subfield><subfield code="b">29</subfield><subfield code="c">03</subfield><subfield code="h">13696-13723</subfield></datafield></record></collection>
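The fullrecord field above embeds the complete MARCXML for this record. As a minimal sketch of how such a record could be read programmatically (assuming the third-party pymarc library is installed and the XML has been saved to a local file named record.xml, both assumptions of this example):

# Minimal sketch: pull a few fields out of the MARCXML record shown above.
# Assumptions: pymarc is installed and the XML was saved as "record.xml".
from pymarc import parse_xml_to_array

record = parse_xml_to_array("record.xml")[0]  # the file holds a single <record>

# MARC tags used here: 024 $a = DOI, 245 $a = title, 520 $a = abstract,
# 100/700 $a = main author and co-authors.
doi = record.get_fields("024")[0].get_subfields("a")[0]
title = record.get_fields("245")[0].get_subfields("a")[0]
abstract = record.get_fields("520")[0].get_subfields("a")[0]
authors = [f.get_subfields("a")[0] for f in record.get_fields("100", "700")]

print(doi)      # 10.1007/s11227-023-05195-2
print(title)    # Pre-trained multilevel fuse network based on ...
print(authors)  # ['Cai, Linqin', 'Fang, Haodu', 'Li, Zhiqing']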
author |
Cai, Linqin |
spellingShingle |
Cai, Linqin ddc 004 misc Medical image visual question answering misc Vision conditional reasoning misc Contrastive language-image pre-training misc Transfer learning Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering |
authorStr |
Cai, Linqin |
ppnlink_with_tag_str_mv |
@@773@@(DE-627)13046466X |
format |
Article |
dewey-ones |
004 - Data processing & computer science 620 - Engineering & allied operations |
delete_txt_mv |
keep |
author_role |
aut aut aut |
collection |
OLC |
remote_str |
false |
illustrated |
Not Illustrated |
issn |
0920-8542 |
topic_title |
004 620 VZ Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering Medical image visual question answering Vision conditional reasoning Contrastive language-image pre-training Transfer learning |
topic |
ddc 004 misc Medical image visual question answering misc Vision conditional reasoning misc Contrastive language-image pre-training misc Transfer learning |
topic_unstemmed |
ddc 004 misc Medical image visual question answering misc Vision conditional reasoning misc Contrastive language-image pre-training misc Transfer learning |
topic_browse |
ddc 004 misc Medical image visual question answering misc Vision conditional reasoning misc Contrastive language-image pre-training misc Transfer learning |
format_facet |
Aufsätze Gedruckte Aufsätze |
format_main_str_mv |
Text Zeitschrift/Artikel |
carriertype_str_mv |
nc |
hierarchy_parent_title |
The journal of supercomputing |
hierarchy_parent_id |
13046466X |
dewey-tens |
000 - Computer science, knowledge & systems 620 - Engineering |
hierarchy_top_title |
The journal of supercomputing |
isfreeaccess_txt |
false |
familylinks_str_mv |
(DE-627)13046466X (DE-600)740510-8 (DE-576)018667775 |
title |
Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering |
ctrlnum |
(DE-627)OLC2143981376 (DE-He213)s11227-023-05195-2-p |
title_full |
Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering |
author_sort |
Cai, Linqin |
journal |
The journal of supercomputing |
journalStr |
The journal of supercomputing |
lang_code |
eng |
isOA_bool |
false |
dewey-hundreds |
000 - Computer science, information & general works 600 - Technology |
recordtype |
marc |
publishDateSort |
2023 |
contenttype_str_mv |
txt |
container_start_page |
13696 |
author_browse |
Cai, Linqin Fang, Haodu Li, Zhiqing |
container_volume |
79 |
class |
004 620 VZ |
format_se |
Aufsätze |
author-letter |
Cai, Linqin |
doi_str_mv |
10.1007/s11227-023-05195-2 |
dewey-full |
004 620 |
title_sort |
pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering |
title_auth |
Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering |
abstract |
Abstract: Current Medical Image Visual Question Answering (Med-VQA) models often exploit language bias instead of learning multimodal features from both vision and language, which, combined with sparse data, often leads to poor performance. In this paper, we propose a new pre-trained multilevel fusion network based on Vision-conditioned reasoning and Bilinear attentions for Med-VQA (VB-MVQA). To augment vision data, we first incorporate Contrastive Language-Image Pre-training (CLIP) and attention mechanisms to extract medical image features effectively. The proposed VB-MVQA model then applies multiple stacked attention layers and a Bilinear Attention Network (BAN) to fuse the extracted image features with the question features extracted by a Bidirectional Long Short-Term Memory (Bi-LSTM) network. On this basis, the proposed VB-MVQA model introduces vision-conditioned reasoning to guide importance selection over the fused multimodal features and to further enhance image semantic information, thereby eliminating language bias. Extensive experiments on three public benchmark datasets (VQA-RAD, SLAKE, and VQA-Med-2019) show that the proposed model outperforms state-of-the-art models by average improvements of 11.08%, 5.28%, and 8.30%, respectively; the proposed method achieves significantly higher accuracy than the baseline models on open-ended questions and is more robust on language-biased Med-VQA datasets. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
collection_details |
GBV_USEFLAG_A SYSFLAG_A GBV_OLC SSG-OLC-TEC SSG-OLC-MAT |
container_issue |
12 |
title_short |
Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering |
url |
https://doi.org/10.1007/s11227-023-05195-2 |
remote_bool |
false |
author2 |
Fang, Haodu Li, Zhiqing |
author2Str |
Fang, Haodu Li, Zhiqing |
ppnlink |
13046466X |
mediatype_str_mv |
n |
isOA_txt |
false |
hochschulschrift_bool |
false |
doi_str |
10.1007/s11227-023-05195-2 |
up_date |
2024-07-03T19:20:22.510Z |
_version_ |
1803586814618894336 |