Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering
Abstract: Current Medical Image Visual Question Answering (Med-VQA) models often exploit language bias instead of learning multimodal features from both vision and language, which, combined with sparse data, often leads to poor performance. In this paper, we propose a new pre-trained multilevel fusion network based on Vision-conditioned reasoning and Bilinear attentions for Med-VQA (VB-MVQA). To augment vision data, we first incorporate Contrastive Language-Image Pre-training (CLIP) and attention mechanisms to extract medical image features effectively. The proposed VB-MVQA model then applies multiple stacked attention layers and a Bilinear Attention Network (BAN) to fuse the extracted image features with the question features extracted by a Bidirectional Long Short-Term Memory (Bi-LSTM) network. On this basis, the proposed VB-MVQA model introduces vision-conditioned reasoning to guide importance selection over the fused multimodal features and to further enhance image semantic information, thereby eliminating language bias. Extensive experiments on three public benchmark datasets (VQA-RAD, SLAKE, and VQA-Med-2019) show that the proposed model outperforms state-of-the-art models by average improvements of 11.08%, 5.28%, and 8.30%, respectively; the proposed method achieves significantly higher accuracy than the baseline models on open-ended questions and is more robust on language-biased Med-VQA datasets.
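To make the fusion step described in the abstract concrete, the following is a minimal, illustrative PyTorch sketch of a BAN-style bilinear attention that combines image-region features (such as a CLIP visual encoder might produce) with question-token features (such as a Bi-LSTM might produce). All names, dimensions, and the pooling choice are assumptions made for illustration only; this is not the authors' VB-MVQA implementation.

# Illustrative sketch only (PyTorch assumed); shapes and names are hypothetical.
import torch
import torch.nn as nn

class BilinearAttentionFusion(nn.Module):
    """Fuses image-region features (e.g. CLIP patches) with question-token
    features (e.g. Bi-LSTM hidden states) via a bilinear attention map,
    in the spirit of a Bilinear Attention Network (BAN)."""
    def __init__(self, img_dim, q_dim, joint_dim):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)  # project image regions
        self.q_proj = nn.Linear(q_dim, joint_dim)      # project question tokens
        self.out = nn.Linear(joint_dim, joint_dim)

    def forward(self, img_feats, q_feats):
        # img_feats: (B, R, img_dim) regions; q_feats: (B, T, q_dim) tokens
        v = self.img_proj(img_feats)                                      # (B, R, J)
        q = self.q_proj(q_feats)                                          # (B, T, J)
        # Bilinear attention over all (region, token) pairs
        att = torch.softmax(torch.einsum("brj,btj->brt", v, q), dim=-1)   # (B, R, T)
        # Attend question tokens per region, gate by the region feature, pool regions
        fused = torch.einsum("brt,btj->brj", att, q) * v                  # (B, R, J)
        return self.out(fused.mean(dim=1))                                # (B, J)

# Hypothetical usage: 49 CLIP patch features (512-d), 20 Bi-LSTM states (1024-d)
fusion = BilinearAttentionFusion(img_dim=512, q_dim=1024, joint_dim=768)
joint = fusion(torch.randn(2, 49, 512), torch.randn(2, 20, 1024))
print(joint.shape)  # torch.Size([2, 768])

In the pipeline the abstract describes, a joint embedding of this kind would then feed the vision-conditioned reasoning module and the answer classifier; the sketch stops at the fused representation.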
Detailed description
Author: Cai, Linqin [author]
Format: Article
Language: English
Published: 2023
Keywords: Medical image visual question answering
Note: © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Parent work: Contained in: The journal of supercomputing - Springer US, 1987, 79(2023), issue 12, dated 29 March, pages 13696-13723
Parent work: volume:79 ; year:2023 ; number:12 ; day:29 ; month:03 ; pages:13696-13723
Links:
DOI / URN: 10.1007/s11227-023-05195-2
Catalogue ID: OLC2143981376
LEADER  01000naa a22002652 4500
001     OLC2143981376
003     DE-627
005     20240118092728.0
007     tu
008     240118s2023 xx ||||| 00| ||eng c
024 7_  |a 10.1007/s11227-023-05195-2 |2 doi
035 __  |a (DE-627)OLC2143981376
035 __  |a (DE-He213)s11227-023-05195-2-p
040 __  |a DE-627 |b ger |c DE-627 |e rakwb
041 __  |a eng
082 04  |a 004 |a 620 |q VZ
100 1_  |a Cai, Linqin |e verfasserin |4 aut
245 10  |a Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering
264 _1  |c 2023
336 __  |a Text |b txt |2 rdacontent
337 __  |a ohne Hilfsmittel zu benutzen |b n |2 rdamedia
338 __  |a Band |b nc |2 rdacarrier
500 __  |a © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
520 __  |a Abstract: Current Medical Image Visual Question Answering (Med-VQA) models often exploit language bias instead of learning multimodal features from both vision and language, which, combined with sparse data, often leads to poor performance. In this paper, we propose a new pre-trained multilevel fusion network based on Vision-conditioned reasoning and Bilinear attentions for Med-VQA (VB-MVQA). To augment vision data, we first incorporate Contrastive Language-Image Pre-training (CLIP) and attention mechanisms to extract medical image features effectively. The proposed VB-MVQA model then applies multiple stacked attention layers and a Bilinear Attention Network (BAN) to fuse the extracted image features with the question features extracted by a Bidirectional Long Short-Term Memory (Bi-LSTM) network. On this basis, the proposed VB-MVQA model introduces vision-conditioned reasoning to guide importance selection over the fused multimodal features and to further enhance image semantic information, thereby eliminating language bias. Extensive experiments on three public benchmark datasets (VQA-RAD, SLAKE, and VQA-Med-2019) show that the proposed model outperforms state-of-the-art models by average improvements of 11.08%, 5.28%, and 8.30%, respectively; the proposed method achieves significantly higher accuracy than the baseline models on open-ended questions and is more robust on language-biased Med-VQA datasets.
650 _4  |a Medical image visual question answering
650 _4  |a Vision conditional reasoning
650 _4  |a Contrastive language-image pre-training
650 _4  |a Transfer learning
700 1_  |a Fang, Haodu |4 aut
700 1_  |a Li, Zhiqing |4 aut
773 08  |i Enthalten in |t The journal of supercomputing |d Springer US, 1987 |g 79(2023), 12 vom: 29. März, Seite 13696-13723 |w (DE-627)13046466X |w (DE-600)740510-8 |w (DE-576)018667775 |x 0920-8542 |7 nnns
773 18  |g volume:79 |g year:2023 |g number:12 |g day:29 |g month:03 |g pages:13696-13723
856 41  |u https://doi.org/10.1007/s11227-023-05195-2 |z lizenzpflichtig |3 Volltext
912 __  |a GBV_USEFLAG_A
912 __  |a SYSFLAG_A
912 __  |a GBV_OLC
912 __  |a SSG-OLC-TEC
912 __  |a SSG-OLC-MAT
951 __  |a AR
952 __  |d 79 |j 2023 |e 12 |b 29 |c 03 |h 13696-13723
author_variant |
l c lc h f hf z l zl |
matchkey_str |
article:09208542:2023----::rtandutlvlueewrbsdniinodtoeraoignblnaatninfre |
hierarchy_sort_str |
2023 |
publishDate |
2023 |
allfields |
10.1007/s11227-023-05195-2 doi (DE-627)OLC2143981376 (DE-He213)s11227-023-05195-2-p DE-627 ger DE-627 rakwb eng 004 620 VZ Cai, Linqin verfasserin aut Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering 2023 Text txt rdacontent ohne Hilfsmittel zu benutzen n rdamedia Band nc rdacarrier © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law. Abstract Current Medical Image Visual Question Answering (Med-VQA) models often tend to exploit language bias instead of learning the multimodal features from both vision and language, which often suffers from the sparse data and bad performance. In this paper, we propose a new pre-trained multilevel fusion network based on Vision-conditioned reasoning and Bilinear attentions for Med-VQA (VB-MVQA). To augment vision data, we firstly incorporate Contrastive Language-Image Pre-training (CLIP) and attention mechanisms for effectively extracting medical image features. And then, the proposed VB-MVQA model applies multiple stacked attention layers and Bilinear Attention Network (BAN) to fuse the extracted image features and the question features extracted by Bidirectional Long Short-Term Memory(Bi-LSTM). On this basis, the proposed VB-MVQA model introduces vision-conditioned reasoning to guide the importance selection over multimodal fused features and further enhance the image semantic information for eliminating the language bias. Extensive experiments on three public benchmark datasets (VQA-RAD, SLAKE, and VQA-Med-2019) show that the proposed model outperforms state-of-the-art models by an average improvement of 11.08%, 5.28%, and 8.30%, and our proposed method achieves more significant accuracy than the baseline models for open-ended questions and more powerful for language-bias Med-VQA datasets. Medical image visual question answering Vision conditional reasoning Contrastive language-image pre-training Transfer learning Fang, Haodu aut Li, Zhiqing aut Enthalten in The journal of supercomputing Springer US, 1987 79(2023), 12 vom: 29. März, Seite 13696-13723 (DE-627)13046466X (DE-600)740510-8 (DE-576)018667775 0920-8542 nnns volume:79 year:2023 number:12 day:29 month:03 pages:13696-13723 https://doi.org/10.1007/s11227-023-05195-2 lizenzpflichtig Volltext GBV_USEFLAG_A SYSFLAG_A GBV_OLC SSG-OLC-TEC SSG-OLC-MAT AR 79 2023 12 29 03 13696-13723 |
language |
English |
source |
Enthalten in The journal of supercomputing 79(2023), 12 vom: 29. März, Seite 13696-13723 volume:79 year:2023 number:12 day:29 month:03 pages:13696-13723 |
sourceStr |
Enthalten in The journal of supercomputing 79(2023), 12 vom: 29. März, Seite 13696-13723 volume:79 year:2023 number:12 day:29 month:03 pages:13696-13723 |
format_phy_str_mv |
Article |
institution |
findex.gbv.de |
topic_facet |
Medical image visual question answering Vision conditional reasoning Contrastive language-image pre-training Transfer learning |
dewey-raw |
004 |
isfreeaccess_bool |
false |
container_title |
The journal of supercomputing |
authorswithroles_txt_mv |
Cai, Linqin @@aut@@ Fang, Haodu @@aut@@ Li, Zhiqing @@aut@@ |
publishDateDaySort_date |
2023-03-29T00:00:00Z |
hierarchy_top_id |
13046466X |
dewey-sort |
14 |
id |
OLC2143981376 |
language_de |
englisch |
fullrecord |
<?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>01000naa a22002652 4500</leader><controlfield tag="001">OLC2143981376</controlfield><controlfield tag="003">DE-627</controlfield><controlfield tag="005">20240118092728.0</controlfield><controlfield tag="007">tu</controlfield><controlfield tag="008">240118s2023 xx ||||| 00| ||eng c</controlfield><datafield tag="024" ind1="7" ind2=" "><subfield code="a">10.1007/s11227-023-05195-2</subfield><subfield code="2">doi</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-627)OLC2143981376</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-He213)s11227-023-05195-2-p</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-627</subfield><subfield code="b">ger</subfield><subfield code="c">DE-627</subfield><subfield code="e">rakwb</subfield></datafield><datafield tag="041" ind1=" " ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="082" ind1="0" ind2="4"><subfield code="a">004</subfield><subfield code="a">620</subfield><subfield code="q">VZ</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Cai, Linqin</subfield><subfield code="e">verfasserin</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="c">2023</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="a">Text</subfield><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="a">ohne Hilfsmittel zu benutzen</subfield><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="a">Band</subfield><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="500" ind1=" " ind2=" "><subfield code="a">© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.</subfield></datafield><datafield tag="520" ind1=" " ind2=" "><subfield code="a">Abstract Current Medical Image Visual Question Answering (Med-VQA) models often tend to exploit language bias instead of learning the multimodal features from both vision and language, which often suffers from the sparse data and bad performance. In this paper, we propose a new pre-trained multilevel fusion network based on Vision-conditioned reasoning and Bilinear attentions for Med-VQA (VB-MVQA). To augment vision data, we firstly incorporate Contrastive Language-Image Pre-training (CLIP) and attention mechanisms for effectively extracting medical image features. And then, the proposed VB-MVQA model applies multiple stacked attention layers and Bilinear Attention Network (BAN) to fuse the extracted image features and the question features extracted by Bidirectional Long Short-Term Memory(Bi-LSTM). 
On this basis, the proposed VB-MVQA model introduces vision-conditioned reasoning to guide the importance selection over multimodal fused features and further enhance the image semantic information for eliminating the language bias. Extensive experiments on three public benchmark datasets (VQA-RAD, SLAKE, and VQA-Med-2019) show that the proposed model outperforms state-of-the-art models by an average improvement of 11.08%, 5.28%, and 8.30%, and our proposed method achieves more significant accuracy than the baseline models for open-ended questions and more powerful for language-bias Med-VQA datasets.</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Medical image visual question answering</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Vision conditional reasoning</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Contrastive language-image pre-training</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Transfer learning</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Fang, Haodu</subfield><subfield code="4">aut</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Li, Zhiqing</subfield><subfield code="4">aut</subfield></datafield><datafield tag="773" ind1="0" ind2="8"><subfield code="i">Enthalten in</subfield><subfield code="t">The journal of supercomputing</subfield><subfield code="d">Springer US, 1987</subfield><subfield code="g">79(2023), 12 vom: 29. März, Seite 13696-13723</subfield><subfield code="w">(DE-627)13046466X</subfield><subfield code="w">(DE-600)740510-8</subfield><subfield code="w">(DE-576)018667775</subfield><subfield code="x">0920-8542</subfield><subfield code="7">nnns</subfield></datafield><datafield tag="773" ind1="1" ind2="8"><subfield code="g">volume:79</subfield><subfield code="g">year:2023</subfield><subfield code="g">number:12</subfield><subfield code="g">day:29</subfield><subfield code="g">month:03</subfield><subfield code="g">pages:13696-13723</subfield></datafield><datafield tag="856" ind1="4" ind2="1"><subfield code="u">https://doi.org/10.1007/s11227-023-05195-2</subfield><subfield code="z">lizenzpflichtig</subfield><subfield code="3">Volltext</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_USEFLAG_A</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">SYSFLAG_A</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">GBV_OLC</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">SSG-OLC-TEC</subfield></datafield><datafield tag="912" ind1=" " ind2=" "><subfield code="a">SSG-OLC-MAT</subfield></datafield><datafield tag="951" ind1=" " ind2=" "><subfield code="a">AR</subfield></datafield><datafield tag="952" ind1=" " ind2=" "><subfield code="d">79</subfield><subfield code="j">2023</subfield><subfield code="e">12</subfield><subfield code="b">29</subfield><subfield code="c">03</subfield><subfield code="h">13696-13723</subfield></datafield></record></collection>
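The fullrecord field above embeds the complete MARCXML for this record. As a minimal sketch of how such a record could be read programmatically (assuming the third-party pymarc library is installed and the XML has been saved to a local file named record.xml, both assumptions of this example):

# Minimal sketch: pull a few fields out of the MARCXML record shown above.
# Assumptions: pymarc is installed and the XML was saved as "record.xml".
from pymarc import parse_xml_to_array

record = parse_xml_to_array("record.xml")[0]  # the file holds a single <record>

# MARC tags used here: 024 $a = DOI, 245 $a = title, 520 $a = abstract,
# 100/700 $a = main author and co-authors.
doi = record.get_fields("024")[0].get_subfields("a")[0]
title = record.get_fields("245")[0].get_subfields("a")[0]
abstract = record.get_fields("520")[0].get_subfields("a")[0]
authors = [f.get_subfields("a")[0] for f in record.get_fields("100", "700")]

print(doi)      # 10.1007/s11227-023-05195-2
print(title)    # Pre-trained multilevel fuse network based on ...
print(authors)  # ['Cai, Linqin', 'Fang, Haodu', 'Li, Zhiqing']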
author |
Cai, Linqin |
spellingShingle |
Cai, Linqin ddc 004 misc Medical image visual question answering misc Vision conditional reasoning misc Contrastive language-image pre-training misc Transfer learning Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering |
authorStr |
Cai, Linqin |
ppnlink_with_tag_str_mv |
@@773@@(DE-627)13046466X |
format |
Article |
dewey-ones |
004 - Data processing & computer science 620 - Engineering & allied operations |
delete_txt_mv |
keep |
author_role |
aut aut aut |
collection |
OLC |
remote_str |
false |
illustrated |
Not Illustrated |
issn |
0920-8542 |
topic_title |
004 620 VZ Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering Medical image visual question answering Vision conditional reasoning Contrastive language-image pre-training Transfer learning |
topic |
ddc 004 misc Medical image visual question answering misc Vision conditional reasoning misc Contrastive language-image pre-training misc Transfer learning |
topic_unstemmed |
ddc 004 misc Medical image visual question answering misc Vision conditional reasoning misc Contrastive language-image pre-training misc Transfer learning |
topic_browse |
ddc 004 misc Medical image visual question answering misc Vision conditional reasoning misc Contrastive language-image pre-training misc Transfer learning |
format_facet |
Aufsätze Gedruckte Aufsätze |
format_main_str_mv |
Text Zeitschrift/Artikel |
carriertype_str_mv |
nc |
hierarchy_parent_title |
The journal of supercomputing |
hierarchy_parent_id |
13046466X |
dewey-tens |
000 - Computer science, knowledge & systems 620 - Engineering |
hierarchy_top_title |
The journal of supercomputing |
isfreeaccess_txt |
false |
familylinks_str_mv |
(DE-627)13046466X (DE-600)740510-8 (DE-576)018667775 |
title |
Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering |
ctrlnum |
(DE-627)OLC2143981376 (DE-He213)s11227-023-05195-2-p |
title_full |
Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering |
author_sort |
Cai, Linqin |
journal |
The journal of supercomputing |
journalStr |
The journal of supercomputing |
lang_code |
eng |
isOA_bool |
false |
dewey-hundreds |
000 - Computer science, information & general works 600 - Technology |
recordtype |
marc |
publishDateSort |
2023 |
contenttype_str_mv |
txt |
container_start_page |
13696 |
author_browse |
Cai, Linqin Fang, Haodu Li, Zhiqing |
container_volume |
79 |
class |
004 620 VZ |
format_se |
Aufsätze |
author-letter |
Cai, Linqin |
doi_str_mv |
10.1007/s11227-023-05195-2 |
dewey-full |
004 620 |
title_sort |
pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering |
title_auth |
Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering |
abstract |
Abstract: Current Medical Image Visual Question Answering (Med-VQA) models often exploit language bias instead of learning multimodal features from both vision and language, which, combined with sparse data, often leads to poor performance. In this paper, we propose a new pre-trained multilevel fusion network based on Vision-conditioned reasoning and Bilinear attentions for Med-VQA (VB-MVQA). To augment vision data, we first incorporate Contrastive Language-Image Pre-training (CLIP) and attention mechanisms to extract medical image features effectively. The proposed VB-MVQA model then applies multiple stacked attention layers and a Bilinear Attention Network (BAN) to fuse the extracted image features with the question features extracted by a Bidirectional Long Short-Term Memory (Bi-LSTM) network. On this basis, the proposed VB-MVQA model introduces vision-conditioned reasoning to guide importance selection over the fused multimodal features and to further enhance image semantic information, thereby eliminating language bias. Extensive experiments on three public benchmark datasets (VQA-RAD, SLAKE, and VQA-Med-2019) show that the proposed model outperforms state-of-the-art models by average improvements of 11.08%, 5.28%, and 8.30%, respectively; the proposed method achieves significantly higher accuracy than the baseline models on open-ended questions and is more robust on language-biased Med-VQA datasets. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
collection_details |
GBV_USEFLAG_A SYSFLAG_A GBV_OLC SSG-OLC-TEC SSG-OLC-MAT |
container_issue |
12 |
title_short |
Pre-trained multilevel fuse network based on vision-conditioned reasoning and bilinear attentions for medical image visual question answering |
url |
https://doi.org/10.1007/s11227-023-05195-2 |
remote_bool |
false |
author2 |
Fang, Haodu Li, Zhiqing |
author2Str |
Fang, Haodu Li, Zhiqing |
ppnlink |
13046466X |
mediatype_str_mv |
n |
isOA_txt |
false |
hochschulschrift_bool |
false |
doi_str |
10.1007/s11227-023-05195-2 |
up_date |
2024-07-03T19:20:22.510Z |
_version_ |
1803586814618894336 |