Matches in SemOpenAlex for { <https://semopenalex.org/work/W4387675775> ?p ?o ?g. }
- W4387675775 endingPage "104840" @default.
- W4387675775 startingPage "104840" @default.
- W4387675775 abstract "Visual Question Answering (VQA) is a task that requires a VQA model to fully understand the visual information of the image and the linguistic information of the question, and then combine both to provide an answer. Recently, many VQA approaches have focused on modeling intra- and inter-modal interactions between vision and language using deep modular co-attention networks, achieving strong performance. Despite their benefits, these methods have limitations. First, the question representation is obtained through GloVe word embeddings and a recurrent neural network, which may not be sufficient to capture the intricate semantics of the question. Second, they mostly use visual appearance features extracted by Faster R-CNN to interact with language features, ignoring important spatial relations between objects in images and thus making incomplete use of the image information. To overcome these limitations, we propose a novel Multi-modal Spatial Relation Attention Network (MSRAN) for VQA, which introduces spatial relationships between objects to fully exploit the image information and thereby improve VQA performance. To this end, we design two types of spatial relational attention modules that comprehensively explore the attention schemes: (i) a Self-Attention based on Explicit Spatial Relation (SA-ESR) module that explicitly models geometric relationships between objects; and (ii) a Self-Attention based on Implicit Spatial Relation (SA-ISR) module that captures hidden dynamic relationships between objects using spatial relationships. Moreover, the pre-trained BERT model, which replaces the GloVe word embeddings and recurrent neural network, is applied in MSRAN to obtain better question representations. Extensive experiments on two large benchmark datasets, VQA 2.0 and GQA, demonstrate that our proposed model achieves state-of-the-art performance." @default.
- W4387675775 created "2023-10-17" @default.
- W4387675775 creator A5037488697 @default.
- W4387675775 creator A5043245528 @default.
- W4387675775 creator A5058091387 @default.
- W4387675775 creator A5073626715 @default.
- W4387675775 creator A5087409717 @default.
- W4387675775 creator A5088762168 @default.
- W4387675775 date "2023-10-01" @default.
- W4387675775 modified "2023-10-17" @default.
- W4387675775 title "Multi-modal spatial relational attention networks for visual question answering" @default.
- W4387675775 cites W2064675550 @default.
- W4387675775 cites W2277195237 @default.
- W4387675775 cites W3037011828 @default.
- W4387675775 cites W3037773948 @default.
- W4387675775 cites W3045725402 @default.
- W4387675775 cites W3047320246 @default.
- W4387675775 cites W3141200244 @default.
- W4387675775 cites W3195797850 @default.
- W4387675775 cites W3197739681 @default.
- W4387675775 cites W3205857725 @default.
- W4387675775 cites W3209017365 @default.
- W4387675775 cites W4210836196 @default.
- W4387675775 cites W4224217283 @default.
- W4387675775 cites W4307724164 @default.
- W4387675775 cites W4379383633 @default.
- W4387675775 cites W4381328259 @default.
- W4387675775 cites W4382468300 @default.
- W4387675775 doi "https://doi.org/10.1016/j.imavis.2023.104840" @default.
- W4387675775 hasPublicationYear "2023" @default.
- W4387675775 type Work @default.
- W4387675775 citedByCount "0" @default.
- W4387675775 crossrefType "journal-article" @default.
- W4387675775 hasAuthorship W4387675775A5037488697 @default.
- W4387675775 hasAuthorship W4387675775A5043245528 @default.
- W4387675775 hasAuthorship W4387675775A5058091387 @default.
- W4387675775 hasAuthorship W4387675775A5073626715 @default.
- W4387675775 hasAuthorship W4387675775A5087409717 @default.
- W4387675775 hasAuthorship W4387675775A5088762168 @default.
- W4387675775 hasConcept C101468663 @default.
- W4387675775 hasConcept C105795698 @default.
- W4387675775 hasConcept C111919701 @default.
- W4387675775 hasConcept C120665830 @default.
- W4387675775 hasConcept C121332964 @default.
- W4387675775 hasConcept C124101348 @default.
- W4387675775 hasConcept C138885662 @default.
- W4387675775 hasConcept C153180895 @default.
- W4387675775 hasConcept C154945302 @default.
- W4387675775 hasConcept C159620131 @default.
- W4387675775 hasConcept C17744445 @default.
- W4387675775 hasConcept C184337299 @default.
- W4387675775 hasConcept C185592680 @default.
- W4387675775 hasConcept C188027245 @default.
- W4387675775 hasConcept C192209626 @default.
- W4387675775 hasConcept C199360897 @default.
- W4387675775 hasConcept C199539241 @default.
- W4387675775 hasConcept C204321447 @default.
- W4387675775 hasConcept C25343380 @default.
- W4387675775 hasConcept C27511587 @default.
- W4387675775 hasConcept C2776359362 @default.
- W4387675775 hasConcept C33923547 @default.
- W4387675775 hasConcept C41008148 @default.
- W4387675775 hasConcept C41895202 @default.
- W4387675775 hasConcept C44291984 @default.
- W4387675775 hasConcept C71139939 @default.
- W4387675775 hasConcept C90805587 @default.
- W4387675775 hasConcept C94625758 @default.
- W4387675775 hasConceptScore W4387675775C101468663 @default.
- W4387675775 hasConceptScore W4387675775C105795698 @default.
- W4387675775 hasConceptScore W4387675775C111919701 @default.
- W4387675775 hasConceptScore W4387675775C120665830 @default.
- W4387675775 hasConceptScore W4387675775C121332964 @default.
- W4387675775 hasConceptScore W4387675775C124101348 @default.
- W4387675775 hasConceptScore W4387675775C138885662 @default.
- W4387675775 hasConceptScore W4387675775C153180895 @default.
- W4387675775 hasConceptScore W4387675775C154945302 @default.
- W4387675775 hasConceptScore W4387675775C159620131 @default.
- W4387675775 hasConceptScore W4387675775C17744445 @default.
- W4387675775 hasConceptScore W4387675775C184337299 @default.
- W4387675775 hasConceptScore W4387675775C185592680 @default.
- W4387675775 hasConceptScore W4387675775C188027245 @default.
- W4387675775 hasConceptScore W4387675775C192209626 @default.
- W4387675775 hasConceptScore W4387675775C199360897 @default.
- W4387675775 hasConceptScore W4387675775C199539241 @default.
- W4387675775 hasConceptScore W4387675775C204321447 @default.
- W4387675775 hasConceptScore W4387675775C25343380 @default.
- W4387675775 hasConceptScore W4387675775C27511587 @default.
- W4387675775 hasConceptScore W4387675775C2776359362 @default.
- W4387675775 hasConceptScore W4387675775C33923547 @default.
- W4387675775 hasConceptScore W4387675775C41008148 @default.
- W4387675775 hasConceptScore W4387675775C41895202 @default.
- W4387675775 hasConceptScore W4387675775C44291984 @default.
- W4387675775 hasConceptScore W4387675775C71139939 @default.
- W4387675775 hasConceptScore W4387675775C90805587 @default.
- W4387675775 hasConceptScore W4387675775C94625758 @default.
- W4387675775 hasLocation W43876757751 @default.
- W4387675775 hasOpenAccess W4387675775 @default.
- W4387675775 hasPrimaryLocation W43876757751 @default.
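The abstract above describes explicit geometric-relation self-attention over detected objects (the SA-ESR module). The paper's own implementation is not part of this record; as a rough illustration only, here is a minimal NumPy sketch of relation-aware self-attention in that spirit. All function names, the 4-d log-offset box encoding, and the weight shapes are assumptions for illustration, not the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def box_geometry(boxes):
    # boxes: (n, 4) rows of [x, y, w, h]; returns pairwise
    # log-scale offsets (n, n, 4), a common explicit encoding
    # of relative object geometry.
    x, y, w, h = boxes.T
    dx = np.log(np.abs(x[:, None] - x[None, :]) / w[:, None] + 1e-3)
    dy = np.log(np.abs(y[:, None] - y[None, :]) / h[:, None] + 1e-3)
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)

def spatial_relation_attention(feats, boxes, wq, wk, wv, wg):
    # feats: (n, d) object appearance features (e.g. from a detector)
    # wq/wk/wv: (d, d) projections; wg: (4,) maps geometric offsets
    # to a scalar relation bias per object pair.
    q, k, v = feats @ wq, feats @ wk, feats @ wv
    content = q @ k.T / np.sqrt(k.shape[-1])           # dot-product scores
    geom = np.maximum(box_geometry(boxes) @ wg, 0.0)   # ReLU-gated spatial bias
    attn = softmax(content + np.log(geom + 1e-6))      # fuse content and geometry
    return attn @ v                                    # (n, d) relation-aware features
```

The implicit variant (SA-ISR) would instead let the model learn pairwise relations from the spatial encoding rather than gate scores with a fixed geometric bias; this sketch only shows the explicit flavor.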