Matches in SemOpenAlex for { <https://semopenalex.org/work/W4313178921> ?p ?o ?g. }
- W4313178921 abstract "Visual appearance is considered to be the most important cue to understand images for cross-modal retrieval, while sometimes the scene text appearing in images can provide valuable information to understand the visual semantics. Most existing cross-modal retrieval approaches ignore scene text information, and directly adding this information may lead to performance degradation in scene-text-free scenarios. To address this issue, we propose a full transformer architecture to unify these cross-modal retrieval scenarios in a single Vision and Scene Text Aggregation framework (ViSTA). Specifically, ViSTA utilizes transformer blocks to directly encode image patches and fuse scene text embeddings to learn an aggregated visual representation for cross-modal retrieval. To tackle the missing-modality problem of scene text, we propose a novel fusion-token-based transformer aggregation approach that exchanges the necessary scene text information only through the fusion token and concentrates on the most important features in each modality. To further strengthen the visual modality, we develop dual contrastive learning losses to embed both image-text pairs and fusion-text pairs into a common cross-modal space. Compared to existing methods, ViSTA can aggregate relevant scene text semantics with visual appearance, and hence improves results under both scene-text-free and scene-text-aware scenarios. Experimental results show that ViSTA outperforms other methods by at least 8.4% at Recall@1 on the scene-text-aware retrieval task. Compared with state-of-the-art scene-text-free retrieval methods, ViSTA achieves better accuracy on Flickr30K and MSCOCO while running at least three times faster during the inference stage, which validates the effectiveness of the proposed framework." @default.
- W4313178921 created "2023-01-06" @default.
- W4313178921 creator A5008752777 @default.
- W4313178921 creator A5010479652 @default.
- W4313178921 creator A5016199651 @default.
- W4313178921 creator A5050031109 @default.
- W4313178921 creator A5051264771 @default.
- W4313178921 creator A5051375991 @default.
- W4313178921 creator A5062900458 @default.
- W4313178921 creator A5071772212 @default.
- W4313178921 creator A5075880303 @default.
- W4313178921 creator A5076936566 @default.
- W4313178921 creator A5080777665 @default.
- W4313178921 date "2022-06-01" @default.
- W4313178921 modified "2023-10-06" @default.
- W4313178921 title "ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval" @default.
- W4313178921 cites W2185175083 @default.
- W4313178921 cites W2277195237 @default.
- W4313178921 cites W2481240925 @default.
- W4313178921 cites W2727586675 @default.
- W4313178921 cites W2886641317 @default.
- W4313178921 cites W2888894220 @default.
- W4313178921 cites W2979382951 @default.
- W4313178921 cites W2988326850 @default.
- W4313178921 cites W2988823324 @default.
- W4313178921 cites W2998356391 @default.
- W4313178921 cites W3034336960 @default.
- W4313178921 cites W3034727271 @default.
- W4313178921 cites W3035454331 @default.
- W4313178921 cites W3035605030 @default.
- W4313178921 cites W3110661548 @default.
- W4313178921 cites W3118694826 @default.
- W4313178921 cites W3120581333 @default.
- W4313178921 cites W3168433561 @default.
- W4313178921 cites W3171668871 @default.
- W4313178921 cites W3173220247 @default.
- W4313178921 cites W3173909648 @default.
- W4313178921 cites W3175888430 @default.
- W4313178921 cites W3177167102 @default.
- W4313178921 cites W3181159501 @default.
- W4313178921 cites W3184784418 @default.
- W4313178921 doi "https://doi.org/10.1109/cvpr52688.2022.00512" @default.
- W4313178921 hasPublicationYear "2022" @default.
- W4313178921 type Work @default.
- W4313178921 citedByCount "11" @default.
- W4313178921 countsByYear W43131789212022 @default.
- W4313178921 countsByYear W43131789212023 @default.
- W4313178921 crossrefType "proceedings-article" @default.
- W4313178921 hasAuthorship W4313178921A5008752777 @default.
- W4313178921 hasAuthorship W4313178921A5010479652 @default.
- W4313178921 hasAuthorship W4313178921A5016199651 @default.
- W4313178921 hasAuthorship W4313178921A5050031109 @default.
- W4313178921 hasAuthorship W4313178921A5051264771 @default.
- W4313178921 hasAuthorship W4313178921A5051375991 @default.
- W4313178921 hasAuthorship W4313178921A5062900458 @default.
- W4313178921 hasAuthorship W4313178921A5071772212 @default.
- W4313178921 hasAuthorship W4313178921A5075880303 @default.
- W4313178921 hasAuthorship W4313178921A5076936566 @default.
- W4313178921 hasAuthorship W4313178921A5080777665 @default.
- W4313178921 hasBestOaLocation W43131789212 @default.
- W4313178921 hasConcept C121332964 @default.
- W4313178921 hasConcept C153180895 @default.
- W4313178921 hasConcept C154945302 @default.
- W4313178921 hasConcept C165801399 @default.
- W4313178921 hasConcept C185592680 @default.
- W4313178921 hasConcept C188027245 @default.
- W4313178921 hasConcept C204321447 @default.
- W4313178921 hasConcept C23123220 @default.
- W4313178921 hasConcept C2780226545 @default.
- W4313178921 hasConcept C31972630 @default.
- W4313178921 hasConcept C38652104 @default.
- W4313178921 hasConcept C41008148 @default.
- W4313178921 hasConcept C48145219 @default.
- W4313178921 hasConcept C62520636 @default.
- W4313178921 hasConcept C66322947 @default.
- W4313178921 hasConcept C71139939 @default.
- W4313178921 hasConceptScore W4313178921C121332964 @default.
- W4313178921 hasConceptScore W4313178921C153180895 @default.
- W4313178921 hasConceptScore W4313178921C154945302 @default.
- W4313178921 hasConceptScore W4313178921C165801399 @default.
- W4313178921 hasConceptScore W4313178921C185592680 @default.
- W4313178921 hasConceptScore W4313178921C188027245 @default.
- W4313178921 hasConceptScore W4313178921C204321447 @default.
- W4313178921 hasConceptScore W4313178921C23123220 @default.
- W4313178921 hasConceptScore W4313178921C2780226545 @default.
- W4313178921 hasConceptScore W4313178921C31972630 @default.
- W4313178921 hasConceptScore W4313178921C38652104 @default.
- W4313178921 hasConceptScore W4313178921C41008148 @default.
- W4313178921 hasConceptScore W4313178921C48145219 @default.
- W4313178921 hasConceptScore W4313178921C62520636 @default.
- W4313178921 hasConceptScore W4313178921C66322947 @default.
- W4313178921 hasConceptScore W4313178921C71139939 @default.
- W4313178921 hasLocation W43131789211 @default.
- W4313178921 hasLocation W43131789212 @default.
- W4313178921 hasOpenAccess W4313178921 @default.
- W4313178921 hasPrimaryLocation W43131789211 @default.
- W4313178921 hasRelatedWork W1891287906 @default.
- W4313178921 hasRelatedWork W1969923398 @default.
- W4313178921 hasRelatedWork W2036807459 @default.
- W4313178921 hasRelatedWork W2166024367 @default.