Matches in SemOpenAlex for { <https://semopenalex.org/work/W4386071468> ?p ?o ?g. }
Showing items 1 to 92 of 92 with 100 items per page.
- W4386071468 abstract "To build Video Question Answering (VideoQA) systems capable of assisting humans in daily activities, seeking answers from long-form videos with diverse and complex events is a must. Existing multi-modal VQA models achieve promising performance on images or short video clips, especially with the recent success of large-scale multi-modal pre-training. However, when extending these methods to long-form videos, new challenges arise. On the one hand, using a dense video sampling strategy is computationally prohibitive. On the other hand, methods relying on sparse sampling struggle in scenarios where multi-event and multi-granularity visual reasoning are required. In this work, we introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA. Specifically, MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules that adaptively select frames and image regions that are closely relevant to the question itself. Visual concepts at different granularities are then processed efficiently through an attention module. In addition, MIST iteratively conducts selection and attention over multiple layers to support reasoning over multiple events. The experimental results on four VideoQA datasets, including AGQA, NExT-QA, STAR, and Env-QA, show that MIST achieves state-of-the-art performance and is superior at efficiency. The code is available at github.com/showlab/mist." @default.
- W4386071468 created "2023-08-23" @default.
- W4386071468 creator A5001133932 @default.
- W4386071468 creator A5043617790 @default.
- W4386071468 creator A5068937750 @default.
- W4386071468 creator A5072690470 @default.
- W4386071468 creator A5075965915 @default.
- W4386071468 creator A5084879213 @default.
- W4386071468 date "2023-06-01" @default.
- W4386071468 modified "2023-09-27" @default.
- W4386071468 title "MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering" @default.
- W4386071468 cites W2277195237 @default.
- W4386071468 cites W2606982687 @default.
- W4386071468 cites W2765716052 @default.
- W4386071468 cites W2808181286 @default.
- W4386071468 cites W2886641317 @default.
- W4386071468 cites W2904452845 @default.
- W4386071468 cites W2951161814 @default.
- W4386071468 cites W2962949233 @default.
- W4386071468 cites W2963541336 @default.
- W4386071468 cites W2963890755 @default.
- W4386071468 cites W2964306921 @default.
- W4386071468 cites W2984008963 @default.
- W4386071468 cites W3034636873 @default.
- W4386071468 cites W3034727271 @default.
- W4386071468 cites W3034730770 @default.
- W4386071468 cites W3167092180 @default.
- W4386071468 cites W3168640669 @default.
- W4386071468 cites W3175859344 @default.
- W4386071468 cites W3175961224 @default.
- W4386071468 cites W3187433838 @default.
- W4386071468 cites W3197457832 @default.
- W4386071468 cites W3204588463 @default.
- W4386071468 cites W3204868383 @default.
- W4386071468 cites W3205786327 @default.
- W4386071468 cites W4214926101 @default.
- W4386071468 cites W4225414521 @default.
- W4386071468 cites W4285606530 @default.
- W4386071468 cites W4312246181 @default.
- W4386071468 cites W4312777269 @default.
- W4386071468 cites W4312864639 @default.
- W4386071468 cites W4313071966 @default.
- W4386071468 doi "https://doi.org/10.1109/cvpr52729.2023.01419" @default.
- W4386071468 hasPublicationYear "2023" @default.
- W4386071468 type Work @default.
- W4386071468 citedByCount "0" @default.
- W4386071468 crossrefType "proceedings-article" @default.
- W4386071468 hasAuthorship W4386071468A5001133932 @default.
- W4386071468 hasAuthorship W4386071468A5043617790 @default.
- W4386071468 hasAuthorship W4386071468A5068937750 @default.
- W4386071468 hasAuthorship W4386071468A5072690470 @default.
- W4386071468 hasAuthorship W4386071468A5075965915 @default.
- W4386071468 hasAuthorship W4386071468A5084879213 @default.
- W4386071468 hasConcept C121332964 @default.
- W4386071468 hasConcept C154945302 @default.
- W4386071468 hasConcept C165801399 @default.
- W4386071468 hasConcept C185592680 @default.
- W4386071468 hasConcept C188027245 @default.
- W4386071468 hasConcept C23123220 @default.
- W4386071468 hasConcept C41008148 @default.
- W4386071468 hasConcept C44291984 @default.
- W4386071468 hasConcept C62520636 @default.
- W4386071468 hasConcept C66322947 @default.
- W4386071468 hasConcept C71139939 @default.
- W4386071468 hasConceptScore W4386071468C121332964 @default.
- W4386071468 hasConceptScore W4386071468C154945302 @default.
- W4386071468 hasConceptScore W4386071468C165801399 @default.
- W4386071468 hasConceptScore W4386071468C185592680 @default.
- W4386071468 hasConceptScore W4386071468C188027245 @default.
- W4386071468 hasConceptScore W4386071468C23123220 @default.
- W4386071468 hasConceptScore W4386071468C41008148 @default.
- W4386071468 hasConceptScore W4386071468C44291984 @default.
- W4386071468 hasConceptScore W4386071468C62520636 @default.
- W4386071468 hasConceptScore W4386071468C66322947 @default.
- W4386071468 hasConceptScore W4386071468C71139939 @default.
- W4386071468 hasFunder F4320320709 @default.
- W4386071468 hasLocation W43860714681 @default.
- W4386071468 hasOpenAccess W4386071468 @default.
- W4386071468 hasPrimaryLocation W43860714681 @default.
- W4386071468 hasRelatedWork W15319282 @default.
- W4386071468 hasRelatedWork W1594455022 @default.
- W4386071468 hasRelatedWork W2120435877 @default.
- W4386071468 hasRelatedWork W2351286801 @default.
- W4386071468 hasRelatedWork W2356380379 @default.
- W4386071468 hasRelatedWork W2357241418 @default.
- W4386071468 hasRelatedWork W2361152157 @default.
- W4386071468 hasRelatedWork W2805599431 @default.
- W4386071468 hasRelatedWork W4381058564 @default.
- W4386071468 hasRelatedWork W80640783 @default.
- W4386071468 isParatext "false" @default.
- W4386071468 isRetracted "false" @default.
- W4386071468 workType "article" @default.