Matches in SemOpenAlex for { <https://semopenalex.org/work/W3209013111> ?p ?o ?g. }
- W3209013111 abstract "The recent success of transformer models in language, such as BERT, has motivated the use of such architectures for multi-modal feature learning and tasks. However, most multi-modal variants (e.g., ViLBERT) have limited themselves to visual-linguistic data. Relatively few have explored its use in audio-visual modalities, and none, to our knowledge, illustrate them in the context of granular audio-visual detection or segmentation tasks such as sound source separation and localization. In this work, we introduce TriBERT -- a transformer-based architecture, inspired by ViLBERT, which enables contextual feature learning across three modalities: vision, pose, and audio, with the use of flexible co-attention. The use of pose keypoints is inspired by recent works that illustrate that such representations can significantly boost performance in many audio-visual scenarios where often one or more persons are responsible for the sound explicitly (e.g., talking) or implicitly (e.g., sound produced as a function of human manipulating an object). From a technical perspective, as part of the TriBERT architecture, we introduce a learned visual tokenization scheme based on spatial attention and leverage weak-supervision to allow granular cross-modal interactions for visual and pose modalities. Further, we supplement learning with sound-source separation loss formulated across all three streams. We pre-train our model on the large MUSIC21 dataset and demonstrate improved performance in audio-visual sound source separation on that dataset as well as other datasets through fine-tuning. In addition, we show that the learned TriBERT representations are generic and significantly improve performance on other audio-visual tasks such as cross-modal audio-visual-pose retrieval by as much as 66.7% in top-1 accuracy." @default.
- W3209013111 created "2021-11-08" @default.
- W3209013111 creator A5042674299 @default.
- W3209013111 creator A5053011888 @default.
- W3209013111 creator A5059980933 @default.
- W3209013111 date "2021-10-26" @default.
- W3209013111 modified "2023-09-27" @default.
- W3209013111 title "TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation" @default.
- W3209013111 cites W1797158261 @default.
- W3209013111 cites W2106488367 @default.
- W3209013111 cites W2120847449 @default.
- W3209013111 cites W2143169494 @default.
- W3209013111 cites W2163922914 @default.
- W3209013111 cites W2194775991 @default.
- W3209013111 cites W2407685581 @default.
- W3209013111 cites W2526050071 @default.
- W3209013111 cites W2619697695 @default.
- W3209013111 cites W2651884604 @default.
- W3209013111 cites W2660943524 @default.
- W3209013111 cites W2773686055 @default.
- W3209013111 cites W2783457476 @default.
- W3209013111 cites W2798122215 @default.
- W3209013111 cites W2944294033 @default.
- W3209013111 cites W2962699416 @default.
- W3209013111 cites W2962715207 @default.
- W3209013111 cites W2962865004 @default.
- W3209013111 cites W2962960500 @default.
- W3209013111 cites W2962970472 @default.
- W3209013111 cites W2963403868 @default.
- W3209013111 cites W2963603913 @default.
- W3209013111 cites W2963781481 @default.
- W3209013111 cites W2963801643 @default.
- W3209013111 cites W2964001806 @default.
- W3209013111 cites W2964048159 @default.
- W3209013111 cites W2964207404 @default.
- W3209013111 cites W2970608575 @default.
- W3209013111 cites W2981851635 @default.
- W3209013111 cites W2982619606 @default.
- W3209013111 cites W2988200020 @default.
- W3209013111 cites W2995460200 @default.
- W3209013111 cites W2996889020 @default.
- W3209013111 cites W2998356391 @default.
- W3209013111 cites W3017343282 @default.
- W3209013111 cites W3024979138 @default.
- W3209013111 cites W3034727271 @default.
- W3209013111 cites W3041053424 @default.
- W3209013111 cites W3096609285 @default.
- W3209013111 cites W3102619627 @default.
- W3209013111 cites W3118120400 @default.
- W3209013111 cites W3123318516 @default.
- W3209013111 cites W3123709248 @default.
- W3209013111 cites W3150049814 @default.
- W3209013111 cites W3154807520 @default.
- W3209013111 cites W3160817565 @default.
- W3209013111 cites W3121735241 @default.
- W3209013111 hasPublicationYear "2021" @default.
- W3209013111 type Work @default.
- W3209013111 sameAs 3209013111 @default.
- W3209013111 citedByCount "0" @default.
- W3209013111 crossrefType "posted-content" @default.
- W3209013111 hasAuthorship W3209013111A5042674299 @default.
- W3209013111 hasAuthorship W3209013111A5053011888 @default.
- W3209013111 hasAuthorship W3209013111A5059980933 @default.
- W3209013111 hasConcept C107457646 @default.
- W3209013111 hasConcept C119857082 @default.
- W3209013111 hasConcept C121332964 @default.
- W3209013111 hasConcept C144024400 @default.
- W3209013111 hasConcept C153083717 @default.
- W3209013111 hasConcept C154945302 @default.
- W3209013111 hasConcept C165801399 @default.
- W3209013111 hasConcept C2776864781 @default.
- W3209013111 hasConcept C2779903281 @default.
- W3209013111 hasConcept C28490314 @default.
- W3209013111 hasConcept C3017588708 @default.
- W3209013111 hasConcept C36289849 @default.
- W3209013111 hasConcept C41008148 @default.
- W3209013111 hasConcept C49774154 @default.
- W3209013111 hasConcept C59404180 @default.
- W3209013111 hasConcept C62520636 @default.
- W3209013111 hasConcept C66322947 @default.
- W3209013111 hasConceptScore W3209013111C107457646 @default.
- W3209013111 hasConceptScore W3209013111C119857082 @default.
- W3209013111 hasConceptScore W3209013111C121332964 @default.
- W3209013111 hasConceptScore W3209013111C144024400 @default.
- W3209013111 hasConceptScore W3209013111C153083717 @default.
- W3209013111 hasConceptScore W3209013111C154945302 @default.
- W3209013111 hasConceptScore W3209013111C165801399 @default.
- W3209013111 hasConceptScore W3209013111C2776864781 @default.
- W3209013111 hasConceptScore W3209013111C2779903281 @default.
- W3209013111 hasConceptScore W3209013111C28490314 @default.
- W3209013111 hasConceptScore W3209013111C3017588708 @default.
- W3209013111 hasConceptScore W3209013111C36289849 @default.
- W3209013111 hasConceptScore W3209013111C41008148 @default.
- W3209013111 hasConceptScore W3209013111C49774154 @default.
- W3209013111 hasConceptScore W3209013111C59404180 @default.
- W3209013111 hasConceptScore W3209013111C62520636 @default.
- W3209013111 hasConceptScore W3209013111C66322947 @default.
- W3209013111 hasLocation W32090131111 @default.
- W3209013111 hasOpenAccess W3209013111 @default.
- W3209013111 hasPrimaryLocation W32090131111 @default.
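
The rows above all follow one pattern: a subject, a predicate, an object (either a double-quoted literal or a bare identifier), and a graph label introduced by `@`. As a minimal sketch, assuming that layout holds for every row, the listing could be read back into tuples as shown below. The names `Row`, `ROW_RE`, and `parse_row` are hypothetical helpers introduced for illustration and are not part of SemOpenAlex or any library.

```python
import re
from typing import NamedTuple, Optional


class Row(NamedTuple):
    """One row of the listing (hypothetical container, not a SemOpenAlex type)."""
    subject: str
    predicate: str
    obj: str
    graph: str


# Assumed row layout: "- <subject> <predicate> <object> @<graph>."
# The object is either a double-quoted literal or a single bare token.
ROW_RE = re.compile(
    r'^-\s+(?P<s>\S+)\s+(?P<p>\S+)\s+'
    r'(?:"(?P<lit>.*)"|(?P<tok>\S+))'
    r'\s+@(?P<g>\w+)\.?\s*$',
    re.DOTALL,
)


def parse_row(line: str) -> Optional[Row]:
    """Parse one listing row; return None for lines that are not rows."""
    m = ROW_RE.match(line.strip())
    if m is None:
        return None
    # Prefer the quoted literal when present, otherwise the bare token.
    obj = m.group("lit") if m.group("lit") is not None else m.group("tok")
    return Row(m.group("s"), m.group("p"), obj, m.group("g"))


# A few rows copied from the listing above.
sample = [
    "Matches in SemOpenAlex for { <https://semopenalex.org/work/W3209013111> ?p ?o ?g. }",
    '- W3209013111 created "2021-11-08" @default.',
    "- W3209013111 cites W1797158261 @default.",
    "- W3209013111 cites W2106488367 @default.",
    "- W3209013111 type Work @default.",
]

rows = [r for r in map(parse_row, sample) if r is not None]
cited = [r.obj for r in rows if r.predicate == "cites"]
```

Here `cited` collects the objects of the `cites` rows, and the header line is skipped because it does not begin with `- `. A literal that itself contains double quotes is still captured whole, since the pattern anchors on the final `@<graph>` suffix.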