Matches in SemOpenAlex for { <https://semopenalex.org/work/W4287756239> ?p ?o ?g. }
Showing items 1 to 77 of 77 with 100 items per page.
- W4287756239 abstract "Current methods for learning visually grounded language from videos often rely on text annotation, such as human generated captions or machine generated automatic speech recognition (ASR) transcripts. In this work, we introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs. To circumvent the need for text annotation, we learn audio-visual representations from randomly segmented video clips and their raw audio waveforms. We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks, achieving state-of-the-art performance. We perform analysis of AVLnet's learned representations, showing our model utilizes speech and natural sounds to learn audio-visual concepts. Further, we propose a tri-modal model that jointly processes raw audio, video, and text captions from videos to learn a multi-modal semantic embedding space useful for text-video retrieval. Our code, data, and trained models will be released at avlnet.csail.mit.edu" @default.
- W4287756239 created "2022-07-26" @default.
- W4287756239 creator A5003725957 @default.
- W4287756239 creator A5004717608 @default.
- W4287756239 creator A5010091252 @default.
- W4287756239 creator A5011795407 @default.
- W4287756239 creator A5015927589 @default.
- W4287756239 creator A5034529775 @default.
- W4287756239 creator A5038527788 @default.
- W4287756239 creator A5039081803 @default.
- W4287756239 creator A5049734237 @default.
- W4287756239 creator A5052325109 @default.
- W4287756239 creator A5056869363 @default.
- W4287756239 creator A5060883710 @default.
- W4287756239 creator A5085020955 @default.
- W4287756239 creator A5086597264 @default.
- W4287756239 date "2020-06-16" @default.
- W4287756239 modified "2023-09-26" @default.
- W4287756239 title "AVLnet: Learning Audio-Visual Language Representations from Instructional Videos" @default.
- W4287756239 doi "https://doi.org/10.48550/arxiv.2006.09199" @default.
- W4287756239 hasPublicationYear "2020" @default.
- W4287756239 type Work @default.
- W4287756239 citedByCount "0" @default.
- W4287756239 crossrefType "posted-content" @default.
- W4287756239 hasAuthorship W4287756239A5003725957 @default.
- W4287756239 hasAuthorship W4287756239A5004717608 @default.
- W4287756239 hasAuthorship W4287756239A5010091252 @default.
- W4287756239 hasAuthorship W4287756239A5011795407 @default.
- W4287756239 hasAuthorship W4287756239A5015927589 @default.
- W4287756239 hasAuthorship W4287756239A5034529775 @default.
- W4287756239 hasAuthorship W4287756239A5038527788 @default.
- W4287756239 hasAuthorship W4287756239A5039081803 @default.
- W4287756239 hasAuthorship W4287756239A5049734237 @default.
- W4287756239 hasAuthorship W4287756239A5052325109 @default.
- W4287756239 hasAuthorship W4287756239A5056869363 @default.
- W4287756239 hasAuthorship W4287756239A5060883710 @default.
- W4287756239 hasAuthorship W4287756239A5085020955 @default.
- W4287756239 hasAuthorship W4287756239A5086597264 @default.
- W4287756239 hasBestOaLocation W42877562391 @default.
- W4287756239 hasConcept C154945302 @default.
- W4287756239 hasConcept C155635449 @default.
- W4287756239 hasConcept C157968479 @default.
- W4287756239 hasConcept C204321447 @default.
- W4287756239 hasConcept C2776321320 @default.
- W4287756239 hasConcept C28490314 @default.
- W4287756239 hasConcept C3017588708 @default.
- W4287756239 hasConcept C41008148 @default.
- W4287756239 hasConcept C41608201 @default.
- W4287756239 hasConcept C49774154 @default.
- W4287756239 hasConcept C61328038 @default.
- W4287756239 hasConceptScore W4287756239C154945302 @default.
- W4287756239 hasConceptScore W4287756239C155635449 @default.
- W4287756239 hasConceptScore W4287756239C157968479 @default.
- W4287756239 hasConceptScore W4287756239C204321447 @default.
- W4287756239 hasConceptScore W4287756239C2776321320 @default.
- W4287756239 hasConceptScore W4287756239C28490314 @default.
- W4287756239 hasConceptScore W4287756239C3017588708 @default.
- W4287756239 hasConceptScore W4287756239C41008148 @default.
- W4287756239 hasConceptScore W4287756239C41608201 @default.
- W4287756239 hasConceptScore W4287756239C49774154 @default.
- W4287756239 hasConceptScore W4287756239C61328038 @default.
- W4287756239 hasLocation W42877562391 @default.
- W4287756239 hasOpenAccess W4287756239 @default.
- W4287756239 hasPrimaryLocation W42877562391 @default.
- W4287756239 hasRelatedWork W1587401114 @default.
- W4287756239 hasRelatedWork W180632291 @default.
- W4287756239 hasRelatedWork W2121486117 @default.
- W4287756239 hasRelatedWork W2122924390 @default.
- W4287756239 hasRelatedWork W2403424637 @default.
- W4287756239 hasRelatedWork W2525342915 @default.
- W4287756239 hasRelatedWork W2794873916 @default.
- W4287756239 hasRelatedWork W4300529166 @default.
- W4287756239 hasRelatedWork W90026711 @default.
- W4287756239 hasRelatedWork W1591608209 @default.
- W4287756239 isParatext "false" @default.
- W4287756239 isRetracted "false" @default.
- W4287756239 workType "article" @default.