Matches in SemOpenAlex for { <https://semopenalex.org/work/W3190982062> ?p ?o ?g. }
Showing items 1 to 64 of 64, with 100 items per page.
- W3190982062 abstract "Multimedia, integrating different modalities such as text, image, and video, provides users with great convenience in the digital era. Researchers have been building multimedia infrastructure over recent decades, and nowadays multimedia content can be delivered to almost everyone, almost everywhere. With the fast development of the media world, the multimedia research community has focused its attention on multimedia content analytics, which aims to recognise and represent semantic information from various data sources and content types. Vision and language are two representative content forms among the many multimedia formats. This dissertation investigates the interactions between the vision and language modalities to enhance the comprehension ability of multimedia content analytics methods. The main challenges of multimedia content analytics come from the feature representations of visual and textual content, the intrinsic modality gap between them, and the time-consuming training process. From the visual side, although a convolutional neural network based model can extract visual features that are effective for conventional computer vision tasks such as image classification, the learned representations have limitations when generalising to advanced visual comprehension tasks such as image captioning and visual dialogue. From the language side, language models learn word embeddings for textual content representation and generation. To generate high-quality text, an image caption for instance, the model must be trained with high-quality data. However, the quality of the training data cannot be guaranteed, and imperfect annotations inevitably lead to output of subpar quality. In addition, the modality transition and model training efficiency problems are worth investigating to further enhance model usability.
To address the aforementioned challenges, this dissertation concentrates on model effectiveness and efficiency. Firstly, the depth map and scene graph are exploited to enhance the visual representations derived from the image. Chapter 2 introduces a depth-aware attention model for image paragraph captioning: a depth map is estimated to augment the visual cues for more accurate, logical, and diverse paragraph generation. Chapter 3 discovers object relationships for the visual dialogue model: the objects and their interactions are extracted from the image to form a scene graph, and the graph structure is then preserved in a novel hierarchical graph convolutional network, so that the dialogue reasoning module can benefit from the comprehensive visual features extracted via this process. Secondly, the effectiveness of the language model is investigated in Chapter 4. A number of annotation quality issues are identified in image caption training data collected from an online crowd-sourcing platform, and a human-consensus loss is proposed to let the model learn from imperfectly annotated training data by prioritising high-quality annotations during training. Thirdly, the modality gap between vision and language is explicitly addressed with a modality transition module in Chapter 5, which ensures a smooth transition from visual features to semantic embeddings for more precise and context-aware caption generation. Lastly, Chapter 6 considers the training efficiency of the image captioning model: the training inefficiency is addressed with a well-engineered attention mechanism that can be trained in parallel, so the training time is significantly reduced whilst competitive model performance is maintained." @default.
- W3190982062 created "2021-08-16" @default.
- W3190982062 creator A5045134787 @default.
- W3190982062 date "2021-08-02" @default.
- W3190982062 modified "2023-09-24" @default.
- W3190982062 title "Multimedia content analytics with modality transition" @default.
- W3190982062 doi "https://doi.org/10.14264/accb67e" @default.
- W3190982062 hasPublicationYear "2021" @default.
- W3190982062 type Work @default.
- W3190982062 sameAs 3190982062 @default.
- W3190982062 citedByCount "0" @default.
- W3190982062 crossrefType "dissertation" @default.
- W3190982062 hasAuthorship W3190982062A5045134787 @default.
- W3190982062 hasConcept C111472728 @default.
- W3190982062 hasConcept C115961682 @default.
- W3190982062 hasConcept C138885662 @default.
- W3190982062 hasConcept C144024400 @default.
- W3190982062 hasConcept C154945302 @default.
- W3190982062 hasConcept C157657479 @default.
- W3190982062 hasConcept C199360897 @default.
- W3190982062 hasConcept C2522767166 @default.
- W3190982062 hasConcept C2779530757 @default.
- W3190982062 hasConcept C2779903281 @default.
- W3190982062 hasConcept C2780226545 @default.
- W3190982062 hasConcept C36289849 @default.
- W3190982062 hasConcept C41008148 @default.
- W3190982062 hasConcept C49774154 @default.
- W3190982062 hasConcept C511192102 @default.
- W3190982062 hasConcept C79158427 @default.
- W3190982062 hasConcept C81363708 @default.
- W3190982062 hasConceptScore W3190982062C111472728 @default.
- W3190982062 hasConceptScore W3190982062C115961682 @default.
- W3190982062 hasConceptScore W3190982062C138885662 @default.
- W3190982062 hasConceptScore W3190982062C144024400 @default.
- W3190982062 hasConceptScore W3190982062C154945302 @default.
- W3190982062 hasConceptScore W3190982062C157657479 @default.
- W3190982062 hasConceptScore W3190982062C199360897 @default.
- W3190982062 hasConceptScore W3190982062C2522767166 @default.
- W3190982062 hasConceptScore W3190982062C2779530757 @default.
- W3190982062 hasConceptScore W3190982062C2779903281 @default.
- W3190982062 hasConceptScore W3190982062C2780226545 @default.
- W3190982062 hasConceptScore W3190982062C36289849 @default.
- W3190982062 hasConceptScore W3190982062C41008148 @default.
- W3190982062 hasConceptScore W3190982062C49774154 @default.
- W3190982062 hasConceptScore W3190982062C511192102 @default.
- W3190982062 hasConceptScore W3190982062C79158427 @default.
- W3190982062 hasConceptScore W3190982062C81363708 @default.
- W3190982062 hasLocation W31909820621 @default.
- W3190982062 hasOpenAccess W3190982062 @default.
- W3190982062 hasPrimaryLocation W31909820621 @default.
- W3190982062 hasRelatedWork W1013863 @default.
- W3190982062 hasRelatedWork W10589481 @default.
- W3190982062 hasRelatedWork W12449752 @default.
- W3190982062 hasRelatedWork W12547364 @default.
- W3190982062 hasRelatedWork W13662834 @default.
- W3190982062 hasRelatedWork W4448860 @default.
- W3190982062 hasRelatedWork W6244538 @default.
- W3190982062 hasRelatedWork W8107681 @default.
- W3190982062 hasRelatedWork W8219677 @default.
- W3190982062 hasRelatedWork W8503766 @default.
- W3190982062 isParatext "false" @default.
- W3190982062 isRetracted "false" @default.
- W3190982062 magId "3190982062" @default.
- W3190982062 workType "dissertation" @default.
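The triples above are the result of the `{ <https://semopenalex.org/work/W3190982062> ?p ?o ?g. }` pattern shown in the page header. A minimal sketch of how such a lookup could be built and its results parsed is below. The query string and the JSON parsing follow the standard SPARQL 1.1 query and results formats; the sample payload and its predicate IRIs are hand-written illustrations, not actual SemOpenAlex output, and the SemOpenAlex SPARQL endpoint address should be checked against the project's own documentation before use.

```python
import json

# SPARQL SELECT query mirroring the pattern at the top of this page:
# all predicate/object pairs for the work W3190982062.
WORK_IRI = "https://semopenalex.org/work/W3190982062"
QUERY = f"""
SELECT ?p ?o WHERE {{
  <{WORK_IRI}> ?p ?o .
}}
"""

# Standard SPARQL 1.1 JSON results shape (application/sparql-results+json).
# This sample is hand-written to echo two triples from this page; the
# predicate IRIs are illustrative assumptions, not SemOpenAlex's actual ones.
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["p", "o"]},
    "results": {"bindings": [
        {"p": {"type": "uri", "value": "http://purl.org/dc/terms/created"},
         "o": {"type": "literal", "value": "2021-08-16"}},
        {"p": {"type": "uri", "value": "http://purl.org/dc/terms/title"},
         "o": {"type": "literal",
               "value": "Multimedia content analytics with modality transition"}},
    ]},
})

def parse_bindings(payload: str) -> list:
    """Flatten SPARQL JSON results into (predicate, object) string pairs."""
    data = json.loads(payload)
    return [(b["p"]["value"], b["o"]["value"])
            for b in data["results"]["bindings"]]

for pred, obj in parse_bindings(SAMPLE_RESPONSE):
    print(pred, "->", obj)
```

To run this against a live endpoint, `QUERY` would be sent over the SPARQL 1.1 protocol (an HTTP GET or POST with `Accept: application/sparql-results+json`) and the response body passed to `parse_bindings` unchanged.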