Matches in SemOpenAlex for { <https://semopenalex.org/work/W3159968155> ?p ?o ?g. }
- W3159968155 abstract "This dissertation proposes the study of multimodal learning in the context of musical signals. Throughout, we focus on the interaction between audio signals and text information. Among the many text sources related to music that can be used (e.g. reviews, metadata, or social network feedback), we concentrate on lyrics. The singing voice directly connects the audio signal and the text information in a unique way, combining melody and lyrics, where a linguistic dimension complements the abstraction of musical instruments. Our study focuses on the interaction between audio and lyrics, targeting source separation and informed content estimation. Real-world stimuli are produced by complex phenomena and their constant interaction in various domains. Our understanding learns useful abstractions that fuse different modalities into a joint representation. Multimodal learning describes methods that analyse phenomena from different modalities and their interaction in order to tackle complex tasks. This results in better and richer representations that improve the performance of current machine learning methods. To develop our multimodal analysis, we first need to address the lack of data containing singing voice with aligned lyrics; this data is essential to develop our ideas. Therefore, we investigate how to create such a dataset automatically, leveraging resources from the World Wide Web. Creating this type of dataset is a challenge in itself that raises many research questions. We constantly face the classic “chicken or the egg” problem: acquiring and cleaning this data requires accurate models, but it is difficult to train models without data. We propose to use the teacher-student paradigm to develop a method in which dataset creation and model learning are not seen as independent tasks but rather as complementary efforts. In this process, non-expert karaoke annotations describe the lyrics as a sequence of time-aligned notes with their associated textual information. We then link each annotation to the correct audio and globally align the annotations to it. For this purpose, we use the normalized cross-correlation between the voice annotation sequence and the singing voice probability vector, the latter obtained automatically with a deep convolutional neural network. Using the collected data, we progressively improve that model; every time we have an improved version, we can in turn correct and enhance the data. Collecting data from the Internet comes at a price: it is error-prone. We propose a novel data cleansing technique (data cleansing being a well-studied topic for cleaning erroneous labels in datasets) to automatically identify the errors that remain, allowing us to estimate the overall accuracy of the dataset, select points that are correct, and improve erroneous data. Our model is trained by automatically contrasting likely correct label pairs against local deformations of them. We demonstrate that the accuracy of a transcription model improves greatly when trained on data filtered with our proposed strategy rather than on the original dataset. After developing the dataset, we center our efforts on exploring the interaction between lyrics and audio in two different tasks. First, we improve lyrics segmentation by combining lyrics and audio using a model-agnostic early fusion approach. As a pre-processing step, we create a coordinated representation as self-similarity matrices (SSMs) of the same dimensions for both domains.
This allows us to easily adapt an existing deep neural model to capture the structure of both domains. Through experiments, we show that each domain captures complementary information that benefits the overall performance. Second, we explore the problem of music source separation (i.e. isolating the different instruments that appear in an audio mixture) using conditioned learning. In this paradigm, we aim to effectively control data-driven models with context information. We present a novel approach based on the U-Net that implements conditioned learning using Feature-wise Linear Modulation (FiLM). We first formalise the problem as multitask source separation using weak conditioning. In this scenario, our method performs several instrument separations with a single model without losing performance, adding just a small number of parameters. This shows that we can effectively control a generic neural network with some external information. We then hypothesize that knowing the aligned phonetic information is beneficial for the vocal separation task and investigate how we can integrate conditioning mechanisms into informed source separation using strong conditioning. We adapt the FiLM technique to improve vocal source separation when the aligned phonetic sequence is known. We show that our strategy outperforms the standard non-conditioned architecture. Finally, we summarise our contributions, highlighting the main research questions we address and our proposed answers. We discuss in detail potential future work, addressing each task individually. We propose new use cases for our dataset as well as ways of improving its reliability, and analyse our conditioning approach and the different strategies to improve it." @default.
- W3159968155 created "2021-05-10" @default.
- W3159968155 creator A5062693009 @default.
- W3159968155 date "2020-07-09" @default.
- W3159968155 modified "2023-09-27" @default.
- W3159968155 title "MULTIMODAL ANALYSIS: Informed content estimation and audio source separation" @default.
- W3159968155 cites W1487040789 @default.
- W3159968155 cites W1489504608 @default.
- W3159968155 cites W1498436455 @default.
- W3159968155 cites W1514535095 @default.
- W3159968155 cites W1527575280 @default.
- W3159968155 cites W154472438 @default.
- W3159968155 cites W1566289585 @default.
- W3159968155 cites W1583001605 @default.
- W3159968155 cites W1628307106 @default.
- W3159968155 cites W1647671624 @default.
- W3159968155 cites W1665214252 @default.
- W3159968155 cites W1716320828 @default.
- W3159968155 cites W1876052200 @default.
- W3159968155 cites W1895577753 @default.
- W3159968155 cites W1901129140 @default.
- W3159968155 cites W1921293667 @default.
- W3159968155 cites W1974186229 @default.
- W3159968155 cites W1976069042 @default.
- W3159968155 cites W1986602091 @default.
- W3159968155 cites W1994550352 @default.
- W3159968155 cites W2004361267 @default.
- W3159968155 cites W2014470830 @default.
- W3159968155 cites W2016053056 @default.
- W3159968155 cites W2057745663 @default.
- W3159968155 cites W2059239154 @default.
- W3159968155 cites W2059363583 @default.
- W3159968155 cites W2060998236 @default.
- W3159968155 cites W2069681747 @default.
- W3159968155 cites W2075180943 @default.
- W3159968155 cites W2098796164 @default.
- W3159968155 cites W2098950531 @default.
- W3159968155 cites W2103267130 @default.
- W3159968155 cites W2107598941 @default.
- W3159968155 cites W2112796928 @default.
- W3159968155 cites W2123169318 @default.
- W3159968155 cites W2127851351 @default.
- W3159968155 cites W2128160875 @default.
- W3159968155 cites W2136504847 @default.
- W3159968155 cites W2136655611 @default.
- W3159968155 cites W2144707026 @default.
- W3159968155 cites W2144827818 @default.
- W3159968155 cites W2149557440 @default.
- W3159968155 cites W2150936750 @default.
- W3159968155 cites W2157110803 @default.
- W3159968155 cites W2158508307 @default.
- W3159968155 cites W2161632835 @default.
- W3159968155 cites W2164336224 @default.
- W3159968155 cites W2166444141 @default.
- W3159968155 cites W2191779130 @default.
- W3159968155 cites W2207196656 @default.
- W3159968155 cites W2252268321 @default.
- W3159968155 cites W2261310161 @default.
- W3159968155 cites W2287418003 @default.
- W3159968155 cites W2293078015 @default.
- W3159968155 cites W2293137601 @default.
- W3159968155 cites W2398264106 @default.
- W3159968155 cites W2398618787 @default.
- W3159968155 cites W2405774341 @default.
- W3159968155 cites W2406222150 @default.
- W3159968155 cites W2408688265 @default.
- W3159968155 cites W2408744528 @default.
- W3159968155 cites W2418033038 @default.
- W3159968155 cites W2475687244 @default.
- W3159968155 cites W2510642588 @default.
- W3159968155 cites W2519091744 @default.
- W3159968155 cites W2557865186 @default.
- W3159968155 cites W2559688696 @default.
- W3159968155 cites W2560254426 @default.
- W3159968155 cites W2563534197 @default.
- W3159968155 cites W2574634960 @default.
- W3159968155 cites W2575145750 @default.
- W3159968155 cites W2577008904 @default.
- W3159968155 cites W2586947700 @default.
- W3159968155 cites W2604555320 @default.
- W3159968155 cites W2626792426 @default.
- W3159968155 cites W2669032454 @default.
- W3159968155 cites W2707788252 @default.
- W3159968155 cites W2708109968 @default.
- W3159968155 cites W2711861986 @default.
- W3159968155 cites W2731277327 @default.
- W3159968155 cites W2764251778 @default.
- W3159968155 cites W2767290858 @default.
- W3159968155 cites W2775621926 @default.
- W3159968155 cites W2796571515 @default.
- W3159968155 cites W2799258971 @default.
- W3159968155 cites W2886247548 @default.
- W3159968155 cites W2886396981 @default.
- W3159968155 cites W2898963093 @default.
- W3159968155 cites W2902076983 @default.
- W3159968155 cites W2917340025 @default.
- W3159968155 cites W2938709889 @default.
- W3159968155 cites W2962753171 @default.
- W3159968155 cites W2962762068 @default.
- W3159968155 cites W2962762541 @default.
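The abstract above mentions globally aligning karaoke annotations to audio via the normalized cross-correlation between a voice annotation sequence and a singing-voice probability vector. The following is a minimal sketch of that alignment step, assuming frame-level sequences and a brute-force offset search; the function names, frame rate, and search range are illustrative assumptions, not details taken from the dissertation.

```python
# Sketch: align a binary "voice active" annotation sequence to a singing-voice
# probability curve (e.g. from a CNN) by maximizing normalized cross-correlation.
import numpy as np

def normalized_xcorr(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson-style normalized cross-correlation of two equal-length vectors."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def best_global_offset(annotation: np.ndarray,
                       voice_prob: np.ndarray,
                       max_shift: int) -> int:
    """Search integer frame shifts and return the one maximizing the correlation."""
    best_shift, best_score = 0, -np.inf
    n = min(len(annotation), len(voice_prob))
    for shift in range(-max_shift, max_shift + 1):
        if shift >= 0:
            a, b = annotation[:n - shift], voice_prob[shift:n]
        else:
            a, b = annotation[-shift:n], voice_prob[:n + shift]
        score = normalized_xcorr(a, b)
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift

# Toy usage: a 10-frame annotation against a simulated probability curve shifted by 2 frames.
ann = np.array([0, 0, 1, 1, 1, 0, 0, 1, 1, 0], dtype=float)
prob = np.roll(ann, 2) * 0.9 + 0.05
print(best_global_offset(ann, prob, max_shift=4))  # -> 2
```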
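The abstract also describes a data-cleansing model trained by contrasting likely correct label pairs against local deformations of them. Below is a minimal sketch of how such contrastive training pairs could be constructed, assuming note-onset annotations and random local jitter as the deformation; the jitter range and pair format are assumptions for illustration only.

```python
# Sketch: build (sequence, label) training pairs where the original alignment is
# treated as "clean" (1) and a locally deformed copy as "erroneous" (0). A binary
# classifier trained on such pairs can later flag suspicious annotations.
import numpy as np

rng = np.random.default_rng(seed=0)

def locally_deform(onsets: np.ndarray, max_jitter: float = 0.3) -> np.ndarray:
    """Return a 'negative' example: note onsets (in seconds) with random local jitter."""
    return onsets + rng.uniform(-max_jitter, max_jitter, size=onsets.shape)

def contrastive_pairs(onsets: np.ndarray):
    """Yield (onset sequence, label) pairs: original = 1, deformed = 0."""
    yield onsets, 1
    yield locally_deform(onsets), 0

# Toy usage: pairs derived from one annotated line of lyrics.
line_onsets = np.array([0.00, 0.45, 0.90, 1.60, 2.10])
for sequence, label in contrastive_pairs(line_onsets):
    print(label, np.round(sequence, 2))
```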
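For the lyrics segmentation task, the abstract describes an early-fusion pre-processing step that turns both domains into self-similarity matrices (SSMs) of the same dimensions. The sketch below illustrates that idea under simple assumptions: cosine self-similarity, nearest-neighbour resampling, and placeholder features; none of these are claimed to be the dissertation's exact choices.

```python
# Sketch: compute an SSM for audio frames and for lyric-line embeddings, resample
# both to a common size, and stack them as channels for a single segmentation model.
import numpy as np

def self_similarity(features: np.ndarray) -> np.ndarray:
    """Cosine self-similarity matrix for a (time, dim) feature sequence."""
    unit = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    return unit @ unit.T

def resize_ssm(ssm: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbour resampling of an SSM to a fixed (size, size) grid."""
    idx = np.linspace(0, ssm.shape[0] - 1, size).round().astype(int)
    return ssm[np.ix_(idx, idx)]

# Toy stand-ins for real features: audio frames and per-line lyric embeddings.
audio_feats = np.random.rand(500, 40)   # e.g. mel frames
lyric_feats = np.random.rand(30, 300)   # e.g. line embeddings

size = 128
fused = np.stack([resize_ssm(self_similarity(audio_feats), size),
                  resize_ssm(self_similarity(lyric_feats), size)])
print(fused.shape)  # (2, 128, 128): two aligned "channels" for the fusion model
```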
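Finally, the abstract describes conditioned source separation with Feature-wise Linear Modulation (FiLM), where external information (e.g. a one-hot instrument label under weak conditioning) controls a U-Net. The sketch below shows a generic FiLM layer; the tensor shapes and the small linear controller are illustrative assumptions rather than the dissertation's exact architecture.

```python
# Sketch: a FiLM layer maps a condition vector to per-channel scales (gamma) and
# shifts (beta) that modulate an intermediate feature map of a separation network.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, num_conditions: int, num_channels: int):
        super().__init__()
        # Controller: condition vector -> (gamma, beta) for every channel.
        self.controller = nn.Linear(num_conditions, 2 * num_channels)

    def forward(self, x: torch.Tensor, condition: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time); condition: (batch, num_conditions)
        gamma, beta = self.controller(condition).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]   # broadcast over freq and time
        beta = beta[:, :, None, None]
        return gamma * x + beta

# Toy usage: modulate a feature map with a batch of one-hot instrument labels.
film = FiLM(num_conditions=4, num_channels=16)
features = torch.randn(2, 16, 64, 100)
cond = torch.eye(4)[[0, 2]]                # two instrument conditions
print(film(features, cond).shape)          # torch.Size([2, 16, 64, 100])
```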