Artificial Intelligence
14th March 2023
Thulasiram Gunipati
Earlier, to play music, we would first decide what we wanted to listen to, search for the right CD, insert it into the CD player, and then enjoy it. Every time the CD finished playing, we had to replay or replace it. Then technology evolved to the point where you can simply say, "Alexa! Play my favourite music." You can give the command while cooking, reading, or bathing, from anywhere, as long as Alexa can hear you and understand what you said. Yes, I know, social media is full of videos showing the adverse side of voice-activated AI tools; we can discuss those later.

That is the listening side of the story. What does it take to create a song? Someone writes the lyrics, someone sets them to a tune, and musicians gather to rehearse before the song is recorded in a studio. Even if these steps do not happen in exactly this sequence, all of them are generally required to create a song. They also cost a lot, which in turn drives up music licensing costs. What if you could provide a lyric, mention the mood or genre of the song and an artist you would have loved to hear sing it, and the AI generated the music for you? BOOM! We will also discuss the ethical and other challenges this raises. Read today's blog to learn whether this is doable and how it can be done using Artificial Intelligence.

Why MusicLM

The goal of MusicLM is to generate high-quality music from a given text description. In particular:

- It should be able to generate music that stays consistent over several minutes.
- It can be conditioned on both text and a melody: it can transform whistled and hummed melodies according to the style described in a text caption.
- The work also introduces MusicCaps, an evaluation dataset collected specifically for text-to-music generation.

Previous works

- Quantization: the goal is to provide a compact, discrete representation which at the same time allows high-fidelity reconstruction (VQ-VAE).
- SoundStream is a universal neural audio codec, a tokenizer that compresses audio and reconstructs it. SoundStream uses residual vector quantization (RVQ).
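To make residual vector quantization concrete, here is a minimal NumPy sketch. This is not the SoundStream implementation; the number of stages, codebook size, and dimensionality are made up for illustration, and the codebooks are random rather than learned. The key idea is that each quantizer stage encodes the residual left over by the previous stage.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 3 quantizer stages, each with a codebook of 16 vectors in 8-D.
# Real codecs like SoundStream use more stages and larger codebooks.
num_stages, codebook_size, dim = 3, 16, 8
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(num_stages)]

def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage quantizes the remaining residual."""
    residual = x.copy()
    tokens = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))  # nearest code
        tokens.append(idx)
        residual = residual - cb[idx]  # pass what is left to the next stage
    return tokens

def rvq_decode(tokens, codebooks):
    """Reconstruction is simply the sum of the selected code vectors."""
    return sum(cb[idx] for cb, idx in zip(codebooks, tokens))

x = rng.normal(size=dim)            # stand-in for one encoder output frame
tokens = rvq_encode(x, codebooks)   # e.g. [5, 11, 2]: the discrete "tokens"
x_hat = rvq_decode(tokens, codebooks)
print(tokens, np.linalg.norm(x - x_hat))  # reconstruction error
```

With random codebooks the reconstruction error is large; in a trained codec the codebooks are learned jointly with the encoder and decoder, so that the summed code vectors reconstruct audio with high fidelity and each extra stage shrinks the error.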
Components of the Model

The components below are trained individually.

SoundStream

- It is a universal neural audio codec: it is used not only to compress audio but also to reconstruct it.
- Quantization provides a compact, discrete representation of the audio.
- Residual vector quantization (RVQ) is used in this case.
- In the literature, the resulting tokens are called acoustic tokens.
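SoundStream itself is not open source, but EnCodec (Meta's RVQ-based neural codec, installable with `pip install encodec`) exposes the same kind of interface and makes the notion of acoustic tokens tangible. A minimal sketch, assuming a local audio file named test.wav exists:

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Pretrained 24 kHz codec; the target bandwidth controls how many RVQ stages are kept.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

wav, sr = torchaudio.load("test.wav")  # assumed input file
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    encoded_frames = model.encode(wav)

# Shape [batch, n_quantizers, time]: one integer token per RVQ stage per frame.
codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)
print(codes.shape)

with torch.no_grad():
    reconstruction = model.decode(encoded_frames)  # tokens back to a waveform
```

Each audio frame becomes a small stack of integers, one per RVQ stage; these are exactly the kind of discrete acoustic tokens a language model can be trained to predict.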
w2v-BERT

- It is a masked language model (MLM) for audio. Embeddings are extracted from its 7th layer and quantized using the centroids of a k-means fitted over those embeddings.
- In the literature, the resulting tokens are called semantic tokens.
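The quantization step itself is easy to sketch. In the snippet below, random vectors stand in for the 7th-layer w2v-BERT embeddings, and the vocabulary size of 1024 is an assumption for illustration: k-means is fitted once over a corpus of embeddings, and each frame's semantic token is then the id of its nearest centroid.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)

# Stand-in for 7th-layer w2v-BERT embeddings: 10k frames of 1024-D vectors.
corpus_embeddings = rng.normal(size=(10_000, 1024))

# Fit k-means once over the corpus; the centroids become the token vocabulary.
kmeans = MiniBatchKMeans(n_clusters=1024, n_init=3, random_state=0)
kmeans.fit(corpus_embeddings)

# At tokenization time, each frame's token is the id of its nearest centroid.
clip_embeddings = rng.normal(size=(250, 1024))   # frames of one audio clip
semantic_tokens = kmeans.predict(clip_embeddings)
print(semantic_tokens[:10])  # e.g. [412   7 903 ...]
```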
MuLan

- It is a joint music-text embedding model consisting of two embedding towers, one for each modality.
- The towers map the two modalities to a shared embedding space of 128 dimensions using contrastive learning.
- In the literature, the resulting tokens are called audio tokens.

A minimal sketch of this two-tower contrastive setup follows.
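This is a toy PyTorch version of the idea, not the actual MuLan networks: the tower architectures, input feature sizes, and the temperature are all placeholder assumptions. Each tower projects its modality into the shared 128-dimensional space, and an InfoNCE-style contrastive loss pulls matched audio-text pairs together while pushing mismatched pairs apart.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """Placeholder encoder: maps precomputed modality features to 128-D."""
    def __init__(self, in_dim, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, out_dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit vectors in shared space

audio_tower, text_tower = Tower(in_dim=2048), Tower(in_dim=768)

# A batch of matched (audio, text) feature pairs; random stand-ins here.
audio_feats, text_feats = torch.randn(32, 2048), torch.randn(32, 768)
a, t = audio_tower(audio_feats), text_tower(text_feats)

# InfoNCE: row i of the similarity matrix should peak at column i.
logits = (a @ t.T) / 0.07            # temperature is an assumed hyperparameter
labels = torch.arange(len(logits))
loss = (F.cross_entropy(logits, labels) +
        F.cross_entropy(logits.T, labels)) / 2
loss.backward()
```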
Evaluation of the Model

To evaluate the model, the MusicCaps dataset was prepared. It consists of 5.5k music clips, each paired with a corresponding text description in English written by ten professional musicians. For each 10-second music clip, MusicCaps provides (1) a free-text caption, consisting of four sentences on average, that describes the music, and (2) a list of music aspects describing genre, mood, tempo, singer voices, instrumentation, dissonances, rhythm, and so on. On average, the dataset includes eleven aspects per clip.

Methods used for evaluation

The main automatic metrics were the Kullback-Leibler divergence (KLD), computed between the class predictions of an audio classifier on generated versus reference audio, and the MuLan Cycle Consistency (MCC), which measures the agreement between the MuLan embeddings of the text prompt and of the generated audio.

Results

MusicLM was compared with Mubert and Riffusion, and it performed better than both. When transformer models are trained to predict the acoustic tokens directly from MuLan tokens, skipping the semantic stage, performance drops on both the KLD and MCC metrics. Semantic modelling facilitates adherence to the text description.

Challenges and risks with music generation

A key risk is the potential misappropriation of creative content. The authors therefore conducted a thorough study of memorization and found that, when feeding MuLan embeddings to MusicLM, the sequences of generated tokens differ significantly from the corresponding sequences in the training set. A toy sketch of such a memorization check follows.
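This is a deliberately simplified stand-in for that kind of study (the paper's methodology compares generated semantic tokens against the training set with exact and approximate matching; the sequences and the matching rule below are invented for illustration): for each generated token sequence, we look up the training sequence with the highest fraction of agreeing positions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, seq_len = 1024, 250

# Stand-ins: token sequences from the training set and generated sequences.
train_seqs = rng.integers(0, vocab, size=(1000, seq_len))
generated = rng.integers(0, vocab, size=(10, seq_len))

def max_match_fraction(gen, train_seqs):
    """Highest per-position agreement between gen and any training sequence."""
    matches = (train_seqs == gen).mean(axis=1)  # fraction matching, per row
    return matches.max()

scores = [max_match_fraction(g, train_seqs) for g in generated]
print(np.round(scores, 3))
```

A score near 1.0 would flag a near-verbatim copy of training material; for MusicLM, the authors report that generated sequences differ significantly from the training sequences.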