Artificial Intelligence
14th March 2023
Thulasiram Gunipati
Earlier, to play music, we would first decide what we wanted to listen to, search for the right CD, insert it into the CD player, and then enjoy it. Every time the CD finished playing, we had to replay or replace it. Then technology evolved to the point where you can simply say, "Alexa! Play my favourite music." You can give the command while cooking, reading, or bathing, from anywhere, as long as Alexa can hear you and understand what you said. Yes, I know, social media is full of videos showing the adverse side of voice-activated AI tools; we can discuss those later.

That is the listening side of the story. What does it take to create a song? Someone writes the lyrics, someone sets them to a tune, and musicians gather to rehearse before the song is recorded in a studio. Even if these steps do not happen in exactly this sequence, all of them are generally required to create a song. They also cost a lot, which in turn drives up music licensing costs. What if you could provide a lyric, mention the mood or genre of the song and an artist you would have loved to hear sing it, and the AI generated the music for you? BOOM! We will also discuss the ethical and other challenges this raises. Read today's blog to learn whether this is doable and how it can be done using Artificial Intelligence.

Why MusicLM

The goal of MusicLM is to generate high-quality music from a given text description. In particular:

- It should be able to generate music that stays consistent over several minutes.
- It can be conditioned on both text and a melody: it can transform whistled and hummed melodies according to the style described in a text caption.
- The work also introduces MusicCaps, an evaluation dataset collected specifically for text-to-music generation.

Previous works

- Quantization: the goal is to provide a compact, discrete representation which at the same time allows high-fidelity reconstruction (VQ-VAE).
- SoundStream is a universal neural audio codec, a tokenizer that compresses audio and reconstructs it. SoundStream uses residual vector quantization (RVQ).
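To make residual vector quantization concrete, here is a minimal NumPy sketch. This is not the SoundStream implementation; the number of stages, codebook size, and dimensionality are made up for illustration, and the codebooks are random rather than learned. The key idea is that each quantizer stage encodes the residual left over by the previous stage.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 3 quantizer stages, each with a codebook of 16 vectors in 8-D.
# Real codecs like SoundStream use more stages and larger codebooks.
num_stages, codebook_size, dim = 3, 16, 8
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(num_stages)]

def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage quantizes the remaining residual."""
    residual = x.copy()
    tokens = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))  # nearest code
        tokens.append(idx)
        residual = residual - cb[idx]  # pass what is left to the next stage
    return tokens

def rvq_decode(tokens, codebooks):
    """Reconstruction is simply the sum of the selected code vectors."""
    return sum(cb[idx] for cb, idx in zip(codebooks, tokens))

x = rng.normal(size=dim)            # stand-in for one encoder output frame
tokens = rvq_encode(x, codebooks)   # e.g. [5, 11, 2]: the discrete "tokens"
x_hat = rvq_decode(tokens, codebooks)
print(tokens, np.linalg.norm(x - x_hat))  # reconstruction error
```

With random codebooks the reconstruction error is large; in a trained codec the codebooks are learned jointly with the encoder and decoder, so that the summed code vectors reconstruct audio with high fidelity and each extra stage shrinks the error.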
Components of the Model

The components below are trained individually.

SoundStream

- It is a universal neural audio codec: it is used not only to compress audio but also to reconstruct it.
- Quantization provides a compact, discrete representation of the audio.
- Residual vector quantization (RVQ) is used in this case.
- In the literature, the resulting tokens are called acoustic tokens.
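SoundStream itself is not open source, but EnCodec (Meta's RVQ-based neural codec, installable with `pip install encodec`) exposes the same kind of interface and makes the notion of acoustic tokens tangible. A minimal sketch, assuming a local audio file named test.wav exists:

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Pretrained 24 kHz codec; the target bandwidth controls how many RVQ stages are kept.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

wav, sr = torchaudio.load("test.wav")  # assumed input file
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    encoded_frames = model.encode(wav)

# Shape [batch, n_quantizers, time]: one integer token per RVQ stage per frame.
codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)
print(codes.shape)

with torch.no_grad():
    reconstruction = model.decode(encoded_frames)  # tokens back to a waveform
```

Each audio frame becomes a small stack of integers, one per RVQ stage; these are exactly the kind of discrete acoustic tokens a language model can be trained to predict.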
w2v-BERT

- It is a masked language model (MLM) for audio. Embeddings are extracted from its 7th layer and quantized using the centroids of a k-means fitted over those embeddings.
- In the literature, the resulting tokens are called semantic tokens.
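The quantization step itself is easy to sketch. In the snippet below, random vectors stand in for the 7th-layer w2v-BERT embeddings, and the vocabulary size of 1024 is an assumption for illustration: k-means is fitted once over a corpus of embeddings, and each frame's semantic token is then the id of its nearest centroid.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)

# Stand-in for 7th-layer w2v-BERT embeddings: 10k frames of 1024-D vectors.
corpus_embeddings = rng.normal(size=(10_000, 1024))

# Fit k-means once over the corpus; the centroids become the token vocabulary.
kmeans = MiniBatchKMeans(n_clusters=1024, n_init=3, random_state=0)
kmeans.fit(corpus_embeddings)

# At tokenization time, each frame's token is the id of its nearest centroid.
clip_embeddings = rng.normal(size=(250, 1024))   # frames of one audio clip
semantic_tokens = kmeans.predict(clip_embeddings)
print(semantic_tokens[:10])  # e.g. [412   7 903 ...]
```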
MuLan

- It is a joint music-text embedding model consisting of two embedding towers, one for each modality.
- The towers map the two modalities to a shared embedding space of 128 dimensions using contrastive learning.
- In the literature, the resulting tokens are called audio tokens.

A minimal sketch of this two-tower contrastive setup follows.
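This is a toy PyTorch version of the idea, not the actual MuLan networks: the tower architectures, input feature sizes, and the temperature are all placeholder assumptions. Each tower projects its modality into the shared 128-dimensional space, and an InfoNCE-style contrastive loss pulls matched audio-text pairs together while pushing mismatched pairs apart.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """Placeholder encoder: maps precomputed modality features to 128-D."""
    def __init__(self, in_dim, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, out_dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit vectors in shared space

audio_tower, text_tower = Tower(in_dim=2048), Tower(in_dim=768)

# A batch of matched (audio, text) feature pairs; random stand-ins here.
audio_feats, text_feats = torch.randn(32, 2048), torch.randn(32, 768)
a, t = audio_tower(audio_feats), text_tower(text_feats)

# InfoNCE: row i of the similarity matrix should peak at column i.
logits = (a @ t.T) / 0.07            # temperature is an assumed hyperparameter
labels = torch.arange(len(logits))
loss = (F.cross_entropy(logits, labels) +
        F.cross_entropy(logits.T, labels)) / 2
loss.backward()
```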
Evaluation of the Model

To evaluate the model, the MusicCaps dataset was prepared. It consists of 5.5k music clips, each paired with a corresponding text description in English written by ten professional musicians. For each 10-second music clip, MusicCaps provides (1) a free-text caption, consisting of four sentences on average, that describes the music, and (2) a list of music aspects describing genre, mood, tempo, singer voices, instrumentation, dissonances, rhythm, and so on. On average, the dataset includes eleven aspects per clip.

Methods used for evaluation

The main automatic metrics were the Kullback-Leibler divergence (KLD), computed between the class predictions of an audio classifier on generated versus reference audio, and the MuLan Cycle Consistency (MCC), which measures the agreement between the MuLan embeddings of the text prompt and of the generated audio.

Results

MusicLM was compared with Mubert and Riffusion, and it performed better than both. When transformer models are trained to predict the acoustic tokens directly from MuLan tokens, skipping the semantic stage, performance drops on both the KLD and MCC metrics. Semantic modelling facilitates adherence to the text description.

Challenges and risks with music generation

A key risk is the potential misappropriation of creative content. The authors therefore conducted a thorough study of memorization and found that, when feeding MuLan embeddings to MusicLM, the sequences of generated tokens differ significantly from the corresponding sequences in the training set. A toy sketch of such a memorization check follows.
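This is a deliberately simplified stand-in for that kind of study (the paper's methodology compares generated semantic tokens against the training set with exact and approximate matching; the sequences and the matching rule below are invented for illustration): for each generated token sequence, we look up the training sequence with the highest fraction of agreeing positions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, seq_len = 1024, 250

# Stand-ins: token sequences from the training set and generated sequences.
train_seqs = rng.integers(0, vocab, size=(1000, seq_len))
generated = rng.integers(0, vocab, size=(10, seq_len))

def max_match_fraction(gen, train_seqs):
    """Highest per-position agreement between gen and any training sequence."""
    matches = (train_seqs == gen).mean(axis=1)  # fraction matching, per row
    return matches.max()

scores = [max_match_fraction(g, train_seqs) for g in generated]
print(np.round(scores, 3))
```

A score near 1.0 would flag a near-verbatim copy of training material; for MusicLM, the authors report that generated sequences differ significantly from the training sequences.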