Fast Music Source Separation

A music source separation system capable of isolating bass, drums, vocals and other instruments from a stereophonic audio mix is presented. The system was developed as part of my degree thesis, "Separación de fuentes musicales mediante redes neuronales convolucionales" ("Music source separation using convolutional neural networks"). The thesis report, the Python scripts used to perform separations and train the neural network, and DemixME, a standalone Windows x64 application I created that separates instruments in real time and allows remixing songs on the fly, are provided in the following links:

Separation demos (mixture, bass, drums, others and vocals for each song):

  • Patrick Talbot - A reason to leave
  • Secretariat - Over the Top
  • Hollow Ground - Ill fate
  • Bobby Nobody - Stitch Up

System overview

In the first stage, the audio mixture is turned into an image-like representation. This is achieved by applying a Short-Time Fourier Transform (STFT) and taking its magnitude. The result is a spectrogram, where the vertical axis corresponds to frequency and the horizontal axis to time.
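
For illustration, the magnitude spectrogram of a stereo mixture can be computed with librosa as in the following sketch (the file name and the STFT parameters are placeholders, not the values used in the thesis):

    import numpy as np
    import librosa

    N_FFT = 1024   # illustrative window size
    HOP = 256      # illustrative hop size

    def magnitude_spectrogram(path):
        # Load the stereo mixture without downmixing to mono
        audio, sr = librosa.load(path, sr=44100, mono=False)
        # One STFT per channel -> (channels, freq_bins, time_frames)
        stft = np.stack([librosa.stft(ch, n_fft=N_FFT, hop_length=HOP)
                         for ch in audio])
        return np.abs(stft), np.angle(stft), sr

    mag, phase, sr = magnitude_spectrogram("mixture.wav")  # assumes a stereo file
    print(mag.shape)  # (2, 513, n_frames)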

The second stage consists of turning the spectrogram of the audio mixture into the spectrogram of the desired source. This is achieved with a convolutional neural network trained on the DSD100 dataset, which provides mixtures and isolated sources (bass, drums, vocals and others) for 100 songs (50 for training and 50 for testing).
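
As a rough sketch, training pairs can be built from DSD100's standard folder layout (parallel Mixtures and Sources trees with Dev and Test splits); the STFT parameters below are illustrative, not the thesis configuration:

    import os
    import numpy as np
    import librosa

    SOURCES = ["bass", "drums", "other", "vocals"]

    def stereo_mag(path, n_fft=1024, hop=256):
        # Magnitude spectrogram per channel -> (2, freq_bins, time_frames)
        audio, _ = librosa.load(path, sr=44100, mono=False)
        return np.abs(np.stack([librosa.stft(ch, n_fft=n_fft, hop_length=hop)
                                for ch in audio]))

    def load_training_pair(dsd_root, song, split="Dev"):
        # DSD100 keeps mixtures and source stems in parallel folder trees
        mix = stereo_mag(os.path.join(dsd_root, "Mixtures", split, song, "mixture.wav"))
        targets = np.stack([
            stereo_mag(os.path.join(dsd_root, "Sources", split, song, name + ".wav"))
            for name in SOURCES
        ])
        return mix, targets  # network input and the four target spectrograms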

The proposed neural network is inspired by DeepConvSep.

Some of the differences from DeepConvSep are:

  • Two parallel encoders with different filter shapes are used. One of the encoders is designed to capture vertical patterns in the spectrogram, which correspond to percussive sounds, while the other one captures horizontal patterns, which correspond to more stationary sounds like a guitar chord.
  • The convolution layers in the encoders have ReLU activations and Batch Normalization is applied.
  • The latent space is larger.
  • The weights of the transposed convolution layers in the decoders are shared.
  • Data augmentation was performed, and the Others source was optimized.

The convolutional neural network takes a stereophonic spectrogram as input and reduces its dimensionality using two parallel encoders. The latent space consists of a fully connected layer, and its output is a compact representation of the mixture. Multiple decoders take this compact representation as input. The first layer of each decoder is fully connected and transforms the representation of the mixture into a representation of the source to isolate. The following layers, which are convolutional, reconstruct spectrograms from the transformed latent space.

Instead of using one neural network per source, a single neural network is used, which outputs the estimates of the four sources (bass, drums, vocals and others). This reduces processing time and lets information about all sources be shared among the decoders.
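
A minimal PyTorch sketch of this layout follows; the kernel shapes, channel counts, pooling factors and patch size are illustrative assumptions, not the exact configuration of the thesis network:

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        """One convolutional branch; the kernel shape decides whether it captures
        vertical (percussive) or horizontal (stationary) spectrogram patterns."""
        def __init__(self, kernel):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(2, 16, kernel_size=kernel, padding="same"),  # stereo input
                nn.BatchNorm2d(16),
                nn.ReLU(),
                nn.MaxPool2d((4, 2)),  # downsample so the dense latent layer stays small
            )

        def forward(self, x):
            return self.net(x)

    class Separator(nn.Module):
        """Two parallel encoders -> dense latent space -> one decoder head per source."""
        def __init__(self, freq_bins=512, frames=32, latent_dim=256, n_sources=4):
            super().__init__()
            self.enc_vertical = Encoder(kernel=(15, 1))    # vertical/percussive patterns
            self.enc_horizontal = Encoder(kernel=(1, 15))  # horizontal/stationary patterns
            feat_shape = (16, freq_bins // 4, frames // 2)
            enc_feats = 2 * feat_shape[0] * feat_shape[1] * feat_shape[2]
            self.latent = nn.Linear(enc_feats, latent_dim)  # compact mixture representation
            # First decoder layer: one dense mapping per source...
            self.source_fc = nn.ModuleList(
                [nn.Linear(latent_dim, feat_shape[0] * feat_shape[1] * feat_shape[2])
                 for _ in range(n_sources)]
            )
            # ...while the transposed-convolution layers are shared across sources
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(16, 2, kernel_size=(4, 2), stride=(4, 2)),
                nn.ReLU(),  # magnitude spectrograms are non-negative
            )
            self.feat_shape = feat_shape

        def forward(self, mix):  # mix: (batch, 2, freq_bins, frames)
            z = torch.cat([self.enc_vertical(mix), self.enc_horizontal(mix)], dim=1)
            z = torch.relu(self.latent(z.flatten(1)))
            outs = [self.decoder(fc(z).view(-1, *self.feat_shape)) for fc in self.source_fc]
            return torch.stack(outs, dim=1)  # (batch, n_sources, 2, freq_bins, frames)

    model = Separator()
    estimates = model(torch.rand(1, 2, 512, 32))  # one spectrogram patch
    print(estimates.shape)  # torch.Size([1, 4, 2, 512, 32])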

Once the spectrograms are estimated, the final stage applies a time-varying filter through soft masks. In this way, processing reduces to filtering the input audio mixture instead of generating new audio from scratch with the neural network. The resulting spectrograms are taken back to the time domain with the Griffin-Lim algorithm.
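
A minimal sketch of this stage, assuming single-channel magnitude spectrograms and librosa's Griffin-Lim implementation (the energy-ratio soft mask below is a common choice, not necessarily the exact mask used in the thesis):

    import numpy as np
    import librosa

    def soft_masks(estimates, eps=1e-8):
        # estimates: (n_sources, freq_bins, time_frames) magnitude estimates.
        # Each mask gives the fraction each source contributes to every bin.
        return estimates / (estimates.sum(axis=0, keepdims=True) + eps)

    def separate(mix_mag, estimates, n_fft=1024, hop_length=256):
        sources = []
        for mask in soft_masks(estimates):
            filtered = mask * mix_mag  # time-varying filter applied to the mixture
            # Griffin-Lim iteratively estimates a phase consistent with the magnitudes
            audio = librosa.griffinlim(filtered, n_iter=32,
                                       n_fft=n_fft, hop_length=hop_length)
            sources.append(audio)
        return sources  # one time-domain signal per source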

Currently I am exploring new neural network architectures to separate any kind of source, with a focus on real-time applications.

My name is Leonardo Pepino and I am passionate about audio signal processing and machine learning. I like creating tools that are useful for artists and sharing what I learn along the way.

Feel free to write to me:

leonardodpepino@gmail.com

If this was useful for your research, please reference it as:

@thesis{pepino2019,
  title  = {Separación de fuentes musicales mediante redes neuronales convolucionales},
  author = {Pepino, Leonardo},
  school = {Universidad Nacional de Tres de Febrero},
  year   = {2019}
}