SLIDE 1

Jointly Detecting and Separating Singing Voice: A Multi-Task Approach

Daniel Stoller¹, Sebastian Ewert²*, Simon Dixon¹

¹Centre for Digital Music, Queen Mary University of London  ²Spotify, London

LVA/ICA, 05.07.2018

*Work was conducted at Queen Mary University of London
  • D. Stoller, S. Ewert, S. Dixon (QMUL)
Jointly Detecting and Separating Singing Voice, LVA ICA 05.07.2018, 1 / 13
SLIDE 2

Vocal separation

Introduction

Main task: separate vocals from music pieces.
Applications: karaoke generation, singer identification, voice analysis, ...

SLIDE 3

Vocal separation

Challenges

Difficult task, small multi-track datasets ⇒ overfitting.
Give the model more knowledge:

Regularization (e.g. weight decay)

[Diagram: Music → Separator model → Voice + Accompaniment; regularisation applied to the model]
SLIDE 4

Vocal separation

Challenges

Difficult task, small multi-track datasets ⇒ overfitting.
Give the model more knowledge:

Knowledge-driven (e.g. KAM [4])

[Diagram: Music → Separator model → Voice + Accompaniment; prior knowledge restricts the model]
SLIDE 5

Vocal separation

Challenges

Difficult task, small multi-track datasets ⇒ overfitting.
Give the model more knowledge:

Informed source separation [2]

[Diagram: Music + side information → Separator model → Voice + Accompaniment]
SLIDE 6

Vocal separation

Challenges

Difficult task, small multi-track datasets ⇒ overfitting.
Give the model more knowledge:

Integrate information from related tasks/datasets

[Diagram: Music (dataset A) → Separator model → Voice + Accompaniment; Music (dataset B) → Model → Label; information shared between the two models]
SLIDE 7

Goal

Which other tasks could help? Vocal activity detection is promising:

  • Knowing vocal activity improves vocal separation [1]
  • Vocal detection networks learn a form of separation [5]

SLIDE 8

Initial approach

Using additional non-vocal sections

U-Net adaptation [3] as separator, MSE loss.
Sample instrumental sections also from SVD databases ⇒ diversifies instrumental training data.

[Diagram: sampling excerpts from Song A (SVS database) and Song B (SVD database)]

SLIDE 9

Initial approach

Results

[Table: Training (SVD): Jamendo, Electro, RWC — Training (SVS): DSD100 — Validation/Test (SVS): DSD100]

Performance decrease

SLIDE 10

Initial approach

Results

[Table: Training (SVD): Jamendo, Electro, RWC — Training (SVS): DSD100, CCMixter, iKala — Validation/Test (SVS): DSD100]

Performance increase

SLIDE 11

Initial approach

Results

[Table: Training (SVD): Jamendo, Electro, RWC — Training (SVS): DSD100, CCMixter, iKala, MedleyDB — Validation/Test (SVS): DSD100, CCMixter, iKala, MedleyDB]

Performance decrease

SLIDE 12

Initial approach

Results

[Table: Training (SVD): Jamendo, Electro, RWC — Training (SVS): DSD100, CCMixter, iKala, MedleyDB — Validation/Test (SVS): DSD100, CCMixter, iKala, MedleyDB]

Performance decrease. Dataset bias?

SLIDE 13

Dataset bias

Analysis

[Figure: dataset statistics for DSD100, MedleyDB, CCMixter, iKala, Jamendo, RWC — a) mean RMS of the mixture; b) accompaniment-to-vocals RMS ratio; c) mean vocal activity]
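The three per-track statistics in this analysis can be sketched as follows (the frame length and activity threshold are assumptions for illustration, not values from the talk):

```python
import numpy as np

def dataset_stats(vocals, accomp, frame_len=1024, act_threshold=0.01):
    """Sketch of the dataset-bias statistics for one track.

    vocals, accomp: 1-D sample arrays of the vocal and accompaniment stems.
    Returns (mean mixture RMS, accompaniment-to-vocals RMS ratio,
    fraction of frames counted as vocally active).
    """
    mixture = vocals + accomp
    rms = lambda x: float(np.sqrt(np.mean(x ** 2)))
    # c) a frame counts as 'vocal' if its vocal RMS exceeds a small threshold
    n = (len(vocals) // frame_len) * frame_len
    frame_rms = np.sqrt(np.mean(vocals[:n].reshape(-1, frame_len) ** 2, axis=1))
    vocal_activity = float(np.mean(frame_rms > act_threshold))
    return rms(mixture), rms(accomp) / rms(vocals), vocal_activity
```

Comparing these numbers across datasets shows whether, e.g., iKala's loudness or vocal density differs systematically from DSD100's.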
SLIDE 14

Multi-task approach

Introduction and motivation

Key idea: Predict both audio and label

[Diagram: Music (SVS dataset) → Separator model → Voice + Accompaniment; Music (SVD dataset) → Detection model → Voice activity label; shared weights between the two models]
SLIDE 15

Multi-task approach

Introduction and motivation

Key idea: Predict both audio and label

[Diagram: Music (SVS dataset) → Separator model → Voice + Accompaniment; Music (SVD dataset) → Detection model → Voice activity label; shared weights between the two models]

L_MSE = E_{(m,s)∼p1} [ (1/N) ‖s − f_φ(m)‖² ]
L_CE  = −E_{(m,o)∼p2} [ (1/T) Σ_{t=1}^{T} log p_φ^t(o_t | m) ]
L_MTL = α · L_MSE + (1 − α) · L_CE
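As a rough illustration of the combined objective, the two losses can be sketched in NumPy (names are illustrative, not from the paper; the expectations become averages over one batch):

```python
import numpy as np

def multitask_loss(voice_true, voice_pred, labels_true, label_probs, alpha=0.5):
    """Sketch of L_MTL = alpha * L_MSE + (1 - alpha) * L_CE.

    voice_true, voice_pred: N audio samples (separation target and estimate)
    labels_true: binary vocal-activity labels for T frames (0/1)
    label_probs: predicted probability of 'vocal active' per frame
    """
    # L_MSE: mean squared error between true and estimated vocal signal
    l_mse = np.mean((voice_true - voice_pred) ** 2)
    # L_CE: average negative log-likelihood of the correct activity label
    eps = 1e-12  # guard against log(0)
    p_correct = np.where(labels_true == 1, label_probs, 1.0 - label_probs)
    l_ce = -np.mean(np.log(p_correct + eps))
    return alpha * l_mse + (1.0 - alpha) * l_ce
```

With α = 1 the objective reduces to the separation loss alone; α trades off the two tasks.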
SLIDE 16

Multi-task approach

Introduction and motivation

Key idea: Predict both audio and label


Robust to dataset bias and to inaccurate labels

SLIDE 17

Multi-task approach

Introduction and motivation

Key idea: Predict both audio and label


Can train with vocal sections from SVD data

SLIDE 18

Multi-task approach

Introduction and motivation

Key idea: Predict both audio and label


Needs only mixture at test time

SLIDE 19

Multi-task approach

Introduction and motivation

Key idea: Predict both audio and label


Solves two tasks at once

SLIDE 20

Experimental setup

Model architecture and dataset

DSD100 as SVS, Jamendo as SVD training data

SLIDE 21

Experimental setup

Evaluation metrics: AU-ROC, MSE, SDR

AU-ROC for SVD; MSE and SDR/SIR/SAR for separation

SDR involves log(0) and is undefined for non-vocal sections (≈ 10% of the data) ⇒ also measure the RMS of the vocal estimate in non-vocal sections
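A minimal sketch of this fallback metric: the RMS of the vocal estimate restricted to frames where the reference vocals are silent (frame length and silence threshold are assumptions for illustration):

```python
import numpy as np

def nonvocal_rms(vocal_estimate, vocal_true, frame_len=1024, silence_eps=1e-8):
    """RMS of the vocal estimate over frames where the true vocals are silent.

    Lower is better: an ideal separator outputs silence in non-vocal sections,
    where SDR's log ratio is undefined.
    """
    n = (len(vocal_true) // frame_len) * frame_len
    est = vocal_estimate[:n].reshape(-1, frame_len)
    ref = vocal_true[:n].reshape(-1, frame_len)
    silent = np.sqrt(np.mean(ref ** 2, axis=1)) < silence_eps  # non-vocal frames
    if not silent.any():
        return None  # the piece has no non-vocal sections
    return float(np.sqrt(np.mean(est[silent] ** 2)))
```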

SLIDE 22

Results

Single-task vs. multi-task model

Model | AU-ROC | MSE     | Non-voc. RMS | Vocals SDR/SIR/SAR | Accomp. SDR/SIR/SAR
SVD   | 0.9239 | –       | –            | –                  | –
SVS   | –      | 0.01865 | 0.0194       | 2.83 / 5.27 / 6.88 | 6.71 / 14.75 / 13.25
Ours  | 0.9250 | 0.01755 | 0.0155       | 2.86 / 5.56 / 6.23 | 6.69 / 13.24 / 14.11

Table: Comparing the SVS and SVD baselines with our approach

SLIDE 23

Results

Qualitative comparison

SLIDE 24

Summary

Current SotA methods only use multi-track data.
Our approach also uses SVD databases.
Improved separation and detection performance.
Future work: larger datasets, more related tasks.

SLIDE 25

[1] T.-S. Chan, T.-C. Yeh, Z.-C. Fan, H.-W. Chen, L. Su, Y.-H. Yang, and R. Jang. Vocal activity informed singing voice separation with the iKala dataset. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 718–722. IEEE, 2015.

[2] S. Ewert, B. Pardo, M. Müller, and M. D. Plumbley. Score-informed source separation for musical audio recordings: An overview. IEEE Signal Processing Magazine, 31(3):116–124, 2014.

[3] A. Jansson, E. J. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde. Singing voice separation with deep U-Net convolutional networks. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 323–332, 2017.

[4] A. Liutkus, D. Fitzgerald, and Z. Rafii. Scalable audio separation with light kernel additive modelling. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 76–80. IEEE, 2015.

[5] J. Schlüter. Learning to pinpoint singing voice from weakly labeled examples. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 44–50, 2016.
SLIDE 27

Wave-U-Net

A Multi-Scale Neural Network for End-to-End Audio Source Separation

Daniel Stoller¹, Sebastian Ewert², Simon Dixon¹

¹Queen Mary University of London  ²Spotify

SLIDE 28

Previous work

Mostly spectrogram-based [1,2,3]

  • Problem: Reconstruct source signal from its spectrogram estimates
  • Result: Output artifacts
SLIDE 29

Previous work

Recently: Few time-domain approaches [4,5]

  • Problem: Model long-term dependencies in raw audio
  • Result: Context-deprived [4] or slow [5] models
SLIDE 30

Our solution: Wave-U-Net

Inspired by U-Net [1,6] and wavelets.
Core idea: feature hierarchy

  • Features at different timescales
  • Efficient long-term dependency modelling

Simple system
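
One way to see why a feature hierarchy models long-term dependencies efficiently: after each decimation-by-2 stage, a convolution of the same kernel size spans twice as many input samples, so the receptive field grows exponentially with depth. A rough sketch (the exact Wave-U-Net layer layout differs, and the kernel size is an assumption):

```python
def receptive_field(levels, kernel_size=15):
    """Receptive field (in input samples) of a chain of 'levels' blocks,
    each a convolution of size kernel_size followed by decimation by 2."""
    rf, jump = 1, 1
    for _ in range(levels):
        rf += (kernel_size - 1) * jump  # the convolution widens the field
        jump *= 2                       # decimation doubles the step size
    return rf
```

Doubling the depth roughly doubles the covered timescale again, at only linear extra cost.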

SLIDE 31

Upsampling

Commonly used: transposed convolutions ⇒ introduces high-frequency noise.
Solutions:

  • Linear interpolation
  • Learned upsampling:
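
The linear-interpolation alternative can be sketched as follows: each new timestep is the midpoint of its two neighbours, so no high-frequency content is introduced (the (time, channels) layout is an assumption for illustration):

```python
import numpy as np

def linear_upsample(features):
    """Upsample a (time, channels) feature map by ~2x with linear
    interpolation, inserting midpoints between neighbouring timesteps."""
    mid = 0.5 * (features[:-1] + features[1:])        # midpoints
    out = np.empty((2 * len(features) - 1,) + features.shape[1:])
    out[0::2] = features                              # keep original frames
    out[1::2] = mid                                   # interleave midpoints
    return out
```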
SLIDE 32

Context-aware prediction

Border artifacts with existing systems (equal no. of input & output timesteps).
Solution: no zero-padding for convolutions ⇒ prediction of source only for centre piece of mixture input.
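The effect of dropping zero-padding is a simple size calculation: each "valid" convolution with kernel size k trims k − 1 timesteps, so the network is fed extra mixture context around the segment whose sources it actually predicts. A sketch:

```python
def output_length(input_len, kernel_sizes):
    """Output timesteps after a chain of 'valid' (no zero-padding) 1-D
    convolutions: each kernel of size k removes k - 1 timesteps."""
    for k in kernel_sizes:
        input_len -= k - 1
        if input_len <= 0:
            raise ValueError("input too short for this network")
    return input_len
```

The trimmed samples (sum of k − 1 over all layers) come from neighbouring audio in the mixture, so only the centre piece is predicted and border artifacts are avoided.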

SLIDE 33

Results

Improvements over spectrogram-based equivalent.
Encouraging performance in SiSec challenge.
Further improvements with:

  • Context-aware prediction
  • Stereo handling

Code, trained model and audio examples: https://github.com/f90/Wave-U-Net

SLIDE 34

References

[1] A. Jansson, E. J. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde. Singing voice separation with deep U-Net convolutional networks. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 323–332, 2017.

[2] P.-S. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson. Singing-voice separation from monaural recordings using robust principal component analysis. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 57–60, 2012.

[3] S. Uhlich, F. Giron, and Y. Mitsufuji. Deep neural network based instrument extraction from music. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2135–2139, 2015.

[4] E. M. Grais, D. Ward, and M. D. Plumbley. Raw multi-channel audio source separation using multi-resolution convolutional auto-encoders. arXiv preprint arXiv:1803.00702, 2018.

[5] Y. Luo and N. Mesgarani. TasNet: Time-domain audio separation network for real-time, single-channel speech separation. CoRR, abs/1711.00541, 2017.

[6] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 234–241, 2015.