Companion webpage for the PhD thesis of Georgi Dzhambazov
This page is the companion web page for the PhD thesis titled
Knowledge-based Probabilistic Modeling for Tracking Lyrics in Music Audio Signals
Georgi Dzhambazov
(Last updated: 7 July 2017)
Abstract
In this thesis, we devise computational models for tracking sung lyrics in multi-instrumental music recordings. We consider not only the low-level acoustic characteristics that represent the timbre of the sung phonemes, but also higher-level music knowledge that is complementary to lyrics. We build probabilistic models, based on dynamic Bayesian networks (DBNs), that represent the relation of phoneme transitions to two facets of music knowledge: the temporal structure of a lyrics line and the structure of the metrical cycle. In one model we exploit the fact that the expected syllable durations depend on their position within a lyrics line. In another model, we propose how to estimate vocal onsets by simultaneously tracking the position in the metrical cycle, and how these estimated onsets influence the transitions between consecutive phonemes. Using the proposed models, sung lyrics are automatically aligned to written lyrics on datasets from Ottoman Turkish makam and Beijing opera, whereby principles specific to these music traditions are considered. Both models improve on a baseline that is unaware of music-specific knowledge. This confirms that music-specific knowledge is an important stepping stone for computationally tracking lyrics, especially in the challenging case of singing with instrumental accompaniment.
A longer, more detailed abstract is here
Link to the thesis manuscript
Link to the thesis defense presentation slides
Please click on the headings to expand.
Thesis defense presentation video
Datasets
All the datasets used in this work (introduced in Chapter 3.2) are publicly available for research purposes; most of them are version controlled. They are also listed on the companion web pages of the corresponding publications:
- Multi-instrumental lyrics OTMM dataset
- A cappella lyrics OTMM dataset
- Multi-instrumental vocal onsets OTMM dataset [under construction]
- A cappella lyrics jingju dataset
Publications
A list of all publications by the author as part of his work at MTG can be found here. Below are the ones relevant to the work presented in this thesis:
-
Dzhambazov, Georgi, Sertan Şentürk and Xavier Serra (2014). Automatic lyrics-to-audio alignment in classical Turkish music. In Proceedings of 4th International Workshop on Folk Music Analysis (FMA 2014), Istanbul, Turkey, pp. 61–64 [Chapter 3, excluding Sections 3.2 and 3.4.2]
-
Dzhambazov, Georgi and Xavier Serra (2015). Modeling of phoneme durations for alignment between polyphonic audio and lyrics. In Proceedings of the Sound and Music Computing Conference (SMC 2015), Maynooth, Ireland. [Chapters 4.3 and 4.4; the experiments in Chapter 3.5 were run with the baseline model from this paper]
-
Dzhambazov, Georgi, Yile Yang, Rafael Caro Repetto, and Xavier Serra (2016). Automatic alignment of long syllables in a cappella Beijing opera. In Proceedings of 6th International Workshop on Folk Music Analysis (FMA 2016), Dublin, Ireland, pp. 88–91. [Chapter 4.5]
-
Dzhambazov, Georgi, Ajay Srinivasamurthy, Sertan Şentürk, and Xavier Serra (2016). On the use of note onsets for improved lyrics-to-audio alignment in Turkish makam music. In Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR 2016), New York, NY, USA, pp. 716–722 [Chapter 5.4]
The scripts and figures used to generate these papers are available here, for reproducibility.
Code
The core code for the experiments performed as part of the thesis is organized into different git repositories. Links to selected scripts and code are given below.
Evaluation Scripts
- A tool for evaluating alignment metrics [Chapter 2.2.1]
- A tool for evaluating the percentage of correctly identified phoneme frames [Chapter 3.4.2]
Core algorithms
- A Python wrapper for alignment with the Viterbi decoding of HTK [experiments from the FMA 2014 paper]
- Duration-aware lyrics-to-audio alignment [Chapter 4]
  - set the parameter WITH_DURATIONS to 0 [for Chapter 3.5]
  - package for OTMM [Chapter 4.4]
  - package for jingju [Chapter 4.5]
- Metrical-accent-aware onset detection [Chapter 5.3] (code in this repository is still in progress)
- Note-onset-aware alignment [Chapter 5.4] (set the parameters WITH_ORACLE_ONSETS=0, WITH_DURATIONS=0, DETECTION_TOKEN_LEVEL=word)
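In the duration-aware alignment above, expected syllable durations (e.g. derived from the music score) steer the phoneme transitions. As an intuition-only sketch: in a standard HMM the dwell time in a state is geometric with mean 1/(1 − p), where p is the self-loop probability, so an expected duration can be mapped to a per-phoneme self-loop. The helper below is an illustrative assumption, not the thesis code, which uses a richer explicit-duration DBN rather than a geometric dwell model:

```python
import numpy as np

def self_loop_from_duration(expected_frames):
    """Map an expected phoneme duration (in frames) to the self-loop
    probability of a geometric dwell-time model: mean dwell = 1 / (1 - p)."""
    d = np.asarray(expected_frames, dtype=float)
    return 1.0 - 1.0 / np.maximum(d, 1.0)

# e.g. expected durations of 5 and 10 frames give self-loops of 0.8 and 0.9
```

Longer expected durations thus make the decoder more reluctant to leave a phoneme, which is the core effect the duration-aware model achieves in a more principled way.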
Other tools
- Scripts for training the phoneme (acoustic) GMM [Chapter 3.4.1]
- Scripts for preparing the training of the phoneme GMM with fuzzy mapping [Chapter 3.4.2.2]
- A walkthrough on how to reproduce the HTK-style MFCC extraction in Essentia [Chapter 3.3.3]
- An example of how to invert MFCCs back to the mel domain [Appendix]
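The phoneme GMM scripts above rely on standard expectation-maximization (EM) training. As a minimal, self-contained illustration of that estimation procedure (a 1-D toy with an assumed quantile initialization, not the actual training scripts, which operate on MFCC vectors):

```python
import numpy as np

def fit_gmm_1d(x, n_comp=2, n_iter=50):
    """Minimal EM for a 1-D Gaussian mixture model."""
    # deterministic initialization: spread the means over the data quantiles
    mu = np.quantile(x, np.linspace(0.1, 0.9, n_comp))
    var = np.full(n_comp, np.var(x))
    w = np.full(n_comp, 1.0 / n_comp)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var
```

In the thesis, one such mixture (over multi-dimensional MFCC features) models the emission distribution of each phoneme state.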
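The idea behind the MFCC inversion example above can be sketched in a few lines of NumPy. Assuming an orthonormal DCT-II and no liftering (the HTK/Essentia pipeline has additional details), the pseudo-inverse of the truncated DCT is simply its transpose, so discarded higher coefficients are treated as zero and the result is a smoothed log-mel spectrum; the function names here are illustrative:

```python
import numpy as np

def dct_matrix(n_ceps, n_bands):
    """Orthonormal DCT-II basis used to turn log-mel energies into MFCCs."""
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_bands)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_bands)) * np.sqrt(2.0 / n_bands)
    basis[0] *= 1.0 / np.sqrt(2.0)  # orthonormal scaling of the DC row
    return basis

def invert_mfcc(mfcc, n_bands):
    """Approximate log-mel energies from (possibly truncated) MFCCs.

    Because the DCT basis is orthonormal, the pseudo-inverse of the
    truncated transform is its transpose; the reconstruction is exact
    only when all n_bands coefficients are kept."""
    basis = dct_matrix(mfcc.shape[-1], n_bands)
    return mfcc @ basis
```

Keeping all coefficients recovers the log-mel vector exactly; keeping, say, the first 13 of 26 yields the smoothed envelope typically used for visualization.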
The code for the individual experiments needs refactoring; this will be done soon. Until then, if anything is unclear, please feel free to contact the author.
Results
Duration aware lyrics-to-audio alignment
A demo of durations derived from music score for OTMM (Chapter 4.4)
1. Create an account on Dunya-web. 2. Select OTMM songs that have a vocal part (filter by the form şarkı). 3. Click the link "Access lyrics player" on the right-hand side (sometimes not available, e.g. when no score is available).
Or you can have a look at an example recording
Metrical-accent aware onset detection
in progress ...
(This page and the thesis document are generated by scripts here)
For access to data, code, requests on work in progress, or if you have any questions/comments, please contact:
Georgi Dzhambazov