This page lists the datasets created for experiments carried out as part of CompMusic. They complement the corpora of the five music traditions studied. Please visit the respective pages for more details.

Indian Art Music

Indian Music Tonic Dataset

This dataset comprises 597 commercially available audio music recordings of Indian art music (Hindustani and Carnatic music), each manually annotated with the tonic of the lead artist. This dataset is used as the test corpus for the development of tonic identification approaches.

Carnatic Varnam Dataset

The Carnatic Varnam Dataset is a collection of 28 solo vocal recordings made for our research on intonation analysis of Carnatic ragas. The collection consists of audio recordings, time-aligned tala cycle annotations and swara notations in a machine-readable format.

Carnatic Music Rhythm Dataset

The Carnatic Music Rhythm Dataset is a sub-collection of 176 excerpts (16.6 hours) in four talas of Carnatic music with audio, associated tala-related metadata and time-aligned markers indicating the progression through the tala cycles. It is useful as a test corpus for many automatic rhythm analysis tasks in Carnatic music. A subset of 118 two-minute excerpts (about 4 hours) is also available with equivalent content.
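
As a rough illustration of how time-aligned tala cycle markers might be used, the sketch below computes per-cycle durations and a mean cycle rate from a list of cycle-start timestamps. The function names and the timestamp values are hypothetical. The dataset's annotation file format is not described on this page, so the input is assumed to be already parsed into seconds.

```python
from typing import List


def cycle_durations(cycle_starts: List[float]) -> List[float]:
    """Return the duration in seconds of each tala cycle.

    cycle_starts: timestamps (seconds) marking where each cycle begins,
    assumed to be in strictly increasing order.
    """
    pairs = list(zip(cycle_starts, cycle_starts[1:]))
    if any(later <= earlier for earlier, later in pairs):
        raise ValueError("cycle start times must be strictly increasing")
    return [later - earlier for earlier, later in pairs]


def mean_cycles_per_minute(cycle_starts: List[float]) -> float:
    """Return the average number of complete cycles per minute."""
    durations = cycle_durations(cycle_starts)
    if not durations:
        raise ValueError("need at least two cycle markers")
    return 60.0 * len(durations) / sum(durations)


if __name__ == "__main__":
    # Made-up marker times, not taken from the dataset.
    starts = [0.0, 6.0, 12.0, 18.0]
    print(cycle_durations(starts))
    print(mean_cycles_per_minute(starts))
```

A rhythm analysis system could be evaluated against such markers by comparing its estimated cycle boundaries with the annotated ones.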

Hindustani Music Rhythm Dataset

The Hindustani Music Rhythm Dataset is a sub-collection of 151 excerpts (5 hours) in four taals of Hindustani music with audio, associated taal-related metadata and time-aligned markers indicating the progression through the taal cycles. The dataset is useful as a test corpus for many automatic rhythm analysis tasks in Hindustani music.

Mridangam Stroke Dataset

The Mridangam Stroke Dataset is a collection of 7162 audio examples of individual strokes of the Mridangam in various tonics. The dataset comprises 10 different strokes played on Mridangams with 6 different tonic values. It can be used for training models for each Mridangam stroke.

Mridangam Tani-avarthanam Dataset

The Mridangam Tani-avarthanam dataset is a transcribed collection of two tani-avarthanams played by the renowned Mridangam maestro Padmavibhushan Umayalpuram K. Sivaraman. The audio was recorded at IIT Madras, India and annotated by professional Carnatic percussionists. It consists of about 24 min of audio and 8800 strokes.

Tabla Solo Dataset

The Tabla Solo Dataset is a transcribed collection of Tabla solo audio recordings spanning compositions from six different Gharanas of Tabla, played by Pt. Arvind Mulgaonkar. The dataset consists of audio and time aligned bol transcriptions.

Turkish Makam Music

Turkish Makam Symbolic Phrase Dataset

This study presents a large machine-readable dataset of Turkish makam music scores segmented into phrases by experts of this music. The dataset consists of 31362 phrases on a set of 480 scores of different compositions annotated by 3 experts.

Turkish Makam Melodic Phrase Dataset

In this dataset, 899 SymbTr-scores were manually annotated into melodic segments by 3 experts. There are a total of 31362 phrase annotations in this dataset.

Turkish şarkı vocal dataset

Turkish şarkı vocal dataset is a collection of 10 recordings of compositions from the vocal form şarkı. The collection has annotations with lyrical lines. Each lyrical phrase is aligned to its corresponding segment in the audio.

Turkish makam acapella sections dataset

The dataset consists of 12 a cappella performances of 11 compositions with a total duration of 19 minutes. Because appropriate a cappella material is lacking in this music tradition, solo vocal versions of the originals (taken from the Turkish şarkı vocal dataset) were sung by professional singers. Each performance was recorded in sync with the original recording, with instrumental sections left as silence. This ensures that the order in which sections are performed is kept the same.

Turkish Makam Audio-Score Alignment Dataset

This release contains 6 audio recordings of 4 peşrev compositions from the classical Ottoman-Turkish tradition. There are 51 sections in the audio recordings in total. The total number of note annotations in the audio recordings is 3896. These annotations typically follow the note sequence in the SymbTr-score. There are 3 inserted and 49 omitted notes in the annotations with respect to the SymbTr-scores.

Turkish Makam Section Dataset

This release contains 2095 sections annotated in 257 audio recordings of 58 compositions. The MIDI and SymbTr-scores of the compositions are also included in the dataset. For more information, please refer to the paper.

Turkish Composition Identification Dataset

The repository contains the machine readable music scores of 147 instrumental compositions selected from the SymbTr collection and 743 audio recordings selected from the CompMusic makam corpus. In the dataset there are 360 recordings associated with 87 music scores, forming 362 relevant audio-score pairs.

Turkish Makam Tonic Dataset

The latest release contains annotated tonic frequencies of more than 2000 audio recordings. If available, the SymbTr-scores of the corresponding compositions performed in the audio recordings are also indicated. For more information please refer to the latest release hosted on GitHub.

Turkish Makam Recognition Dataset

This repository hosts the dataset designed to test makam recognition methodologies on Turkish makam music. It is composed of 50 recordings from each of the 20 most common makams in the CompMusic Project's Makam Music collection. It is currently the largest makam recognition dataset.
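
Because the dataset is balanced (20 makams with 50 recordings each, 1000 recordings in total), it lends itself to stratified cross-validation. The sketch below is one possible way to build such folds, not the procedure used by the dataset authors. The function name `stratified_folds`, the recording IDs and the makam labels are all placeholders.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_folds(labels: Dict[str, str], n_folds: int, seed: int = 0) -> List[List[str]]:
    """Split recording IDs into n_folds folds with near-equal counts per makam.

    labels: mapping from a recording ID to its makam label (both hypothetical here).
    """
    by_class: Dict[str, List[str]] = defaultdict(list)
    for rec_id, makam in labels.items():
        by_class[makam].append(rec_id)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    folds: List[List[str]] = [[] for _ in range(n_folds)]
    for makam in sorted(by_class):
        ids = sorted(by_class[makam])
        rng.shuffle(ids)
        # Deal the shuffled recordings of this makam round-robin across folds.
        for i, rec_id in enumerate(ids):
            folds[i % n_folds].append(rec_id)
    return folds


if __name__ == "__main__":
    # Placeholder labels mimicking the 20 x 50 layout described above.
    labels = {
        f"makam{m:02d}_rec{r:02d}": f"makam{m:02d}"
        for m in range(20)
        for r in range(50)
    }
    folds = stratified_folds(labels, n_folds=10)
    print([len(fold) for fold in folds])
```

With 10 folds, each fold in this sketch holds 5 recordings per makam, so every test fold has the same class distribution as the whole dataset.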

Beijing Opera (京剧)

Beijing Opera Percussion Instrument Dataset

The Beijing Opera Percussion Instrument Dataset is a collection of 236 examples of isolated strokes spanning the four percussion instrument classes used in Beijing Opera. It can be used to build stroke models for each percussion instrument.

Beijing Opera Percussion Pattern Dataset

Beijing Opera Percussion Pattern (BOPP) dataset is a collection of 133 audio percussion patterns covering five pattern classes. The dataset includes the audio and syllable level transcriptions for the patterns (non-time aligned). It is useful for percussion transcription and classification tasks. The patterns have been extracted from audio recordings of arias and labeled by a musicologist.

Jingju A Cappella Singing Pitch Contour Dataset

Jingju A Cappella Singing Pitch Contour Dataset is a collection of pitch contour segment ground truth for 39 jingju a cappella singing recordings. The dataset includes the ground truth for (1) melodic transcription, (2) pitch contour segmentation. It is useful for melodic transcription and pitch contour segmentation tasks. The pitch contours have been extracted from audio recordings and manually corrected and segmented by a musicologist.

Jingju A Cappella Singing Audio and Boundary Annotation Dataset

This dataset is a collection of boundary annotations of a cappella singing performed by professional and amateur jingju singers. The boundaries are annotated hierarchically: line (phrase), syllable and phoneme singing units are marked on a jingju a cappella singing audio dataset. For the annotation format, units, parsing code and other details, please refer to

Jingju A Cappella Singing Audio Extended Dataset

This jingju a cappella singing audio dataset consists of 120 arias, accounting for 1265 melodic lines. It is also an extension of our existing CompMusic jingju corpus, for example the Jingju a cappella singing dataset part1. Both professional and amateur singers were invited to the recording sessions, and the most common jingju musical elements are covered. The dataset is accompanied by metadata per aria and per melodic line, annotated for automatic singing evaluation research.

Jingju Music Scores Collection

This is a collection of 92 jingju music scores gathered for the analysis of jingju singing in terms of its musical system. They were transcribed from their original printed sources into a machine-readable format using MuseScore and exported to MusicXML.

Jingju Lyrics Datasets

In order to study the expressive functions of jingju metrical patterns according to the lyrics, a series of datasets has been created from the Jingju Lyrics Collection, which was collected by scraping the online repository of jingju libretti Zhongguo jingju xikao 中国京剧戏考. These datasets were created for analysing the lyrics of the banshi yuanban, manban, kuaiban and yaoban in both shengqiang, xipi and erhuang (kuaiban is not used in erhuang), by applying NLP techniques, namely topic modelling and document classification.

Annotated Jingju Arias Dataset

The Annotated Jingju Arias Dataset is a collection of 34 jingju arias manually segmented at various levels using the software Praat. The selected arias contain samples of the two main shengqiang in jingju, namely xipi and erhuang, and the five main role types in terms of singing, namely dan, jing, laodan, laosheng and xiaosheng. The dataset consists of one Praat TextGrid file per aria, containing tiers for the following information: aria, MusicBrainz ID, artist, school, role type, shengqiang, banshi, line of lyrics, syllables, and percussion patterns.