Research Corpora

Most of the research in CompMusic has been done by studying five research corpora [Serra, 2014], all of them available from Dunya. These research corpora are not fixed; they are data collections from various sources that evolve and grow. From these corpora we created datasets to carry out specific experiments. The corpora and datasets have been curated and shared according to our Data Management Plan. The available components of each corpus are described below.

Carnatic [view on dunya]

The Carnatic corpus [Srinivasamurthy et al. 2014] includes 2,380 audio recordings (235 concerts, 500 hours), covering 259 artists, 965 compositions, 227 ragas, 15 talas, and 15 forms.

Audio collection: Audio recordings of concerts obtained from various sources. The bulk of the recordings come from Charsur Digital Workstation; others were obtained from the artists and made openly available on Internet Archive.
Editorial metadata: For each audio recording we have editorial information stored and organized as the Carnatic-Dunya collection in MusicBrainz. This information has been carefully curated.
Contextual Information: Information about the music concepts and entities used in our audio collection taken mainly from Wikipedia.
Lyrics: Lyrics of the songs included in the audio collection and obtained from Sahityam.net.
Scores: Scores of the pieces in the audio collection from the archives hosted by Dr. Shivkumar Kalyanaraman. These were manually converted to a machine-readable format.
Datasets: List of datasets.

Hindustani (dunya.compmusic.upf.edu/hindustani)

The Hindustani corpus [Srinivasamurthy et al. 2014] includes 1,124 audio recordings (236 releases, 305 hours) covering 363 artists, 693 compositions, 195 ragas (melodic modes), 26 talas (metric modes), and 24 forms.

Audio collection: Audio recordings obtained from various sources, mainly commercial. Some were obtained from the artists and made openly available on Internet Archive.
Editorial metadata: For each audio recording we have editorial information stored and organized as the Hindustani-Dunya collection in MusicBrainz. This information has been carefully curated.
Contextual Information: Information about the music concepts and entities used in our audio collection taken from Wikipedia.
Lyrics and Scores: Lyrics and scores of the songs included in the audio collection, obtained from Swarganga Music Foundation.
Datasets: List of datasets.

Turkish-makam (dunya.compmusic.upf.edu/makam)

The Turkish makam corpus [Uyar et al. 2014; Şentürk 2016] includes 6,601 audio recordings (420 hours) covering 2,928 works, 811 artists, 111 makams (melodic modes), 74 usuls (metric modes), and 87 forms. The corpus also includes 2,200 scores in MusicXML format.

Audio collection: Audio recordings obtained from various sources, mainly commercial. Part of the collection is openly available from Internet Archive.
Score collection: Scores in plain text, MusicXML, and MIDI formats, available on GitHub.
Editorial metadata: For each audio recording we have editorial information stored in MusicBrainz. This information has been carefully curated.
Datasets: List of datasets.

Beijing Opera (dunya.compmusic.upf.edu/jingju)

The jingju corpus [Caro Repetto and Serra 2014] includes 864 audio recordings of sung arias (71 hours) covering 653 arias, 74 singers, the five main role types, and the two main modes. It also includes a representative collection of musical scores and lyrics.

Audio collection: Audio recordings obtained from commercial releases.
Editorial metadata: For each audio recording we have editorial information stored and organized as the Dunya Beijing Opera collection in MusicBrainz. This information has been carefully curated, maintaining its original Chinese language and script. Romanization in the Hanyu Pinyin system is provided for each release, recording, work and artist either as pseudo-releases or aliases. These metadata can be easily accessed through our Dunya API using the MusicBrainz unique identifier (MBID) of the recordings.
Lyrics: Lyrics obtained from open repositories on the web, mainly 京剧艺术 and 中国京剧戏考.
Scores: Collection of scores in machine-readable format created from the following published sources.
- Jingju qupu jicheng 京剧曲谱集成 (Collection of jingju scores), 10 vols., Shanghai wenyi chubanshe, Shanghai, 1998.
- Jingju qupu jingxuan 京剧曲谱精选 (Selected scores of jingju), 2 vols., Shanghai yinyue chubanshe, Shanghai, 1998–2005.
- Zhongguo jingju liupai jumu jicheng 中国京剧流派剧目集成 (Collection of plays of Chinese jingju schools), 21 vols., Xueyuan chubanshe, Beijing, 2006–2010.
Datasets: List of datasets.

Arab-Andalusian (dunya.compmusic.upf.edu/andalusian)

The Arab-Andalusian corpus [Sordo et al. 2014] includes 338 recordings (112 hours) of performances from the 1960s and 1970s by the three most important schools of Morocco. Of these, 165 recordings correspond to most of the mizans (sections in a larger performance) of the eleven preserved nawbas (suites). The corpus also includes score transcriptions of these 165 recordings.

Audio collection: Audio recordings obtained from a personal collection of our Arab-Andalusian collaborator, Amin Chaachoo. All these recordings are openly available on Internet Archive.
Editorial metadata: For each audio recording we have editorial information stored in MusicBrainz. This information has been carefully curated. However, some culture-specific metadata cannot be stored in MusicBrainz, such as the nawba, the tab', the mizan, the musical form, and the start and end timestamps of each section of the recordings. This metadata is stored in our own dataset and can be accessed through our Dunya API using the MusicBrainz unique identifier (MBID) of the recordings.
Lyrics: Lyrics for each Arab-Andalusian recording in the corpus, obtained from the songbook "Diwan Al-Ala" by Mehdi Chaachoo (Imprimerie Al Khalij Al Arabi, Tetouan, Morocco, 2009) and selected manually by listening to all the recordings.
- The first version of the lyrics is available in TSV and JSON format here.
- Also accessible through our Dunya API
Scores: Transcriptions for each Arab-Andalusian recording in the corpus, created by our collaborator, the Arab-Andalusian musicologist Amin Chaachoo. All these scores are available online on MuseScore.
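
The jingju and Arab-Andalusian entries above both mention retrieving metadata through the Dunya API using the MBID of a recording. As a rough illustration only, the sketch below prepares such a request without sending it. The base URL, the `<corpus>/recording/<mbid>` path layout, the token-based `Authorization` header, and the function name `build_recording_request` are all assumptions made for this example and should be checked against the Dunya API documentation. The only part that is certain is that an MBID is a UUID, which is what the validation step relies on.

```python
import urllib.request
import uuid

# Assumed API root; not confirmed by the text above.
DUNYA_API_ROOT = "https://dunya.compmusic.upf.edu/api"


def build_recording_request(corpus: str, mbid: str, token: str) -> urllib.request.Request:
    """Prepare (but do not send) a metadata request for one recording.

    `corpus` is a corpus slug such as "jingju" or "andalusian" (assumed to
    match the path segments used on the Dunya site).
    """
    # MBIDs are UUIDs; uuid.UUID raises ValueError on malformed input
    # and str() returns the canonical lowercase form.
    canonical_mbid = str(uuid.UUID(mbid))

    # Hypothetical endpoint layout: <root>/<corpus>/recording/<mbid>
    url = f"{DUNYA_API_ROOT}/{corpus}/recording/{canonical_mbid}"

    # Token authentication via this header is an assumption.
    return urllib.request.Request(url, headers={"Authorization": f"Token {token}"})
```

If the assumptions hold, the returned object could be passed to `urllib.request.urlopen` and the response body decoded as JSON. Validating the MBID before building the URL keeps malformed identifiers from reaching the server.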