Most of the research in CompMusic is done by studying five research corpora [Serra, 2014], most of which will be used by Dunya. These research corpora are not fixed; they are data collections from various sources that evolve and grow. From these corpora we create fixed test corpora (datasets) to carry out specific experiments. Here are the links to the currently available components of each corpus.

Carnatic

Audio collection: Audio recordings of concerts obtained from various sources. The bulk of the recordings come from Charsur Digital Workstation.
Editorial metadata: For each audio recording we have editorial information stored and organized as the Carnatic-Dunya collection in MusicBrainz. This information has been carefully curated.
Contextual Information: Information about the music concepts and entities used in our audio collection taken from Wikipedia and Kutcheris.com.
Lyrics: Lyrics of the songs included in the audio collection, obtained from Sahityam.net.
Scores: Scores of the pieces in the audio collection from the archives hosted by Dr. Shivkumar Kalyanaraman. These were manually converted to a machine-readable format.
Community information: Discussions and commentaries made by the community and obtained from Rasikas.

Hindustani

Audio collection: Audio recordings obtained from various sources, mainly commercial. These were manually converted to a machine-readable format.
Editorial metadata: For each audio recording we have editorial information stored and organized as the Hindustani-Dunya collection in MusicBrainz. This information has been carefully curated.
Contextual Information: Information about the music concepts and entities used in our audio collection taken from Wikipedia.
Lyrics and Scores: Lyrics and scores of the songs included in the audio collection, obtained from Swarganga Music Foundation.

Turkish-makam

Audio collection: Audio recordings obtained from various sources, mainly commercial.
- Part of the collection is online in the Internet Archive.
Score collection: Scores, including plain text files, MusicXML files, and MIDI files, are available on GitHub.
Editorial metadata: For each audio recording we have editorial information stored in MusicBrainz. This information has been carefully curated.

Beijing Opera

Audio collection: Audio recordings obtained from commercial releases.
Editorial metadata: For each audio recording we have editorial information stored and organized as the Dunya Beijing Opera collection in MusicBrainz. This information has been carefully curated, maintaining its original Chinese language and script. Romanization in the Hanyu Pinyin system is provided for each release, recording, work and artist either as pseudo-releases or aliases. These metadata can be easily accessed through our Dunya API using the MusicBrainz unique identifier (MBID) of the recordings.
Lyrics: Lyrics are obtained from open repositories on the web, mainly 京剧艺术 and 中国京剧戏考.
Scores: The score collection consists of printed editions from commercial publications. The scores needed for research purposes will be converted into a machine-readable format, maintaining their original jianpu notation.
- Jingju qupu jicheng 京剧曲谱集成 (Collection of jingju scores), 10 vols., Shanghai wenyi chubanshe, Shanghai, 1998.
- Jingju qupu jingxuan 京剧曲谱精选 (Selected scores of jingju), 2 vols., Shanghai yinyue chubanshe, Shanghai, 1998–2005.
- Zhongguo jingju liupai jumu jicheng 中国京剧流派剧目集成 (Collection of plays of Chinese jingju schools), 21 vols., Xueyuan chubanshe, Beijing, 2006–2010.

Arab-Andalusian

Audio collection: Audio recordings obtained from the personal collection of our Arab-Andalusian collaborator, Amin Chaachoo. All of these recordings are available online in the Internet Archive.
Editorial metadata: For each audio recording we have editorial information stored in MusicBrainz. This information has been carefully curated. Nevertheless, there is some culture-specific metadata that cannot be stored in MusicBrainz, such as the nawba, the tab', the mizán, the music form, and the start and end time stamps of each section of the recordings. This metadata is stored in our own dataset and can be easily accessed through our Dunya API using the MusicBrainz unique identifier (MBID) of the recordings.
Lyrics: Lyrics for each Arab-Andalusian music recording of the corpus, obtained from the songbook "Diwan Al-Ala" by Mehdi Chaachoo (Imprimerie Al Khalij Al Arabi, Tetouan, Morocco, 2009) and selected manually by listening to all the recordings.
- The first version of the lyrics, in TSV and JSON format, is available here.
- Also accessible through our Dunya API.
Scores: Transcriptions for each Arab-Andalusian music recording of the corpus, created by our collaborator, the Arab-Andalusian musicologist Amin Chaachoo. [Available Soon (by September 2014)]
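
Since the culture-specific metadata is keyed by the MBID of each recording, it can help to normalize MBIDs before using them as lookup keys. The following is only a minimal sketch under stated assumptions: it does not call the Dunya API, the record layout and field names (`mbid`, `nawba`, `sections`) are hypothetical, and the MBID shown is made up for illustration. The only fact it relies on is that an MBID is a 36-character UUID string.

```python
import uuid


def normalize_mbid(value: str) -> str:
    """Return the canonical lowercase form of an MBID, or raise ValueError."""
    text = value.strip()
    # MBIDs are 36-character UUID strings; reject other lengths up front.
    if len(text) != 36:
        raise ValueError(f"not an MBID: {value!r}")
    # uuid.UUID raises ValueError on malformed input; str() gives lowercase.
    return str(uuid.UUID(text))


def index_by_mbid(records):
    """Build a dict from canonical MBID to record (field names are hypothetical)."""
    return {normalize_mbid(r["mbid"]): r for r in records}


# Hypothetical record for illustration only; the MBID and values are made up.
records = [
    {
        "mbid": "0F3A1B2C-4D5E-4F60-8A7B-9C0D1E2F3A4B",
        "nawba": "example-nawba",
        "sections": [{"start": 0.0, "end": 42.5}],
    }
]

index = index_by_mbid(records)
print(index["0f3a1b2c-4d5e-4f60-8a7b-9c0d1e2f3a4b"]["nawba"])
```

Normalizing once, when the index is built, means that later lookups do not depend on how an MBID was capitalized or padded in the source file.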