Data Management Plan

Project title: Computational Models for the Discovery of the World's Music (CompMusic)
Lead institution: Universitat Pompeu Fabra
Principal Investigator: Xavier Serra
Project duration: 2011–2017
Funded by: European Research Council (ERC Advanced Grant 2010, Grant Agreement No. 267583)
Project website: https://compmusic.upf.edu
Data access portal: https://compmusic.upf.edu/corpora

1. Data Summary

The CompMusic project aimed to advance the automatic description and understanding of music traditions outside the Western canon through culturally informed computational approaches. A key component of the project was the creation, documentation, and public dissemination of research datasets.

The types of data produced and used include:

Audio recordings of traditional, commercial, and archival music performances.
Symbolic music data, including manually or semi-automatically transcribed scores (in formats such as MIDI or MusicXML).
Metadata from online music platforms (e.g., titles, artist information, lyrics, raga, tala, or maqam labels).
Time-aligned annotations of melodic, rhythmic, and structural characteristics (e.g., note onsets, segments, tonic, tempo, sections).
Textual resources related to the musical cultures studied (e.g., artist biographies, liner notes, academic texts).

These data supported research on music information retrieval (MIR), computational musicology, and digital humanities, specifically focusing on traditions such as Hindustani, Carnatic, Ottoman-Turkish, Andalusian, and Beijing Opera music.

2. Applicable Standards and Ethical Principles

All data collection and use complied with copyright legislation and, where applicable, followed principles of cultural sensitivity and data ethics. Publicly available commercial recordings were referenced but not redistributed. Annotated or derived data were published only where permitted.

No personal or sensitive data were collected. Where field recordings or performance data from individuals were used, informed consent was obtained and the data were anonymized where necessary.

3. Data Documentation and Quality

Each dataset is documented with a detailed README file, standardized metadata, annotation guidelines, and references to accompanying publications.

Key formats and conventions used:

Audio: WAV, MP3
Symbolic music: MIDI, MusicXML
Annotations: TextGrid, CSV, JSON
Metadata: RDF, JSON-LD, XML, TSV

Data versioning and provenance are tracked via Git repositories and DOIs for major datasets. Annotation quality was ensured through expert review and controlled protocols.
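
As an illustration of the kind of check a controlled annotation protocol can include, the following is a minimal Python sketch that parses time-aligned section annotations from CSV and verifies that segments have positive duration and do not overlap. The column names (start, end, label), the sample values, and the function name load_segments are assumptions made for this example; they are not taken from the project's actual annotation files.

```python
import csv
import io

# Hypothetical annotation file: one section per row, times in seconds.
# Column names and values are illustrative only.
SAMPLE_CSV = """start,end,label
0.00,12.50,alap
12.50,47.25,jor
47.25,90.00,jhala
"""


def load_segments(text):
    """Parse time-aligned segments and check that they are well-formed."""
    segments = []
    for row in csv.DictReader(io.StringIO(text)):
        start, end = float(row["start"]), float(row["end"])
        if end <= start:
            raise ValueError(f"segment {row['label']!r} has non-positive duration")
        # Each segment must begin at or after the end of the previous one.
        if segments and start < segments[-1][1]:
            raise ValueError(f"segment {row['label']!r} overlaps the previous one")
        segments.append((start, end, row["label"]))
    return segments


segments = load_segments(SAMPLE_CSV)
total = sum(end - start for start, end, _ in segments)
print(len(segments), "segments covering", total, "seconds")
```

A check of this kind could be run before a dataset version is tagged in Git, so that malformed annotations are caught before release.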

4. Storage and Backup During the Project

Data were stored on UPF’s secure servers, with daily backups, during the project’s active phase. Code, metadata, and annotations were version-controlled using Git.

Long-term storage was arranged through Zenodo and UPF repositories.

5. Long-term Preservation and Reuse

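The datasets and the software supporting them are openly available through the project's data access portal (https://compmusic.upf.edu/corpora), Zenodo, GitHub, and UPF repositories.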

Licensing:

Public-domain data and data shared with explicit permission are released under Creative Commons licenses (e.g., CC-BY).
Derived metadata and annotations created by the project team are openly licensed.
For copyrighted commercial audio, only metadata and time-aligned annotations are shared, with references to the original source.
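
The metadata-only sharing rule for copyrighted audio can be sketched as a simple record check. The Python example below is a hypothetical illustration: the field names (source_reference, annotations, and so on), the placeholder values, and the helper is_shareable are all assumptions for this sketch and do not describe the project's actual schema.

```python
import json

# All field names and values below are hypothetical placeholders.
record = {
    "title": "Example recording",
    "tradition": "Carnatic",
    "raga": "example-raga",
    "source_reference": "catalogue entry of the original commercial release",
    "annotations": [{"start": 0.0, "end": 12.5, "label": "section-1"}],
}

# Keys that would indicate the audio itself is being redistributed.
FORBIDDEN_KEYS = {"audio", "audio_path", "audio_url"}


def is_shareable(rec):
    """True if the record references its source and embeds no audio."""
    return "source_reference" in rec and FORBIDDEN_KEYS.isdisjoint(rec)


print(json.dumps(record, indent=2))
print("shareable:", is_shareable(record))
```

Under these assumptions, a record that carried an audio file path or omitted its source reference would be rejected before publication.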

6. Data Sharing and Open Access

The project followed FAIR principles:

Findable: All datasets have persistent identifiers (e.g., DOIs on Zenodo).
Accessible: Datasets are openly accessible through web interfaces and APIs.
Interoperable: Standard formats are used for audio and symbolic data.
Reusable: Complete documentation, sample code, and related publications are provided.

CompMusic datasets have been reused in international research and teaching contexts, and cited in numerous academic publications.

7. Responsibilities

The Principal Investigator, Xavier Serra, oversaw the data strategy.
Dataset curation and documentation were carried out by the MTG research team and doctoral researchers.
Licensing and repository management were coordinated in consultation with UPF’s legal and technical services.

8. Resources

The MTG provided infrastructure, expertise, and long-term commitment to dataset curation. Post-project, continued maintenance and dissemination of the data are ensured through:

Integration into MTG’s long-term data initiatives.
Archival hosting on Zenodo, GitHub, and UPF repositories.
Ongoing citation in academic work and integration into open-source tools such as Essentia and MTG-JS.