Background

Understanding and modelling what music is and how it functions is the focus of the Sound and Music Computing field. Its basic aim is to develop veridical and effective computational models of the whole music understanding chain, from sound and structure perception to the kinds of high-level concepts that humans associate with music. This type of research tends to be quantitative-analytical and essentially reductionist, cutting up a phenomenon into individual parts and dimensions and studying these more or less in isolation. In music perception modelling, for example, we develop isolated computational models of rhythm parsing, melody identification and harmony extraction, each with rather severe limitations. This approach neglects, and fails to take advantage of, the interactions between different musical dimensions (e.g., the relations between sound and timbre, rhythm, melody, harmony, harmonic rhythm and perceived segment structure). It is likely that a ‘quantum leap’ in computational music perception will only be possible if our research manages to transcend this approach and move towards multi-dimensional models that at least begin to address the complex interplay of the many facets of music.

There is still a wide gap between what can currently be recognised and extracted from music audio signals and the kinds of high-level, semantically meaningful concepts that human listeners associate with music. Current attempts at narrowing this ‘semantic gap’ via, for example, machine learning are producing only small incremental progress. One of the fundamental reasons for this lack of major progress seems to be the more or less strictly bottom-up approach currently being taken, in which features are extracted from audio signals and ever higher-level features or labels are then computed by analysing and aggregating these features. This inadequacy is increasingly being recognised, and the coming years are likely to see a growing trend towards the integration of high-level expectation (e.g., [Huron, 2006]) and (musical) knowledge in music perception models. This, in turn, constitutes a fruitful opportunity for musicologists, psychologists and others to enter the field and contribute their valuable knowledge.
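To make this concrete, the following minimal sketch shows a typical bottom-up pipeline, assuming the librosa library is available; the particular features, parameters and mean/variance aggregation are illustrative choices, not a reference implementation.

```python
# A minimal sketch of a bottom-up pipeline (assumes librosa is installed):
# low-level features are extracted frame by frame and then aggregated into
# a single clip-level vector from which a semantic label would be predicted.
import numpy as np
import librosa

def clip_level_features(path: str) -> np.ndarray:
    """Aggregate frame-wise low-level features into one clip descriptor."""
    y, sr = librosa.load(path, sr=22050, mono=True)      # decode the signal
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # timbre descriptors
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)     # pitch-class energy
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)       # global tempo estimate
    frames = np.vstack([mfcc, chroma])
    # Mean/variance pooling discards almost all temporal structure:
    # precisely the kind of limitation discussed above.
    return np.concatenate([frames.mean(axis=1), frames.var(axis=1),
                           np.atleast_1d(tempo).astype(float)[:1]])
```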

‘Making sense of’ music is much more than decoding and parsing an incoming stream of sound waves into higher-level objects such as onsets, notes, melodies and harmonies. Music is embedded in a rich web of cultural, historical, commercial and social contexts that influence how it is interpreted and categorised. That is, many qualities or categorisations attributed to a piece of music by listeners cannot be explained solely by the content of the audio signal itself. It is thus clear that high-quality automatic music description and understanding can only be achieved by also taking into account information sources that are external to the music. Current research in Music Information Retrieval (MIR) is taking its first cautious steps in that direction by using the Internet as a source of ‘social’ information about music (‘community meta-data’). Much more thorough research into modelling these contextual aspects is to be expected.
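As a toy illustration of the ‘community meta-data’ idea, two songs can be compared purely through social tags gathered from the web, without touching the audio; the tags and counts below are invented for illustration.

```python
# Toy 'community meta-data' comparison: songs as bags of social tags,
# compared by cosine similarity. All tags and counts are invented.
import math

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

song_a = {"flamenco": 40, "guitar": 25, "passionate": 10}
song_b = {"flamenco": 30, "dance": 15, "guitar": 12}
print(cosine(song_a, song_b))  # contextual similarity from tags alone
```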

In an attempt to characterize the current state of affairs, we can distinguish between several approaches to the computational modelling of music (see diagram). The most general one is the Music Information Processing approach, which is primarily based on data modelling: it starts from databases and uses signal processing and machine learning techniques to develop its models. Computational Musicology develops models originating from music theory, in which a thorough formalization contributes to an understanding of the theory itself, its predictions and its scope. Another approach, Cognitive Musicology, aims at constructing theories of music cognition; here, the objective is to understand music perception and music performance by formalizing the mental processes involved in listening to and performing music. Finally, within the field of Human-Computer Interaction, we can use Music Interaction research and the recent paradigm of cultural computing to bring users and their context into the loop of modelling music.

[Diagram: approaches to the computational modelling of music]

Music information processing

With respect to the formal description of music and music-related information, two different trends shape existing research in Music Information Retrieval (MIR) [Orio, 2006]. On one side, symbolic description disregards audio signal analysis and listener behaviour in order to concentrate on abstract representations of musical concepts such as notes, durations, beats, rhythmic and melodic patterns, harmony, or structural relationships. On the other side, the audio-content and context description approach uses audio, text and behavioural data as input to infer and characterize the ways that music can be described and exploited [Casey et al., 2008].
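As a hedged illustration of the symbolic side, such abstract representations can be encoded as plain note events that make no reference to any signal; the following minimal sketch uses field names of our own choosing.

```python
# Minimal symbolic representation: abstract note events, no audio involved.
from dataclasses import dataclass

@dataclass
class NoteEvent:
    onset_beats: float      # metrical position, not seconds
    duration_beats: float
    pitch: int              # MIDI note number, e.g. 60 = middle C
    voice: int = 0          # melodic stream the note belongs to

# An ascending C major arpeggio, one note per beat:
melody = [NoteEvent(0.0, 1.0, 60), NoteEvent(1.0, 1.0, 64), NoteEvent(2.0, 1.0, 67)]
```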

The research around audio content processing is advancing fast in the direction of knowledge-based and top-down processing [Klapuri & Davy, 2006], which is quite relevant for this project. The work on multimodal processing, user profiling and music ontologies [Raimond, 2008] is also very relevant for us.
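To give a flavour of the ontology work, the sketch below describes a track and its artist in RDF using the Music Ontology namespace developed in Raimond [2008], via the rdflib library; the resource URIs are invented and the modelling shown is only one plausible option.

```python
# Sketch: describing a track with the Music Ontology via rdflib.
# The mo: namespace URI is the published one; resources are invented.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF, RDF

MO = Namespace("http://purl.org/ontology/mo/")
g = Graph()
track = URIRef("http://example.org/track/1")
artist = URIRef("http://example.org/artist/1")
g.add((track, RDF.type, MO.Track))
g.add((artist, RDF.type, MO.MusicArtist))
g.add((track, FOAF.maker, artist))            # link track to its maker
g.add((artist, FOAF.name, Literal("Example Artist")))
print(g.serialize(format="turtle"))
```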

Tzanetakis et al. [2007] introduced the concept of computational ethnomusicology to refer to the use of computer tools to assist ethnomusicological research. In their paper they provided some ideas and specific examples of this type of multidisciplinary research, and since then we have seen an increasing number of research articles on the topic.

Computational musicology

Top-down knowledge of music idioms is essential if we want to improve the current state of the art in music understanding technologies. This knowledge can be brought into computational applications by different means: as language-based models, as music-theoretical concepts, or through embodied interactions, among others. One of the main current claims from systematic musicology is the need to promote a user-centred and action-oriented attitude, as has been proposed for the human annotation of music [Lesaffre, 2005]. Musicological methodologies can provide a way out of Music Information Retrieval's self-imposed fictions of the semantic gap and the glass ceiling [Wiggins, 2009]. For instance, it is gradually being accepted that, in music annotation, labelling necessarily depends on the intention or goal of the analysis for a specific application, and that many of the current evaluation problems may be created by imposing standardized and static ground truths, which are often misinterpreted by non-specialists as universal laws or rules. Decades of ethnomusicological fieldwork and analysis involving music transcription and transnotation support this necessary link between notation and analytical goal [Barz and Cooley, 2008].

In music scholarship there is a growing role for formalization and for the notions of testability and falsification. A striking example is the Tonal Pitch Space theory [Lerdahl, 2001], a highly formalized framework that, consequently, has been tested and elaborated upon in a variety of disciplines, ranging from music theory, linguistics and systematic musicology to music technology and music psychology. Much of the theory's power, however, lies in its methodological benefits, founded upon paradigms from humanities research. Unfortunately, we do not yet have a theory like Lerdahl's that is applicable to non-tonal music, which would be more relevant for non-Western music.
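Part of what makes the theory testable is its quantitative core: its chord distance rule, for example, is additive and can be summarized roughly as follows (our paraphrase, not Lerdahl's exact formulation):

```latex
% Chord distance between chords x and y (paraphrased summary):
%   i : shifts along the circle of fifths at the regional level
%   j : shifts along the circle of fifths at the chordal level
%   k : non-common pitch classes in the basic space of the target chord
\delta(x \rightarrow y) = i + j + k
```

Because the distance is a single number, the theory's predictions can be compared directly with listeners' tension and similarity judgements, which is what makes it empirically testable.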

One area of musicological research particularly applicable to non-Western music is the work on performance analysis [Gabrielsson, 2003]. Historically, research in music performance has focused on finding general principles underlying expressive ‘deviations’ from the musical score, but more recently this has been extended to the study of improvised music. Understanding music performance will require a combination of approaches and disciplines: musicology, AI and machine learning, psychology and cognitive science.

Cognitive musicology

We can consider musical concepts as the outcome of a cognitive process shaped by exposure to a particular music-cultural environment. Such an approach starts from the recognition of the inherent embodiment of cognition: rather than being a static feature, cognitive behaviour is the result of a developmental process of continuous interaction between an agent and its environment. In this view, perception is not a passive act of acquiring information but an “embodied”, active process that is an essential element of cognition [Purwins et al., 2008a; Leman, 2008].

The learning of categories and schemata comprises important high-level processes in music cognition. Categories are conceptual devices that bundle together musical elements associated in some context. When several musical objects (e.g. notes, chords, phrases, articulations) share a set of properties or present similarities, they are candidates for assignment to the same category (e.g. “C#”, “G minor”, “Phrygian scale”, “appoggiatura”). Several computational approaches to the formation of perceptual categories exist. Marxer et al. [2007] have used hierarchical clustering to model the emergence of categories of scales, motifs and harmony. Mandler [1979] describes a schema as a knowledge structure “formed on the basis of past experience with objects, scenes, or events and consisting of a set of (usually unconscious) expectations about what things look like and/or the order in which they occur.” Tonal schemata, for example, might be formed through repeated exposure to melodic material. From a computational point of view, schemata have been implemented as frames, scripts, conditional rules, or self-organizing feature maps [Purwins et al., 2008b].
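The clustering idea can be conveyed with a small sketch: noisy pitch-class profiles of two triad types are clustered hierarchically, and the two chord categories emerge as the top-level clusters. This is a cartoon of the approach, not Marxer et al.'s actual model.

```python
# Cartoon of category emergence: hierarchical clustering of noisy
# pitch-class profiles separates two triad types into two clusters.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

def profile(pcs, noise=0.05):
    """A 12-dim pitch-class profile with energy on the given pitch classes."""
    v = np.zeros(12)
    v[list(pcs)] = 1.0
    return v + rng.normal(0.0, noise, 12)

observations = np.array(
    [profile({0, 4, 7}) for _ in range(5)] +   # C major triads (C-E-G)
    [profile({9, 0, 4}) for _ in range(5)]     # A minor triads (A-C-E)
)
Z = linkage(observations, method="average", metric="cosine")
print(fcluster(Z, t=2, criterion="maxclust"))  # two emergent categories
```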

The underlying mechanisms of emotional responses to music have been investigated by Juslin and Västfjäll [2008]. In the circumplex model of affect [Russell, 1980], emotion is spanned by the dimensions of pleasantness (or valence) and arousal, and the model has also been shown to be culture-dependent [Russell et al., 1989]. Extra-musical context (e.g., personal, social, political and economic) can be represented as a semantic network [Cano et al., 2004] and mathematically modelled as a dynamic system [Newman, 2003].
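As a minimal sketch of the semantic-network representation, in the spirit of Cano et al. [2004] but with invented nodes and relations rather than their actual schema:

```python
# Extra-musical context as a semantic network (invented example data),
# using the networkx library.
import networkx as nx

G = nx.Graph()
G.add_edge("fado", "Lisbon", relation="originates_in")
G.add_edge("fado", "saudade", relation="evokes")
G.add_edge("fado", "Amália Rodrigues", relation="performed_by")
G.add_edge("saudade", "melancholy", relation="related_emotion")

# Contextual relatedness can be read off the graph structure, e.g. via
# shortest paths between concepts:
print(nx.shortest_path(G, "Lisbon", "melancholy"))
```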

Music interaction

If we consider culture as the integration of human behaviour that includes attitudes, norms, values, beliefs, actions, communications and groups (ethnic, religious, social, etc.), the recent paradigm of cultural computing tries to give users an interaction experience that is closely related to the core aspects of their culture. Thus, in the case of music interaction, we need to develop interfaces that let users engage with music content through the values and attributes of their own culture. As such, it is important to understand users' cultural determinants and how to render them during the interaction [Salem & Rauterberg, 2005]. It is also important to identify what kinds of interactive experiences will have the most supportive potential [Nakatsu et al., 2005] and what approaches we should develop that differ from the typical Western ones based on analytical reasoning and formal logic [Nisbett et al., 2001].

Complex phenomena like music exploration require complex interfaces, but interfaces that can serve both novice and advanced users. Recent work on tabletop interfaces, and especially on the Reactable as a musical interface [Jordà et al., 2007], provides a good starting point for developing interfaces for music exploration [Julià and Jordà, 2009; Roy et al., 2004]. Tabletop interfaces favour multi-dimensional, continuous real-time interaction, exploration and multi-user collaboration. They have the potential to maximize bidirectional bandwidth while also allowing delicate and intimate interaction, and their seamless integration of visual feedback and physical control makes interaction more natural and direct.