Music Information Retrieval
Relevant Lectures: 3.2T (guest lecture by Cynthia Liem)
This chapter introduces information retrieval through the lens of its application to music. We first look at audio music processing, then at some case studies, and finally note that music can also be a multimedia experience.
Audio Music Processing
This section focuses on content-based music processing of audio signals; it does not cover advanced music theory or symbolic representations of music.
Spectrograms
“A spectrogram is a visual representation of the spectrum of frequencies in a sound or other signal as they vary with time” (Wikipedia). It looks like this:
(Figure: example spectrogram. Image source: Wikimedia Commons)
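To make this concrete, here is a minimal sketch of how a spectrogram is computed, using `scipy.signal.spectrogram` on a synthetic 440 Hz tone (the signal and all parameters are illustrative):

```python
import numpy as np
from scipy import signal

# Synthetic stand-in for real audio: one second of a 440 Hz tone
# plus a little noise.
fs = 22050                                   # sample rate (Hz)
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t) + 0.01 * np.random.randn(fs)

# Short-time Fourier transform: overlapping windows, one magnitude
# spectrum per window.
f, frames, Sxx = signal.spectrogram(x, fs=fs, nperseg=1024, noverlap=512)

# Sxx has shape (frequency bins, time frames); plotting it on a dB
# scale gives the familiar spectrogram image.
Sxx_db = 10 * np.log10(Sxx + 1e-12)
print(Sxx_db.shape)
```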
Music Fundamentals
Acoustic Concepts
- Pitch is “the quality that makes it possible to judge sounds as ‘higher’ and ‘lower’ in the sense associated with music” (Wikipedia).
- An overtone is “a ‘partial wave’ that can be either a harmonic other than the fundamental, or an inharmonic partial. A harmonic frequency is an integer multiple of the fundamental frequency” (Wikipedia).
- Timbre is “the perceived sound quality of a musical note, sound, or tone that distinguishes different types of sound production, such as choir voices and musical instruments, such as string instruments, wind instruments, and percussion instruments” (Wikipedia).
- Loudness is “the characteristic of a sound that [correlates to the sound’s] physical strength (amplitude)” (Wikipedia).
  - Loudness is often said to be subjective, and dependent on pitch. ISO 226:2003 defines “equal-loudness” contours: sounds at different points on the same contour are perceived as equally loud, even though some are physically much more powerful. (Figure: ISO 226:2003 equal-loudness contours)
Temporal Concepts
- Rhythm generally means a “movement marked by the regulated succession of strong and weak elements, or of opposite or different conditions” (Wikipedia).
- Tempo is “the speed or pace of a given piece or subsection thereof, how fast or slow” (Wikipedia).
- Beat is “the basic unit of time, [or] the pulse (regularly repeating event)” (Wikipedia).
Higher-Level Concepts
- A melody is “a linear succession of musical tones that the listener perceives as a single entity” (Wikipedia).
- Harmony “considers the process by which the composition of individual sounds, or superpositions of sounds, is analyzed by hearing. Usually, this means simultaneously occurring frequencies, pitches (tones, notes), or chords” (Wikipedia).
- Structure “can be found at the level of part of a work, the entire work, or a group of works. Elements of music such as pitch, duration and timbre combine into small elements like motifs and phrases, and these in turn combine in larger structures” (Wikipedia).
Mid-Level Feature Representations
- A set of Mel-Frequency Cepstral Coefficients, or MFCCs, is a descriptor of a short piece of audio.
- “In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency” (Wikipedia).
- A chroma feature captures the “color” of a sound by mapping its spectral energy onto the twelve pitch classes.
- Unlike MFCCs, chroma features are largely timbre-independent (see the sketch below).
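As a minimal sketch, both representations can be extracted with librosa (the clip is librosa's bundled example, fetched on first use; the parameters are typical defaults, not prescribed by the lecture):

```python
import librosa

# Load librosa's bundled example clip (downloaded on first use).
y, sr = librosa.load(librosa.example('trumpet'))

# MFCCs: one short-term timbre descriptor per analysis frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Chroma: per-frame energy folded into the 12 pitch classes.
chroma = librosa.feature.chroma_stft(y=y, sr=sr)

print(mfcc.shape)    # (13, n_frames)
print(chroma.shape)  # (12, n_frames)
```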
Structure Analysis
An onset refers to the “attack” of a sound. Tracking onsets over time yields inter-onset intervals (IOIs), and autocorrelating the onset sequence of a piece of audio measures its self-similarity (see the sketch below).
- Autocorrelation is similar to cross-correlation, except that the signal is correlated with itself.
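A minimal sketch of this idea on a synthetic onset-strength envelope (the envelope and frame rate are made up for illustration):

```python
import numpy as np

# Hypothetical onset-strength envelope sampled at frame_rate frames/s;
# in practice this would come from an onset detector.
frame_rate = 100
period = 50                      # one onset every 0.5 s (120 BPM)
env = np.zeros(1000)
env[::period] = 1.0

# Autocorrelation: correlate the envelope with itself at every lag.
# Peaks at non-zero lags reveal the periodicity (self-similarity).
ac = np.correlate(env, env, mode='full')[len(env) - 1:]

lag = np.argmax(ac[1:]) + 1      # skip the trivial lag-0 peak
print(f"strongest period: {lag / frame_rate:.2f} s "
      f"-> {60 * frame_rate / lag:.0f} BPM")
```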
Case Studies
Many tasks in music information retrieval involve finding music that is, to some degree, similar to the input piece. Note that all of the following are clearly information retrieval tasks, since they take a query as input against the (relatively) static collection of all music in the world, and produce a (ranked) list of matching pieces of music.
| Retrieval Target | Shared Aspect | Matching Specificity | Case Study |
| --- | --- | --- | --- |
| (Compressed/enhanced) digital copies | Performance instance | Exact | Fingerprinting |
| *Semantic gap* | *Semantic gap* | *Semantic gap* | *Semantic gap* |
| Cover songs | Underlying musical work | Approximate | Cover Song Retrieval |
| Similar songs | Performer/composer | Global characteristics | |
| Similar songs | Genre | Global characteristics | Genre Classification |
The semantic gap indicates that, at this point, the retrieval task becomes more subjective, and therefore also more difficult.
Fingerprinting
Audio fingerprinting involves finding exact matches of songs from short samples (commercially: think Shazam and SoundHound) in a way that is both fast and robust to noise.
The basic concept of fingerprinting is to create a database of small, easily searchable representations of data points (in this application, songs). When a new song fragment comes in, the system takes its fingerprint and compares it to the rest of the database to see if there’s a match.
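A toy sketch of this lookup side, with a hypothetical `fingerprint` function standing in for a real, noise-robust one:

```python
from collections import defaultdict

def fingerprint(window):
    """Hypothetical stand-in: reduce a window of feature values to a
    small integer key. A real system uses a noise-robust scheme."""
    return hash(tuple(round(v, 1) for v in window)) & 0xFFFFFFFF

database = defaultdict(list)   # fingerprint -> [(song_id, position)]

def index_song(song_id, features, window=8):
    for pos in range(len(features) - window + 1):
        key = fingerprint(features[pos:pos + window])
        database[key].append((song_id, pos))

def match(fragment, window=8):
    # Each matching window votes for the song it was indexed under.
    votes = defaultdict(int)
    for pos in range(len(fragment) - window + 1):
        for song_id, _ in database[fingerprint(fragment[pos:pos + window])]:
            votes[song_id] += 1
    return max(votes, key=votes.get) if votes else None
```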
At the core of this is finding a robust fingerprinting technique. One such technique is the Philips fingerprinting algorithm (.pdf).
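As a rough sketch of that algorithm's core idea, each frame yields a 32-bit sub-fingerprint from the signs of energy differences across neighbouring frequency bands and consecutive frames; the band edges, frame length, and hop below are illustrative, not the paper's exact values:

```python
import numpy as np

def philips_subfingerprints(x, fs, n_bands=33, frame_len=2048, hop=64):
    """Sketch of the Philips scheme: per frame, 32 bits from signed
    energy differences across neighbouring bands and frames."""
    # Log-spaced band edges (the paper uses roughly 300-2000 Hz).
    edges = np.geomspace(300, 2000, n_bands + 1)

    prev_diff = None
    bits = []
    for start in range(0, len(x) - frame_len, hop):
        frame = x[start:start + frame_len] * np.hanning(frame_len)
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        freqs = np.fft.rfftfreq(frame_len, d=1 / fs)

        # Energy per band, then differences between adjacent bands.
        energy = np.array([spectrum[(freqs >= lo) & (freqs < hi)].sum()
                           for lo, hi in zip(edges[:-1], edges[1:])])
        diff = np.diff(energy)          # 32 band-to-band differences

        if prev_diff is not None:
            # Bit m: sign of the change in the adjacent-band energy
            # difference between this frame and the previous one.
            bits.append((diff - prev_diff) > 0)
        prev_diff = diff
    return np.array(bits)               # shape: (n_frames - 1, 32)
```

Matching then reduces to comparing bit strings by Hamming distance, which is what makes the lookup both fast and robust.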
Cover Song Retrieval
Cover songs may differ from their originals in tempo and timbre. There are several techniques to counter this, such as computing minimum edit distances using dynamic programming (sketched below).
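A minimal dynamic-programming sketch of that idea, aligning two chroma sequences with a cosine local cost (one common choice; a real system would also rotate the chroma vectors to handle key changes):

```python
import numpy as np

def dtw_cost(A, B):
    """Minimal alignment cost between two chroma sequences (one
    12-dimensional vector per column), tolerating tempo changes by
    letting either sequence skip frames."""
    def d(a, b):  # cosine distance between two chroma vectors
        return 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    n, m = A.shape[1], B.shape[1]
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = d(A[:, i - 1], B[:, j - 1]) + min(
                D[i - 1, j],      # skip a frame of A
                D[i, j - 1],      # skip a frame of B
                D[i - 1, j - 1])  # match the two frames
    return D[n, m]
```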
Genre Classification
Genre classification uses a bag-of-frames approach similar to the bag-of-words approach in image processing. See classification.
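A minimal sketch of the bag-of-frames pipeline, using synthetic stand-in data in place of real MFCC frames:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs: per-song MFCC matrices (n_frames, 13) and
# genre labels; in practice these come from a feature extractor.
rng = np.random.default_rng(0)
songs = [rng.normal(size=(200, 13)) + offset
         for offset in [0.0] * 10 + [3.0] * 10]
labels = [0] * 10 + [1] * 10

# 1. Learn a codebook over all frames, ignoring their order --
#    the "bag" in bag-of-frames.
codebook = KMeans(n_clusters=16, n_init=10, random_state=0)
codebook.fit(np.vstack(songs))

# 2. Represent each song as a histogram of codeword counts.
def bag_of_frames(frames):
    counts = np.bincount(codebook.predict(frames), minlength=16)
    return counts / counts.sum()

X = np.array([bag_of_frames(s) for s in songs])

# 3. Any off-the-shelf classifier can now predict the genre.
clf = LogisticRegression().fit(X, labels)
print(clf.score(X, labels))
```

The ordering of frames is discarded entirely, just as bag-of-words discards word order.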
Music & Multimedia
Music can also be a multimedia experience. Sadly, the (very cool) examples shown in the lecture do not translate well to a markdown file.