Classification
Classification is one of the common tasks in multimedia analysis and machine learning. A classifier takes a data point as input and assigns it a class label as output.
In this chapter, we look at music genre classification using the bag-of-frames approach, and at using a language model as a naive Bayes classifier for text.
Genre Classification Using Bag of Frames
In the bag-of-frames approach, we first train a model for each genre. We collect labeled audio content for each genre and apply the following steps:
- Split labeled audio content into short segments that are assumed to be stable.
- Apply a feature descriptor such as MFCCs (mel-frequency cepstral coefficients) to each segment to get one feature vector per segment.
- Create a bag of audio vectors from the audio descriptors.
- This is a “bag of” audio vectors because the time information is lost.
- Using the means and variances of the positions of these vectors in n-dimensional space, create a representative multivariate Gaussian model for the genre.
- Sometimes we use a Gaussian mixture model instead, if the data forms multiple clusters in some dimensions. (Think of the heights of humans: there will be a peak at the average child’s height, the average woman’s height, and the average man’s height.)
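The model-fitting step can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes a diagonal covariance (one variance per dimension, with no correlations between dimensions), and the function name `fit_diagonal_gaussian`, the `min_variance` floor, and the toy two-dimensional vectors standing in for real MFCC frames are all hypothetical.

```python
from statistics import fmean, pvariance


def fit_diagonal_gaussian(frames, min_variance=1e-6):
    """Fit a per-dimension mean and variance to a bag of feature vectors.

    Assumption: dimensions are treated as independent (diagonal covariance).
    The order of `frames` does not matter, which is the "bag" property.
    """
    dims = list(zip(*frames))  # transpose: one tuple of values per dimension
    means = [fmean(d) for d in dims]
    # Floor the variance so a constant dimension cannot give a zero variance.
    variances = [
        max(pvariance(d, mu), min_variance) for d, mu in zip(dims, means)
    ]
    return means, variances


# Toy 2-dimensional vectors standing in for the MFCC frames of one genre.
rock_frames = [[1.0, 2.0], [3.0, 2.0], [2.0, 5.0]]
rock_means, rock_variances = fit_diagonal_gaussian(rock_frames)
```

Shuffling `rock_frames` yields the same model, which shows why the time information is lost. A full covariance matrix or a Gaussian mixture would need a numerical library and is left out of this sketch.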
Once we have trained the model:
- We calculate the feature vectors for short segments of the song we want to classify.
- For each segment’s vector and each genre’s Gaussian model, we calculate the probability that the model produced the vector.
- We average these scores over all segments to get one score per genre for the song.
- The classifier output for the song is the genre with the highest score.
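The classification steps might look like the sketch below. It makes several assumptions that go beyond the text: each genre model is a diagonal Gaussian stored as a `(means, variances)` pair, the per-segment score is the log of the Gaussian density rather than the raw value (logs avoid numerical underflow and preserve which single value is larger), and the genre models and song vectors are made-up toy values. The names `log_likelihood` and `classify_song` are hypothetical.

```python
import math


def log_likelihood(frame, means, variances):
    """Log density of one feature vector under a diagonal Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
        for x, mu, var in zip(frame, means, variances)
    )


def classify_song(frames, genre_models):
    """Return the best genre and the average per-segment score of each genre."""
    scores = {
        genre: sum(log_likelihood(f, means, variances) for f in frames)
        / len(frames)
        for genre, (means, variances) in genre_models.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical trained models: genre -> (means, variances).
genre_models = {
    "rock": ([2.0, 3.0], [1.0, 1.0]),
    "jazz": ([-2.0, 0.0], [1.0, 1.0]),
}
# Toy feature vectors for the segments of one unlabeled song.
song_frames = [[1.8, 2.9], [2.3, 3.4], [1.5, 2.2]]
genre, scores = classify_song(song_frames, genre_models)
print(genre, scores)
```

Averaging the log scores corresponds to a geometric rather than arithmetic mean of the densities, so it is one reasonable reading of "average score per song", not the only one.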
Naive Bayes Classifier for Text
Using the unigram language model, we can calculate the probability that some source material “generated” a particular sentence.
An obvious extension is to classify the sentence as belonging to the source with the highest probability, instead of returning a ranked list of possible results.
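A minimal sketch of this idea follows. It assumes whitespace tokenization, add-one smoothing so that unseen words do not produce a zero probability, and equal prior probabilities for all sources (a full naive Bayes classifier would also multiply in a prior per class). The function names and the two tiny training texts are invented for illustration.

```python
import math
from collections import Counter


def train_unigram(text):
    """Count word occurrences in the source text (whitespace tokenization)."""
    counts = Counter(text.lower().split())
    return counts, sum(counts.values())


def log_prob(sentence, model, vocab_size):
    """Log probability of the sentence under a smoothed unigram model."""
    counts, total = model
    return sum(
        math.log((counts[word] + 1) / (total + vocab_size))
        for word in sentence.lower().split()
    )


def classify(sentence, models):
    """Return the source whose unigram model scores the sentence highest."""
    vocab = set()
    for counts, _ in models.values():
        vocab.update(counts)
    vocab_size = len(vocab) + 1  # one extra slot for words never seen
    scores = {
        name: log_prob(sentence, model, vocab_size)
        for name, model in models.items()
    }
    return max(scores, key=scores.get)


# Hypothetical source texts, far too small for real use.
models = {
    "cooking": train_unigram(
        "stir the sauce and simmer the onions then season the sauce"
    ),
    "sailing": train_unigram("raise the sail and steer the boat into the wind"),
}
label = classify("simmer the sauce", models)
print(label)
```

Summing log probabilities instead of multiplying raw probabilities keeps long sentences from underflowing to zero; the ranking of sources is unchanged.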