Evaluating Multimedia Systems

Before diving into the specifics of different types of multimedia systems, we must first know how to evaluate them. This chapter covers the evaluation of supervised MMA tasks using hold-out evaluation, cross-validation, and several evaluation metrics, as well as the evaluation of information retrieval tasks.

We evaluate multimedia systems to objectively determine how well they perform - “just playing” with a system (and its parameters) isn’t good enough.

Evaluating Supervised MMA Tasks

In many supervised multimedia analysis and machine learning tasks, such as classification, there is a training step followed by an evaluation step.

These two settings (training and evaluation) are analogous and require the same kind of data: labeled inputs. It’s also very important that the training and evaluation sets are representative of each other; otherwise the system ends up optimized for data that does not resemble the data it is evaluated on.

So the “trick” to evaluating multimedia systems in a supervised setting is in deciding how to split the underlying dataset.

Hold-Out Evaluation

Hold-out evaluation involves training your system on one part of the data and evaluating it on another part; a code sketch of the procedure follows the steps below.

  1. Randomly divide the data set into a mutually exclusive training set and a testing set.
    • The fraction of the data placed in the training set is called the training factor. A training factor of 0.8 (a common choice) means 80% of the data is in the training set.
    • Never look at the testing set before it is time to evaluate.
  2. Train the system using the training set.
  3. Run the testing set through the system and calculate an evaluation metric on the results.
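
As a concrete illustration, here is a minimal sketch of this procedure in Python. The functions train, predict, and metric are hypothetical placeholders passed in by the caller (they are not from any particular library), and the data is assumed to be a list of (input, label) pairs.

    import random

    def hold_out_split(data, training_factor=0.8, seed=0):
        """Step 1: randomly divide the data into mutually exclusive training and testing sets."""
        shuffled = data[:]                               # copy so the original list is untouched
        random.Random(seed).shuffle(shuffled)
        cut = int(len(shuffled) * training_factor)       # e.g. 0.8 -> 80% of the data for training
        return shuffled[:cut], shuffled[cut:]            # (training set, testing set)

    def hold_out_evaluate(data, train, predict, metric, training_factor=0.8, seed=0):
        """Steps 2-3: train on the training set, then score predictions on the testing set."""
        train_set, test_set = hold_out_split(data, training_factor, seed)
        model = train(train_set)                         # step 2: only the training set is seen here
        predictions = [predict(model, x) for x, _ in test_set]
        true_labels = [y for _, y in test_set]
        return metric(true_labels, predictions)          # step 3: e.g. accuracy on the testing set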

Cross-Validation

Cross-validation repeats steps 1-3 of hold-out evaluation several times, each time with a different random split, and averages the values of the evaluation metric to obtain the final evaluation.
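
Continuing the sketch above, a hedged example of this repetition in Python; again, train, predict, and metric are hypothetical placeholders, and each run simply uses a different random split before averaging.

    import random

    def cross_validate(data, train, predict, metric, runs=5, training_factor=0.8):
        """Repeat the hold-out steps 1-3 with a fresh random split each run, then average the metric."""
        scores = []
        for run in range(runs):
            shuffled = data[:]
            random.Random(run).shuffle(shuffled)                  # step 1: a different split per run
            cut = int(len(shuffled) * training_factor)
            train_set, test_set = shuffled[:cut], shuffled[cut:]
            model = train(train_set)                              # step 2: train on this run's training set
            predictions = [predict(model, x) for x, _ in test_set]
            scores.append(metric([y for _, y in test_set], predictions))  # step 3: score this run
        return sum(scores) / len(scores)                          # final evaluation = average over runs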

Evaluation Metrics

Most evaluation metrics are calculated from how many of the testing set’s predicted labels fall into the following categories:

  • True positives (TP): items predicted as positive that are actually positive.
  • False positives (FP): items predicted as positive that are actually negative.
  • True negatives (TN): items predicted as negative that are actually negative.
  • False negatives (FN): items predicted as negative that are actually positive.

We can then calculate any of the following metrics:

  • Accuracy: (TP + TN) / (TP + FP + TN + FN), the fraction of items labeled correctly.
  • Precision: TP / (TP + FP), the fraction of items labeled positive that really are positive.
  • Recall: TP / (TP + FN), the fraction of positive items that are labeled positive.
  • F-measure: 2 · precision · recall / (precision + recall), the harmonic mean of precision and recall.
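
As a sketch, the same counts and formulas in Python for a binary task; the function name and the choice of which label counts as “positive” are illustrative assumptions, not part of any library.

    def binary_metrics(true_labels, predicted_labels, positive="positive"):
        """Count TP/FP/TN/FN and derive accuracy, precision, recall, and F-measure."""
        pairs = list(zip(true_labels, predicted_labels))
        tp = sum(1 for t, p in pairs if p == positive and t == positive)
        fp = sum(1 for t, p in pairs if p == positive and t != positive)
        tn = sum(1 for t, p in pairs if p != positive and t != positive)
        fn = sum(1 for t, p in pairs if p != positive and t == positive)
        accuracy  = (tp + tn) / len(pairs)
        precision = tp / (tp + fp) if (tp + fp) else 0.0           # guard against dividing by zero
        recall    = tp / (tp + fn) if (tp + fn) else 0.0
        f_measure = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
        return {"accuracy": accuracy, "precision": precision, "recall": recall, "f_measure": f_measure}

For example, if the true labels are positive, negative, positive, negative and the predictions are positive, positive, negative, negative, each of the four counts is 1, and all four metrics come out to 0.5.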

It is up to the system designer (you!) to decide which metrics are most important to the system. Make sure to make (and document!) this decision before running tests.

Evaluating Unsupervised MMA Tasks

Evaluating unsupervised multimedia analysis tasks, such as clustering or segmentation, is out of scope for TI2716-C.

Evaluating Information Retrieval Systems

Information retrieval systems have a fairly static, large document collection, but do not really have the concept of a training set.

Instead, we can write several queries and define the relevant items for those queries. We can then evaluate the system by comparing the items it retrieves for each query to the relevant items. We define the precision and recall “at N”, i.e. computed over the top N retrieved items:

  • Precision@N: the fraction of the top N retrieved items that are relevant.
  • Recall@N: the fraction of all relevant items that appear among the top N retrieved items.
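
A small sketch of these two measures in Python, treating the retrieved results as a ranked list and the relevant items as a set; the function names are illustrative, not from a retrieval library.

    def precision_at_n(retrieved, relevant, n):
        """Fraction of the top N retrieved items that are relevant."""
        top_n = retrieved[:n]
        return sum(1 for item in top_n if item in relevant) / n

    def recall_at_n(retrieved, relevant, n):
        """Fraction of all relevant items that appear in the top N retrieved items."""
        top_n = retrieved[:n]
        return sum(1 for item in top_n if item in relevant) / len(relevant)

For a query whose relevant items are {a, c, e} and whose ranked result list is [a, b, c, d], precision at 4 is 2/4 = 0.5 and recall at 4 is 2/3 ≈ 0.67.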

