Information Retrieval Using the Vector Space Model

The vector space model is one way to implement the document ranking or retrieval functions in information retrieval.

The Term-Document Matrix

Analogous to the bag-of-words model for text, the term-document matrix contains the frequency f_{t,d} with which each term t occurs in each document d.

Such a term-document matrix looks like this, where f_{t,d} is some nonnegative integer:

          Document 1   Document 2   Document n
Term 1    f_{1,1}      f_{1,2}      f_{1,n}
Term 2    f_{2,1}      f_{2,2}      f_{2,n}
Term 3    f_{3,1}      f_{3,2}      f_{3,n}
Term m    f_{m,1}      f_{m,2}      f_{m,n}

We represent documents as a set of indexing terms that “[capture] the essence of the topic of a document” (Wikipedia).
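
As a rough illustration of the matrix described above, the following Python sketch counts term occurrences for a small, made-up collection. The function name build_term_document_matrix, the sample documents, and the whitespace tokenization are assumptions chosen for this example; a real system would typically use a proper tokenizer and a sparse matrix.

```python
from collections import Counter


def build_term_document_matrix(documents):
    """Return (terms, matrix), where matrix[t][d] is the count of terms[t] in documents[d]."""
    # Assumption: naive tokenization by lowercasing and splitting on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    counts = [Counter(tokens) for tokens in tokenized]
    # A sorted vocabulary gives a stable row order (one row per term).
    terms = sorted({term for tokens in tokenized for term in tokens})
    # Counter returns 0 for terms that do not occur in a document.
    matrix = [[doc_counts[term] for doc_counts in counts] for term in terms]
    return terms, matrix


# Hypothetical sample collection, used only for illustration.
docs = ["the cat sat", "the dog sat down", "the cat chased the dog"]
terms, matrix = build_term_document_matrix(docs)
for term, row in zip(terms, matrix):
    print(f"{term:8}{row}")
```

Each row corresponds to one term and each column to one document, so every entry plays the role of f_{t,d} and is a nonnegative integer.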

The Vector Space Model

The vector space model is an algebraic model for representing text documents (and objects in general) as vectors of identifiers, such as index terms (Wikipedia).

A user’s query can then be translated into a vector in the same term space, and the distance between the query vector and each document vector in the document store (“document representation” in the diagram above) can be calculated. The k documents with the smallest distance from the user’s query then form the top-k ranking, i.e. the result list.
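
The ranking step can be sketched in Python as follows. The text does not say which distance measure is used, so cosine distance (1 minus cosine similarity) is an assumption here, as are the function names to_vector, cosine_distance, and rank, the limit parameter for the size of the result list, and the sample documents.

```python
import math
from collections import Counter


def to_vector(text, vocabulary):
    """Map a text to its term-frequency vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine_distance(u, v):
    """Return 1 - cosine similarity; an all-zero vector is treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def rank(query, documents, limit):
    """Return the indices of the `limit` documents closest to the query."""
    # The vocabulary is built from the documents only; query terms that occur
    # in no document are ignored.
    vocabulary = sorted({t for d in documents for t in d.lower().split()})
    query_vec = to_vector(query, vocabulary)
    scored = [
        (cosine_distance(query_vec, to_vector(doc, vocabulary)), index)
        for index, doc in enumerate(documents)
    ]
    scored.sort()  # smallest distance first; ties broken by document index
    return [index for _, index in scored[:limit]]


# Hypothetical sample collection, used only for illustration.
docs = ["the cat sat", "the dog sat down", "the cat chased the dog"]
print(rank("cat chased", docs, 2))  # prints [2, 0]
```

Cosine distance ignores vector length, so a long document is not penalized merely for containing more words. Euclidean distance would be an equally valid reading of "distance" but would favor documents of similar length to the query.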

Beyond Simple Term Frequency

There are several techniques that go beyond comparing documents by their raw term-frequency vectors.

As usual, the system goals determine which of these techniques are appropriate to use.

