Information Retrieval Using the Unigram Language Model

The unigram language model offers another way, alongside the vector space model, to implement the document ranking (retrieval) function in information retrieval.

Language Models

A language model assigns a probability to a sentence.

A language model always represents some human language source, in the form of text produced by a certain person, on a certain topic, or for a certain reason. The model calculates the probability that some string of words was “generated by” this source.

To train a language model, you need to:

  1. Choose a language source.
    • e.g. Jane Austen.
  2. Choose a training set.
    • e.g. books by Jane Austen.
  3. Determine the vocabulary.
    • e.g. all distinct words in the training set.
  4. Estimate the necessary probabilities.
    • e.g. the count of a word divided by the total word count.
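
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `train_unigram`, the lowercase-and-split tokenizer, and the two-line training set are all assumptions made for the example.

```python
from collections import Counter

def train_unigram(texts):
    """Estimate unigram probabilities from a list of training strings."""
    # Assumed tokenizer: lowercase, then split on whitespace.
    tokens = [word for text in texts for word in text.lower().split()]
    counts = Counter(tokens)
    total = sum(counts.values())
    # Step 4: the count of a word divided by the total word count.
    return {word: count / total for word, count in counts.items()}

# Steps 1 and 2: a toy stand-in for "books by Jane Austen".
training_set = [
    "it is a truth universally acknowledged",
    "it is a truth",
]
model = train_unigram(training_set)

# Step 3: the vocabulary is every distinct word seen in training.
vocabulary = set(model)
```

Because every probability is a count divided by the same total, the values in `model` sum to 1 over the vocabulary.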

Unigram Language Model

In a unigram language model, the probability of a sentence is simply the product of the model’s probabilities for each word in the sentence: P(w1 w2 … wn) = P(w1) × P(w2) × … × P(wn). Each word is treated as independent of the words around it.
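
As a rough sketch of this calculation, the snippet below multiplies per-word probabilities from a hand-written toy model. The name `sentence_probability`, the whitespace tokenizer, and the probabilities in `toy_model` are hypothetical; treating an unseen word as probability 0 is also an assumption that follows from using raw, unsmoothed estimates.

```python
def sentence_probability(model, sentence):
    """Multiply the model's probability for each word in the sentence."""
    prob = 1.0
    for word in sentence.lower().split():
        # Assumption: a word the model has never seen gets probability 0.
        prob *= model.get(word, 0.0)
    return prob

# Hypothetical model whose probabilities sum to 1.
toy_model = {"the": 0.5, "cat": 0.25, "sat": 0.25}

p = sentence_probability(toy_model, "the cat sat")   # 0.5 * 0.25 * 0.25
p_unseen = sentence_probability(toy_model, "the dog sat")
```

Note that word order plays no role: "sat the cat" receives exactly the same probability as "the cat sat".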

To use this for retrieval, we train one language model for each document in the collection. We then take the user’s query and calculate the probability that each of these models generated it. The n documents whose models give the query the highest probability form the top-n results list.
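
The ranking procedure might look like the following sketch. The three-document collection, the function names, and the whitespace tokenizer are invented for illustration. The probabilities are plain relative frequencies, so a document that lacks any query word scores zero.

```python
from collections import Counter

def train_unigram(text):
    """One model per document: word count divided by document length."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}

def query_probability(model, query):
    """Probability that the document's model generated the query."""
    prob = 1.0
    for word in query.lower().split():
        prob *= model.get(word, 0.0)  # unseen word: probability 0
    return prob

def rank_documents(docs, query, n):
    """Return the ids of the n documents most likely to have generated the query."""
    models = {doc_id: train_unigram(text) for doc_id, text in docs.items()}
    scored = [(query_probability(m, query), doc_id) for doc_id, m in models.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:n]]

# Hypothetical collection.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "dogs and cats are pets",
}
top = rank_documents(docs, "the cat", n=2)
```

Here `d2` ranks above `d1` because it is shorter, so "the" and "cat" make up a larger share of its words (2/5 × 1/5 versus 2/6 × 1/6), while `d3` contains neither query word and scores zero.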
