Why is smoothing applied in language models?

Because some n-grams in the test set may never appear in the training set. For example, if the training corpus is
$w_{train}\,=\,This\,is\,the\,only\,sentence\,in\,the\,corpus$

and you need to find the probability of a sequence like

$w_{test}\,=\,This\,is\,the\,sentence\,in\,the\,corpus$

$p(w_{test}) = p(this \mid <START>) \cdot p(is \mid this) \cdots p(sentence \mid the) \cdots p(corpus \mid the)$

where <START> is a special token prepended to the beginning of the document.

Then

$p(sentence | the) = 0.0$

because the bigram “the sentence” never occurs in the training set. That single zero factor makes the whole product $p(w_{test})$ zero, even though the test sequence is clearly plausible given the training set. To avoid this, add-k or other smoothing techniques are used so that every conditional probability is non-zero.
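
As a rough illustration, here is a minimal Python sketch of add-k smoothing on the toy corpus above. It assumes the common form $p(w \mid prev) = \frac{c(prev, w) + k}{c(prev) + k|V|}$, where $c$ is a count and $|V|$ is the vocabulary size. The function names, the lowercasing of tokens, and taking the vocabulary to be the distinct training words (without <START>) are choices made for this example, not part of the original answer.

```python
from collections import Counter

START = "<START>"  # start-of-document token, as in the example above


def bigram_counts(tokens):
    """Count contexts and bigrams, with the start token prepended."""
    padded = [START] + tokens
    context_counts = Counter(padded[:-1])
    pair_counts = Counter(zip(padded[:-1], padded[1:]))
    return context_counts, pair_counts


def bigram_prob(word, prev, context_counts, pair_counts, vocab_size, k=0.0):
    """Add-k estimate of p(word | prev); k=0 gives the unsmoothed estimate."""
    numerator = pair_counts[(prev, word)] + k
    denominator = context_counts[prev] + k * vocab_size
    return numerator / denominator if denominator > 0 else 0.0


def sequence_prob(tokens, context_counts, pair_counts, vocab_size, k=0.0):
    """Product of bigram probabilities over the whole sequence."""
    padded = [START] + tokens
    prob = 1.0
    for prev, word in zip(padded[:-1], padded[1:]):
        prob *= bigram_prob(word, prev, context_counts, pair_counts, vocab_size, k)
    return prob


# Tokens are lowercased here so "This" and "this" count as the same word.
train = "this is the only sentence in the corpus".split()
test = "this is the sentence in the corpus".split()

context_counts, pair_counts = bigram_counts(train)
vocab_size = len(set(train))  # assumption: vocabulary = distinct training words

# Unsmoothed: "the sentence" was never seen, so the estimate is zero.
print(bigram_prob("sentence", "the", context_counts, pair_counts, vocab_size))
# Add-1 (k=1): (0 + 1) / (2 + 7) = 1/9
print(bigram_prob("sentence", "the", context_counts, pair_counts, vocab_size, k=1.0))
# The whole test sequence goes from probability zero to a small positive value.
print(sequence_prob(test, context_counts, pair_counts, vocab_size, k=0.0))
print(sequence_prob(test, context_counts, pair_counts, vocab_size, k=1.0))
```

With k = 0 the test sequence gets probability zero. With k = 1 every bigram gets some probability mass, at the cost of taking mass away from bigrams that were actually observed. Smaller values of k reduce that distortion.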

