Out-of-vocabulary (OOV) words are words that do not appear in the training set but do appear in the test set or in real data. The main problem is that the model assigns zero probability to out-of-vocabulary words, which makes the likelihood of any sentence containing them zero. This is a common problem, especially when the model has been trained on a small data set. There are several techniques for handling out-of-vocabulary words:
- Typically, a special out-of-vocabulary token (often written `<UNK>`) is added to the language model. A common trick is to replace the first occurrence of each word in the training data with this token, which ensures the out-of-vocabulary token occurs somewhere in the training data and receives a positive probability.
- Smoothing is a common technique in language models: we add a constant to the numerator and denominator when estimating word probabilities, so that none of the probabilities go to zero. See https://vitalflux.com/quick-introduction-smoothing-techniques-language-models/ or https://medium.com/@theflyingmantis/basics-of-nlp-2-266748a40a3a for more details. This trick can be applied to unigram models as well as to n-gram models.
- Another common trick, particularly when working with word-embedding-based solutions, is to replace the out-of-vocabulary word with a nearby in-vocabulary word from some form of synonym dictionary. For example, take 'I want to know what you are consuming'. If 'consuming' is not in the vocabulary, replace it to get 'I want to know what you are eating'. Take a look at the following article for more details: https://medium.com/cisco-emerge/creating-semantic-representations-of-out-of-vocabulary-words-for-common-nlp-tasks-842dbdafba18
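The first two techniques above can be combined in a few lines. The sketch below is a minimal illustration, not a production implementation: it maps rare training words to an `<UNK>` token (the `min_count` threshold and the function names are my own choices for the example) and applies add-k smoothing so every word, seen or unseen, gets a strictly positive probability.

```python
from collections import Counter

def train_unigram(tokens, min_count=2):
    """Replace rare words with <UNK> and count unigrams.

    Words seen fewer than `min_count` times are mapped to <UNK>,
    so the <UNK> token itself is observed during training and
    gets a positive count."""
    raw_counts = Counter(tokens)
    vocab = {w for w, c in raw_counts.items() if c >= min_count}
    mapped = [w if w in vocab else "<UNK>" for w in tokens]
    return Counter(mapped), vocab

def unigram_prob(word, counts, vocab, k=1):
    """Add-k smoothed unigram probability (Laplace smoothing for k=1).

    An unseen test word is first mapped to <UNK>; the +k in the
    numerator and +k*V in the denominator keep every probability
    strictly positive."""
    word = word if word in vocab else "<UNK>"
    total = sum(counts.values())
    v = len(counts)  # vocabulary size, including <UNK>
    return (counts[word] + k) / (total + k * v)
```

With this in place, a word never seen in training (say, "zebra") is scored as `<UNK>` and receives a nonzero probability instead of driving the sentence likelihood to zero.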
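The synonym-replacement trick can be sketched as follows. The tiny `synonyms` dictionary here is a stand-in assumption for a real resource such as WordNet or nearest neighbours in an embedding space (as in the linked article); the point is only the lookup logic: keep in-vocabulary tokens, and swap an OOV token for the first synonym the model has actually seen.

```python
def replace_oov(tokens, vocab, synonyms):
    """Replace out-of-vocabulary tokens with an in-vocabulary synonym
    when one is available; otherwise keep the original token."""
    out = []
    for tok in tokens:
        if tok in vocab:
            out.append(tok)
        else:
            # pick the first candidate synonym the model was trained on
            replacement = next(
                (s for s in synonyms.get(tok, []) if s in vocab), tok
            )
            out.append(replacement)
    return out
```

Applied to the example above with 'eating' in the vocabulary and 'consuming' mapped to it, the sentence 'i want to know what you are consuming' becomes 'i want to know what you are eating'.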