Say you’ve generated a language model using Bag of Words (BoW) with 1-hot encoding, and your training set has a lot of sentences with the word “good” but none with the word “great”. Suppose you then see the sentence “Have a great day”: under this language model, p(great) = 0.0. How can you solve this problem by leveraging the fact that “good” and “great” are similar words?
BoW with 1-hot encoding doesn’t capture the meaning of words; it only captures co-occurrence statistics. Every word is an independent symbol, so “good” and “great” look no more alike than any other pair of words. We need to build the language model on features that represent the meaning of the words.
A simple solution could be to cluster the word embeddings and map each group of synonyms to a single token. Alternatively, when a word has zero probability, look up the probability of a synonym instead.
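The synonym look-up can be sketched as follows. This is only an illustration under stated assumptions: the three-sentence corpus, the hand-picked 2-d embeddings, the `min_similarity` threshold, and the function names are all hypothetical, and a real system would use pretrained embeddings.

```python
import math
from collections import Counter

# Toy training corpus (assumed): contains "good" but never "great".
corpus = ["have a good day", "good food", "a good day"]
counts = Counter(word for sentence in corpus for word in sentence.split())
total = sum(counts.values())

# Hypothetical 2-d embeddings, hand-picked for illustration only.
embeddings = {
    "good": (0.9, 0.1),
    "great": (0.8, 0.2),
    "day": (0.1, 0.9),
    "food": (0.2, 0.7),
    "have": (-0.5, 0.3),
    "a": (-0.1, -0.8),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def unigram_prob(word):
    """Maximum-likelihood unigram probability; 0.0 for unseen words."""
    return counts[word] / total

def backoff_prob(word, min_similarity=0.9):
    """Use the word's own probability if seen, else its nearest seen neighbour's."""
    if counts[word] > 0:
        return unigram_prob(word)
    if word not in embeddings:
        return 0.0
    seen = [w for w in counts if w in embeddings]
    nearest = max(seen, key=lambda w: cosine(embeddings[word], embeddings[w]))
    if cosine(embeddings[word], embeddings[nearest]) < min_similarity:
        return 0.0  # no seen word is close enough to act as a synonym
    return unigram_prob(nearest)
```

With these toy values, `backoff_prob("great")` falls back to the probability of “good” instead of returning zero. Note that the borrowed probabilities no longer sum to one over the vocabulary, so a real model would need to renormalize.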
A more principled approach is to build the language model on distributed representations, as in the neural probabilistic language model (https://papers.nips.cc/paper/1839-a-neural-probabilistic-language-model.pdf).
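To make the idea concrete, here is a minimal forward pass in the spirit of that model: context words are mapped to shared embeddings, passed through a tanh hidden layer, and a softmax produces a distribution over the whole vocabulary. It is a simplified sketch, not the paper's implementation: the weights are random and untrained, biases and direct connections are omitted, and the vocabulary, dimensions, and names (`C`, `H`, `U`, `next_word_probs`) are assumptions chosen for illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

vocab = ["have", "a", "good", "great", "day"]
index = {word: i for i, word in enumerate(vocab)}
EMBED_DIM, HIDDEN_DIM, CONTEXT = 3, 4, 2

def rand_matrix(rows, cols):
    """Matrix of small random weights (stand-in for trained parameters)."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

C = rand_matrix(len(vocab), EMBED_DIM)            # shared word embeddings
H = rand_matrix(HIDDEN_DIM, CONTEXT * EMBED_DIM)  # concatenated context -> hidden
U = rand_matrix(len(vocab), HIDDEN_DIM)           # hidden -> one score per word

def matvec(matrix, x):
    """Matrix-vector product on plain Python lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

def next_word_probs(context):
    """Distribution over the next word given CONTEXT previous words."""
    x = [value for word in context for value in C[index[word]]]
    hidden = [math.tanh(h) for h in matvec(H, x)]
    return dict(zip(vocab, softmax(matvec(U, hidden))))

probs = next_word_probs(["have", "a"])
```

Because the output is a softmax, every vocabulary word, including “great”, receives a strictly positive probability. After training, words used in similar contexts would be expected to end up with nearby embeddings, which is what lets evidence about “good” carry over to “great”.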
Other workarounds for the zero-probability problem involve various kinds of smoothing, such as add-one (Laplace) smoothing, though they do not leverage the semantic closeness of similar words.
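As a point of comparison, add-alpha (Laplace) smoothing on a unigram model might look like the sketch below. The toy corpus, the extra vocabulary words, and the name `laplace_prob` are assumptions made for illustration.

```python
from collections import Counter

# Toy training corpus (assumed): contains "good" but never "great" or "awful".
corpus = ["have a good day", "good food", "a good day"]
counts = Counter(word for sentence in corpus for word in sentence.split())
total = sum(counts.values())

# Vocabulary includes words that never appeared in training.
vocab = set(counts) | {"great", "awful"}

def laplace_prob(word, alpha=1.0):
    """Add-alpha smoothed unigram probability over the fixed vocabulary."""
    return (counts[word] + alpha) / (total + alpha * len(vocab))
```

Here “great” and “awful” receive exactly the same non-zero probability: smoothing removes the zero, but it treats every unseen word alike, regardless of how close its meaning is to a word seen in training.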