What will happen if you do not convert all characters to a single case (either lower or upper) during the pre-processing step of an NLP algorithm?

When words are not converted to a single case, the vocabulary grows considerably, because pairs such as Up/up, Fast/fast, or This/this are treated as distinct tokens. That is usually not the desired behaviour for an NLP task.
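To make the vocabulary effect concrete, here is a minimal sketch. The toy corpus, the whitespace tokenizer, and the `vocabulary` helper are assumptions made for illustration only; a real pipeline would use a proper tokenizer.

```python
def vocabulary(texts, lowercase):
    """Return the set of distinct whitespace-separated tokens in texts."""
    vocab = set()
    for text in texts:
        if lowercase:
            text = text.lower()
        vocab.update(text.split())
    return vocab


# Hypothetical toy corpus in which Up/up, Fast/fast and This/this all occur.
corpus = [
    "This car is fast",
    "Fast cars speed up",
    "Up the hill this car goes",
]

cased = vocabulary(corpus, lowercase=False)
folded = vocabulary(corpus, lowercase=True)

print(len(cased), len(folded))  # 13 10
print(sorted(cased - folded))   # ['Fast', 'This', 'Up']
```

Even on three short sentences, the cased vocabulary carries three extra entries that differ from existing ones only in capitalization.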

Sparsity is also higher when building a language model, since "the cat" is treated differently from "The cat". If we are building an n-gram model, for example, we may end up with many n-grams in the test set that never appeared in the training set.
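The sparsity problem can be sketched the same way by counting test-set bigrams that were never seen in training. The sentences and the helper names `bigrams` and `unseen_bigrams` are hypothetical, and tokenization is again a plain whitespace split.

```python
def bigrams(text, lowercase):
    """Return the set of adjacent token pairs in a sentence."""
    tokens = (text.lower() if lowercase else text).split()
    return set(zip(tokens, tokens[1:]))


def unseen_bigrams(train, test, lowercase):
    """Return test bigrams that never occur in the training sentences."""
    seen = set()
    for sentence in train:
        seen |= bigrams(sentence, lowercase)
    unseen = set()
    for sentence in test:
        unseen |= bigrams(sentence, lowercase) - seen
    return unseen


# Hypothetical data: "The cat" appears in training, "the cat" in testing.
train = ["The cat sat on the mat"]
test = ["I saw the cat", "The mat is red"]

unseen_cased = unseen_bigrams(train, test, lowercase=False)
unseen_folded = unseen_bigrams(train, test, lowercase=True)

print(len(unseen_cased), len(unseen_folded))  # 6 4
```

Without case folding, ("the", "cat") and ("The", "mat") count as unseen even though the same word pairs occur in training with different capitalization. After lowercasing, both are recognized.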

Note: There are some situations where we do not want to normalize case! For instance, in tasks such as sentiment analysis (“TERRIBLE” probably sounds worse than “terrible”) or text generation, case can play an important role, and it may be best not to normalize it.

