- A common pre-processing step is to normalize or rescale inputs so that their values are neither too large nor too small in magnitude.
However, even on normalized inputs, overflows and underflows can occur:
- Underflow: Many probabilistic algorithms compute a joint probability by multiplying the probabilities of individual data points. Each factor is less than 1, so the product shrinks quickly and can underflow. Example: suppose you have 1000 data points, each with probability around 0.8. Then 0.8 ^ 1000 = 1.2302319e-97, which is too small to represent in single precision (float32) and gets rounded to 0. With more data points the product falls out of double-precision range as well. This is underflow.
A common way to combat this is to work in log-probability space, where products of probabilities become sums of log probabilities: http://blog.smola.org/post/987977550/log-probabilities-semirings-and-floating-point
- Overflow: Imagine you have a deep network. Error gradients can keep accumulating across layers and often become very large. Once they exceed the floating-point range they overflow to infinity, and further arithmetic on them turns the values into NaN. Weight regularization and gradient clipping are some common ways of dealing with this problem.
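
The two remedies mentioned above, log-space probabilities for underflow and gradient clipping for overflow, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation. The count of 10,000 data points is chosen only so that the underflow also shows up in Python's double-precision floats, and `clip_by_global_norm` with its `max_norm` parameter is a hypothetical helper name, not any particular library's API.

```python
import math

# Assumption for illustration: 10,000 independent data points, each with
# probability 0.8 (ten times more points than the example in the text, so
# the effect is visible even in float64).
n_points = 10_000
p = 0.8

# Naive joint probability: the true value is about 1e-969, far below the
# smallest positive float64, so the result underflows to 0.0.
naive_joint = p ** n_points

# Log-space joint probability: the product becomes a sum of logs, which
# stays well inside the representable range.
log_joint = math.fsum(math.log(p) for _ in range(n_points))


def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient values so their L2 norm is at most max_norm.

    Hypothetical helper: gradients are a flat list of floats here.
    """
    norm = math.sqrt(math.fsum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # already small enough, return an unchanged copy
    scale = max_norm / norm
    return [g * scale for g in grads]


clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)  # input norm is 5.0

print(naive_joint)  # underflowed joint probability
print(log_joint)    # same quantity, kept in log space
print(clipped)      # direction preserved, norm reduced to about 1.0
```

Two design notes. The log-space value can be compared between models or maximized directly without ever converting back to a probability. Clipping only helps if it is applied before the gradients overflow: once a value is already infinite or NaN, rescaling cannot recover it.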