If you don’t have a stop-word dictionary or are working on a new language, what approach would you take to remove stop words?


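TF-IDF (term frequency-inverse document frequency) is a popular approach that can be leveraged to eliminate stop words. Because it relies only on how words are distributed across documents, not on a predefined word list, the technique is language independent.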

The intuition here is that words that occur in almost all documents are stop words, and such words receive an inverse document frequency (IDF) close to zero. On the other hand, words that occur commonly, but only in some of the documents, are useful words that help tell documents apart.
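As a rough illustration of this intuition, here is a minimal sketch that flags words with a very low IDF as stop-word candidates. The function name `find_stop_word_candidates`, the `max_idf` threshold, and the toy corpus are all assumptions made for the example. It also assumes whitespace tokenization, which would need to be replaced for languages that do not separate words with spaces.

```python
import math
from collections import Counter

def find_stop_word_candidates(documents, max_idf=0.2):
    """Return words whose IDF is at or below max_idf (a hypothetical threshold)."""
    # Assumption: lowercasing and splitting on whitespace is a good enough tokenizer.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: the number of documents that contain each word.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # IDF = log(N / df); a word that appears in every document scores 0.
    idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}
    return sorted(word for word, score in idf.items() if score <= max_idf)

# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang in the tree",
    "the fish swam in the pond",
]
print(find_stop_word_candidates(docs))
```

On a real corpus the threshold would have to be tuned, and the resulting list is best treated as a set of candidates to review rather than a final stop-word list.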

Take a look at the following question for more details:

What is the formula for tf.idf? Why do we use ‘log’ in the idf formula?

