

We can use Python for many text preprocessing operations. For spam filtering we may follow all of the steps described in this article, but for a language translation problem we may not.
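As a rough illustration of choosing steps per application, the sketch below uses only the Python standard library. The function names (`tokenize`, `remove_stop_words`, `preprocess`), the tiny stop-word set, and the `drop_stop_words` flag are all assumptions made for this example and do not come from any library; a real project would more likely rely on an NLP library for these steps.

```python
import re

# A deliberately tiny, illustrative stop-word set (an assumption for this
# sketch; real stop-word lists are much longer).
STOP_WORDS = {"the", "is", "a", "an", "to", "and", "of"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess(text, drop_stop_words=True):
    """Tokenize, then optionally remove stop words."""
    tokens = tokenize(text)
    if drop_stop_words:
        tokens = remove_stop_words(tokens)
    return tokens


message = "The offer is valid today"

# Spam filtering: apply every step in this sketch.
spam_tokens = preprocess(message)

# Translation: skip stop-word removal, on the assumption that frequent
# words still matter for producing a grammatical translation.
translation_tokens = preprocess(message, drop_stop_words=False)

print(spam_tokens)
print(translation_tokens)
```

Keeping each step as a separate function and switching it on or off with a flag is one simple way to reuse the same code for different use cases.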

These applications deal with huge amounts of text to perform classification or translation, which involves a lot of work on the back end. Transforming text into something an algorithm can digest is a complicated process. In this article, we will discuss the steps involved in text processing.

Tokenization - convert sentences to words.
Removing stop words - remove frequent words such as “the”, “is”, etc.
Stemming - words are reduced to a root by removing inflection through dropping unnecessary characters, usually a suffix.
Lemmatization - another approach to removing inflection, by determining the part of speech and utilizing a detailed database of the language.

The stemmed form of studies is: studi
The stemmed form of studying is: study
The lemmatized form of studies is: study
The lemmatized form of studying is: study

Thus stemming and lemmatization help reduce words like ‘studies’ and ‘studying’ to a common base form or root word, ‘study’ (a stem such as ‘studi’ need not itself be a dictionary word). For a detailed discussion on stemming and lemmatization, refer here.

Note that not all the steps are mandatory; which ones to apply depends on the application use case.
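To make the difference between stemming and lemmatization concrete, here is a minimal sketch in plain Python. It is not the stemmer or lemmatizer that produced the outputs quoted in this article: `toy_stem` applies three hypothetical suffix rules, and `toy_lemmatize` looks words up in a two-entry dictionary that stands in for a real language database. Both names and both tables are assumptions made for illustration.

```python
# Toy suffix rules, checked in order (an assumption for this sketch; real
# stemmers use many more rules and conditions).
SUFFIXES = ("ing", "es", "s")

# Toy lemma dictionary standing in for a real lexical database.
LEMMAS = {"studies": "study", "studying": "study"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)


for word in ("studies", "studying"):
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The stemmer only chops characters, so it can return a non-word such as "studi", while the lookup-based lemmatizer returns a dictionary form but only for words it knows about.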
