Noise can be defined as any difference between the surface form of an electronic text and the original, intended, or actual text. Text in the short message service (SMS) and on online platforms such as Twitter, chat, discussion boards, and social networking sites is often distorted, mainly because recipients can readily understand shortened forms of longer words and because shortening saves the sender time and effort. Most text is created and stored for humans to read, so it is not always easy for a computer to process.
With the growth of noisy text data generated in social communication media, cleaning such text has become necessary, because off-the-shelf NLP techniques generally fail on it for reasons such as sparsity, out-of-vocabulary words, and irregular syntactic structures.
A few of the cleaning techniques are:
- Removing stop words (deleting very common words like "a", "the", "and", etc.).
- Stemming (reducing words to their linguistic root or stem, so that words sharing the same root are treated as one term).
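
The two techniques listed above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the `STOP_WORDS` set and the `SUFFIXES` tuple are tiny hypothetical examples, and `naive_stem` is a crude suffix stripper written for this sketch, not an established algorithm such as the Porter stemmer.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "are"}

# Suffixes removed by the toy stemmer, checked in this order (longest first).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def clean(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


print(clean("The users are posting and reposting the messages"))
```

With this sketch, "posting" and "posted" both reduce to "post", so they count as the same term. The stems are not always dictionary words ("messages" becomes "messag"), which is typical of stemming and usually acceptable when the goal is grouping related words rather than producing readable text.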