Text Classification: A Deep Learning Methodology for Text Analytics

Poojadurai
6 min read · May 19, 2021

Natural Language Processing (NLP) is a broad area within Artificial Intelligence. It serves several purposes, such as topic modelling, text classification, Named Entity Recognition (NER), and question answering. Text classification is one of the most commonly used techniques in practice because of its wide range of applications. Text is an extremely rich source of information, but extracting insights from it can be challenging and time-consuming because of its unstructured nature. Text classification can be performed either through manual annotation or through automatic labelling, and with the growing scale of text data in industrial applications, automatic text classification is becoming increasingly important. Some of its applications are spam detection, sentiment analysis and news categorization.

Let us take an example to understand text classification better. Assume person X has joined the U.S. Postal Service Office as a new analyst. He has been asked to analyze statements made by political figures to help undecided voters make a choice at election time. He has been given a chunk of old data that has been annotated by his trusted peers with the labels true, mostly true, half true, barely true, false and pants on fire. Now they want him to experiment on this dataset and predict the nature of new statements. This is where a text classification algorithm comes into the picture. This scenario is an example of multi-class text classification (closer to fact checking than to sentiment analysis, since the labels grade truthfulness rather than opinion). The analysis is feasible because there tend to be patterns in the statements made by political figures, and those patterns can be learned from the old, annotated data.

Deep learning-based models have surpassed classical machine learning approaches in various text classification tasks. Here are a few commonly used deep learning approaches:

· Feed-forward networks view text as a bag of words (a minimal sketch of this approach follows the list)

· RNN-based models view text as a sequence of words, and are intended to capture word dependencies and text structures

· CNN-based models are trained to recognize patterns in text, such as key phrases

· The attention mechanism is effective at identifying correlated words in text

· Memory-augmented networks combine neural networks with a form of external memory, which the models can read from and write to

· Graph neural networks are designed to capture internal graph structures of natural language, such as syntactic and semantic parse trees

· Siamese Neural Networks are designed for text matching

· Hybrid models combine attention, RNNs, CNNs, etc. to capture local and global features of sentences

· Transformers allow for much more parallelization than RNNs, making it possible to efficiently train very large language models on GPUs
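As a concrete illustration of the first approach, here is a minimal sketch (assuming PyTorch is installed) of a feed-forward network over a bag-of-words representation. The vocabulary size, hidden size and the six truthfulness labels from the earlier example are illustrative choices, not fixed by any particular paper.

```python
import torch
import torch.nn as nn

class BagOfWordsClassifier(nn.Module):
    """Feed-forward classifier that treats a document as a bag of words."""

    def __init__(self, vocab_size: int, hidden_size: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vocab_size, hidden_size),   # project word counts to a dense space
            nn.ReLU(),
            nn.Linear(hidden_size, num_classes),  # one logit per label
        )

    def forward(self, bow: torch.Tensor) -> torch.Tensor:
        # bow: (batch, vocab_size) word-count vectors; word order is ignored,
        # which is exactly the "bag of words" view of the text.
        return self.net(bow)

# Illustrative sizes: 10,000-word vocabulary, six labels (true ... pants on fire).
model = BagOfWordsClassifier(vocab_size=10_000, hidden_size=128, num_classes=6)
logits = model(torch.rand(4, 10_000))  # a dummy batch of 4 documents
print(logits.shape)                    # torch.Size([4, 6])
```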

The first and foremost step in building a text classifier is dataset collection. Since the applications vary, so do the datasets needed to train the models. Common datasets used for sentiment analysis are Yelp, IMDb, SST and Amazon. Common datasets used for news classification are AG News, Sogou News and 20 Newsgroups. Common topic classification datasets are PubMed and DBpedia.

The step after data collection is preparing the data into a form suitable for training the classifier. These standard techniques are commonly used in textual data preparation (a code sketch follows the list):

Tokenization: Splitting strings into tokens representing the lexical units

Lowercasing: Conversion of the entire text to lowercase (words with different cases map to the same lowercase form)

Punctuation removal: Removal of all punctuation marks within the sentence

Stop word removal: Removal of words with minimal information value, e.g., conjunctions, prepositions and articles
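Here is a minimal sketch of those four steps using the NLTK library (an assumption on my part; any tokenizer and stop word list would do). The sample sentence is illustrative.

```python
import string

import nltk
from nltk.corpus import stopwords

# First run only: fetch the tokenizer model and the stop word list.
# (Newer NLTK releases may additionally require the "punkt_tab" resource.)
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

def prepare(text: str) -> list[str]:
    tokens = nltk.word_tokenize(text)                            # tokenization
    tokens = [t.lower() for t in tokens]                         # lowercasing
    tokens = [t for t in tokens if t not in string.punctuation]  # punctuation removal
    stops = set(stopwords.words("english"))
    return [t for t in tokens if t not in stops]                 # stop word removal

print(prepare("The senator said the economy is growing, and it is!"))
# e.g. ['senator', 'said', 'economy', 'growing']
```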

The feature representation of text documents is usually built using word embeddings. Some word embedding methods are:

• word2vec: provides direct access to vector representations of words. It comes in two variants, both shallow neural networks: the Continuous Bag of Words (CBOW) model and the Skip-gram model (see the sketch after this list)
• GloVe (Global Vectors): computes vector representations of words from the global word-word co-occurrence statistics of a corpus
• fastText: a library created by Facebook’s research team for learning word representations and for sentence classification. Its principle is to assign a vector representation to the character n-grams that make up individual words
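The following sketch trains a toy word2vec model with the gensim library (an assumed choice; word2vec has many implementations). The three-sentence corpus is purely illustrative; real use would train on millions of sentences.

```python
from gensim.models import Word2Vec

corpus = [
    ["the", "senator", "voted", "for", "the", "bill"],
    ["the", "governor", "voted", "against", "the", "bill"],
    ["taxes", "were", "cut", "last", "year"],
]

# sg=0 selects the CBOW architecture; sg=1 would select Skip-gram.
model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=0)

vec = model.wv["senator"]      # direct access to a word's vector
print(vec.shape)               # (50,)
print(model.wv.most_similar("voted", topn=2))
```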

The next step is to find a suitable model for text classification. Large pretrained language models are definitely the main trend in the latest research advances in NLP. Among the models topping NLP leaderboards are XLNet from Carnegie Mellon University, ERNIE 2.0 from Baidu and RoBERTa from Facebook AI.

XLNet learns language representations using a generalized autoregressive pretraining method. It has been trained on a large corpus with a permutation language modelling objective: rather than predicting tokens strictly left to right, it maximizes the expected likelihood of a sequence over permutations of the factorization order, so each token learns from context on both sides. It has surpassed BERT on various NLP tasks.

Baidu released ERNIE 2.0, a continual pre-training framework for natural language processing. ERNIE stands for Enhanced Representation through kNowledge IntEgration. Baidu claims in its research paper that ERNIE 2.0 outperforms BERT and the more recent XLNet on 16 NLP tasks in Chinese and English. Continual learning aims to train the model on several tasks in sequence so that it remembers the previously learned tasks when learning new ones.


RoBERTa stands for Robustly Optimized BERT Pretraining Approach. It builds on Google’s BERT model, released in 2018, and modifies key hyperparameters: it removes the next-sentence prediction pretraining objective and trains with much larger mini-batches and learning rates.
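In practice, such pretrained models are most easily applied through a library. Below is a minimal sketch using the Hugging Face transformers library (an assumption on my part; the post does not prescribe tooling). Called this way, pipeline() downloads a default checkpoint for the task; passing a model argument would select a specific one, e.g. a fine-tuned RoBERTa.

```python
from transformers import pipeline

# Downloads a default pretrained checkpoint for the task on first use.
classifier = pipeline("sentiment-analysis")

print(classifier("The economy has grown every quarter this year."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```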

A set of metrics is commonly used for evaluating the performance of text classification models:

· Accuracy and error rate. These are the primary metrics for evaluating the quality of a classification model. Let TP, FP, TN and FN denote true positives, false positives, true negatives and false negatives, respectively; then accuracy = (TP + TN) / (TP + TN + FP + FN), and the error rate is 1 − accuracy

· Precision / recall / F1 score. These are also primary metrics, and are used more often than accuracy or error rate on imbalanced test sets. Precision = TP / (TP + FP) and recall = TP / (TP + FN); the F1 score is their harmonic mean, 2 × precision × recall / (precision + recall). An F1 score reaches its best value at 1 and its worst at 0

· Exact Match (EM). The exact match metric is a popular metric for question-answering systems, which measures the percentage of predictions that match any one of the ground-truth answers exactly

Other widely used metrics include Mean Average Precision (MAP), Area Under the Curve (AUC), false discovery rate and false omission rate, to name a few.
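For completeness, here is a minimal sketch computing the accuracy, error rate, precision, recall and F1 score with scikit-learn (an assumed dependency); the label vectors are made-up binary predictions.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # ground-truth labels (illustrative)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions (illustrative)

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
print(f"accuracy={accuracy:.2f}  error_rate={1 - accuracy:.2f}")
print(f"precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")
```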

Thus, we can conclude that deep learning methods are proving very good at text classification, achieving state-of-the-art results on a suite of standard academic benchmark problems. Stay tuned for more blogs!
