Natural Language Processing, or NLP for short, is a combination of the fields of linguistics and computer science. It can be defined as the set of methods computers use to try to understand the natural language of humans and interact with them. Most NLP techniques nowadays depend on machine learning algorithms. The data to be analyzed by the machine comes in one of two forms:
Video/Audio (e.g. a human talking to a machine)
Text (e.g. articles)
Although NLP has been one of the hottest topics in computer science over the last couple of decades, algorithms still face many difficulties in understanding the rules of natural languages and inferring the intent behind what a person is saying, something we as humans sometimes struggle with too. For example, people might still get confused by sarcasm in texts.
One of the problems NLP algorithms encounter is the ambiguous and imprecise nature of natural languages; sarcasm and idioms are excellent examples of this. Humans can detect sarcasm from a text’s context, but this task is not particularly easy for a machine: giving an appropriate response to a sentence that means something entirely different from what it literally says is hard. Another problem is understanding the tenses of sentences (present, past, future), which play an essential role in defining the intent of the person using them.
Text classification is one of the most important applications of NLP nowadays. It can be used for sentiment analysis (binary text classification) or its big brother, emotion detection (multi-class classification). We will be using emotion detection as the example in this article.
A good dataset to use for this task is the "Emotion Intensity in Tweets" dataset from the WASSA 2017 shared task.
In this dataset, we have 4 different files representing 4 different emotions: anger, fear, joy, and sadness.
So let’s start with our task. A good way to ensure that you have a clean environment, with all dependency versions satisfied, is to use conda environments. You can download Anaconda and the conda CLI from this link (https://www.anaconda.com/products/individual). After you follow the installation instructions, you will be able to use the conda CLI.
Now we can create a tensorflow environment using conda that contains all the libraries and files we need to start using tensorflow and keras. There are two types of tensorflow environments:
tensorflow: runs tensorflow on CPU
tensorflow-gpu: runs tensorflow on GPU
We will create a tensorflow-gpu environment so we can have better performance using the following command:
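Assuming Anaconda is installed, the command looks like this (tf_gpu is simply the name we chose):

```shell
# Create a conda environment with the GPU build of tensorflow.
conda create --name tf_gpu tensorflow-gpu
```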
tf_gpu is the name tag for the created environment. We now need to activate the environment, so we run:
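```shell
# Switch the shell to the environment we just created.
conda activate tf_gpu
```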
Now you can start using tensorflow and keras on your GPU without worrying about version dependencies.
After installing the environment, we now need to load our data from the files, and we will be using the pandas library to do that. We install the library using pip:
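```shell
pip install pandas
```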
We will load a file to a dataframe, which makes it easier to manipulate columns and rows if we need to. This is done as follows:
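A sketch of the loading step; the tiny file written below is only a stand-in for the dataset's fear file, so the path and column names are assumptions you should adapt:

```python
import pandas as pd

# Write a tiny stand-in file; in practice this would already exist
# as the fear file from the WASSA 2017 dataset.
with open("fear.csv", "w") as f:
    f.write("id,text,value\n1,so scared right now,0.8\n2,,0.5\n")

cols = ["text", "value"]                       # columns we want to load
fear = pd.read_csv("fear.csv", sep=",", usecols=cols)
fear = fear.dropna()                           # drop rows with missing entries
print(fear.shape)
```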
We load the data from the csv file using the read_csv function in pandas. The first argument is the file path; the second argument identifies the separator between columns, and since the file is a csv, the columns are separated by the ',' character. The usecols argument is an array of the column names you would like to load. So if you have the columns (id, name, text, value) and you only want to load the text and value columns, cols will be ["text","value"]. fear.dropna() simply drops the rows that have missing entries. We use the same block to load the three other files.
Now that we have loaded our data, we need to perform some text preprocessing to remove unwanted words and characters from our text. For this part, we will be using the nltk library, which is a great library for text manipulation. First we install nltk using pip:
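```shell
pip install nltk
```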
There are a few basic text preprocessing steps, and they go as follows.
Tokenization
Tokenization is the process of cutting a sentence into tokens (words). So if you have the sentence "I love playing football", the result of the tokenization process will be the array ["I","love","playing","football"]. This can be done using
nltk as follows:
Removing stop words
Stop words are a set of commonly used words in any language. For example, in English, "the", "is" and "and" would easily qualify as stop words. Despite their abundance in a sentence, these words do not contribute useful information to an NLP task, so removing them decreases the size of the sentence, which in turn decreases the processing time needed for the task. This can also be done using nltk:
Stemming and Lemmatization
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
Stemming: reduces the word to its stem. For example, the word "cooking" becomes "cook" and the word "Played" becomes "play".
Lemmatization: returns the word to its lemma. For example, the word "is" becomes "be".
Lemmatization requires more computational power, since it needs to know the part-of-speech tags of the words and involves more work. Both processes can also be done using nltk.
After all of the previous preprocessing, there might be more cleanup to do or more words you need to remove. For example, if our data comes from Twitter, we might run into a sentence like "@felix I saw your post" and want to remove the mention from the sentence. We can do that using the re library in Python; you also need to know how to express what you want to match as a regular expression. The following is an example of unwanted text being removed.
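A minimal sketch that strips Twitter-style mentions; the pattern here is a simple assumption (an "@" followed by word characters), not the only way to do it:

```python
import re

tweet = "@felix I saw your post"
# Remove mentions: "@" followed by word characters, plus trailing spaces.
cleaned = re.sub(r"@\w+\s*", "", tweet).strip()
print(cleaned)  # I saw your post
```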
Now that we have done our preprocessing and cleaned our sentences, we need to pass these sentences and their labels to our model and start training.
To build our model we will be using keras. Keras is an open-source neural network library written in Python. It can run on top of multiple frameworks, such as tensorflow and pytorch. We will be using tensorflow as our backend.
There are two types of neural networks that are mainly used in text classification tasks: CNNs and LSTMs. Most models consist of one of them or a combination of both; we will be using a combination of both here. The following diagram shows our model.
This model consists of:
Embedding layer: responsible for the word embeddings (we will be using the spacy library for these).
Spatial dropout: decreases the number of features we train on.
Leaky ReLU: so we do not have to deal with dead ReLUs.
Max pooling: focuses on the most important features only.
BLSTM (bi-directional LSTM): a variant of the LSTM that uses two LSTMs, one forward and one backward.
Softmax layer: the classification layer.
Now we need to build a list of our entire vocabulary, and we need to change the texts into numbers that represent them (embedding layers deal only with numbers, not text).
A good tool for this is keras’s Tokenizer class; it has all the tools you might need for this task.
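A minimal sketch with a toy corpus (the sentences are made up, and num_words=10000 is an arbitrary cap, not a recommendation):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Toy corpus standing in for our preprocessed tweets.
texts = ["i love playing football",
         "fear of the dark",
         "love is all you need"]

tokenizer = Tokenizer(num_words=10000)   # cap on the vocabulary size
tokenizer.fit_on_texts(texts)

# word_index maps each word in the corpus to a unique integer index.
word_index = tokenizer.word_index
print(len(word_index))
```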
num_words is the maximum number of words you want in the vocabulary list; a good value differs depending on your task and your exploratory data analysis. texts is a list containing all of our sentences. word_index is a dictionary mapping each word in the vocabulary to its integer index. The tokenizer variable should never be changed or re-initialized from here on, because the word_index mapping and the tokenizer's internal state need to be consistent throughout the process. We can now create an embedding matrix.
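A sketch of building the embedding matrix; the stand-in vectors and the tiny word_index below are placeholders, since in the article the real vectors come from spacy (e.g. nlp(word).vector):

```python
import numpy as np

EMBEDDING_DIM = 4   # made-up size; real spacy vectors are much larger

# Stand-in pretrained vectors replacing spacy's real ones.
pretrained = {"love": np.ones(EMBEDDING_DIM),
              "fear": np.full(EMBEDDING_DIM, 0.5)}

# A tiny word_index as produced by the fitted keras Tokenizer.
word_index = {"love": 1, "fear": 2, "football": 3}

# Row i holds the vector of the word with index i; row 0 is left as
# zeros for padding, as are words without a pretrained vector.
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    vector = pretrained.get(word)
    if vector is not None:
        embedding_matrix[i] = vector
print(embedding_matrix.shape)  # (4, 4)
```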
Now that we have our embeddings we can start building our model using keras.
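A sketch of the architecture described above; the layer sizes, dropout rate, and the random matrix standing in for the real spacy embedding matrix are all assumptions, not tuned values:

```python
import numpy as np
from tensorflow.keras.initializers import Constant
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, SpatialDropout1D, Conv1D,
                                     LeakyReLU, MaxPooling1D, Bidirectional,
                                     LSTM, Dense)

VOCAB_SIZE = 1000      # len(word_index) + 1 in practice
EMBEDDING_DIM = 32     # made-up size; spacy vectors are larger
NUM_CLASSES = 4        # one class per emotion

# Random matrix standing in for the real embedding matrix built earlier.
embedding_matrix = np.random.rand(VOCAB_SIZE, EMBEDDING_DIM)

model = Sequential([
    Embedding(VOCAB_SIZE, EMBEDDING_DIM,
              embeddings_initializer=Constant(embedding_matrix),
              trainable=False),                # frozen pretrained embeddings
    SpatialDropout1D(0.2),                     # drops whole feature maps
    Conv1D(64, 5),                             # linear convolution ...
    LeakyReLU(),                               # ... followed by Leaky ReLU
    MaxPooling1D(pool_size=2),                 # keep the strongest features
    Bidirectional(LSTM(64)),                   # forward + backward LSTM
    Dense(NUM_CLASSES, activation="softmax"),  # classification layer
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
model.build(input_shape=(None, 50))            # 50 = assumed max sequence length
```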
Now that we have built our model, it’s time to fit it to our data and start training. Before training, we need to convert our labels into one-hot vectors and split our data into training and test sets. We can use the train_test_split function from the sklearn library to split the data, and the to_categorical function from keras to convert the labels.
Now we start fitting
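A runnable end-to-end sketch on toy data; the sentences, label ids, sequence length, and the scaled-down model are all placeholders for your real pipeline:

```python
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

MAX_SEQUENCE_LENGTH = 10   # in practice: length of the longest sequence
NUM_CLASSES = 4

# Toy sentences and labels standing in for the preprocessed tweets.
texts = ["i am so scared", "what a joyful day", "this makes me angry",
         "i feel sad today", "pure joy", "full of fear"]
labels = [1, 2, 0, 3, 2, 1]   # e.g. 0=anger, 1=fear, 2=joy, 3=sadness

# The same tokenizer fitted earlier; it must stay unchanged from here on.
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
X = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
y = to_categorical(labels, num_classes=NUM_CLASSES)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# A scaled-down version of the model so the sketch runs quickly.
model = Sequential([
    Embedding(1000, 16),
    Bidirectional(LSTM(8)),
    Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.fit(X_train, y_train, epochs=1, batch_size=2,
          validation_data=(X_test, y_test), verbose=0)
```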
We are using the same tokenizer we used previously to transform the texts from sequences of words into sequences of numbers; that’s why we needed to keep it consistent throughout the process. We are using the pad_sequences function from tensorflow.keras.preprocessing.sequence because the model needs a constant sequence length, so the value of MAX_SEQUENCE_LENGTH should be the length of the longest sequence in your dataset. After the model finishes training, we can test it on some texts to see how well it performs.
Congratulations!!! You have now built a multi-class text classifier, and you can use it to predict whatever classes you want for your texts.
About the author
AI Engineer at Design AI
Omar Elbadrawi is part of the AI Engineering team of Design AI, a start-up focusing on agile AI development and use case identification through Design Thinking. He holds an M.Sc. in Data Engineering and Analytics from the Technical University of Munich. Within Design AI, he is currently team lead for a research project on Graph Neural Networks.