Step by step building a multi-class text classification model with Keras

Originally published on Medium

NLP

Natural Language Processing, or NLP for short, combines the fields of linguistics and computer science. It can be defined as the set of methods computers use to try to understand the natural language of humans and to interact with them. Most NLP techniques nowadays depend on machine learning algorithms. The data to be analyzed by the machine usually comes in one of two forms:

  • Video/Audio (e.g. a human talking to a machine)
  • Text (e.g. articles)

Difficulties

Although NLP has been one of the hottest topics in computer science over the last couple of decades, algorithms still face many difficulties in understanding the rules of natural languages and the intent behind what a person is saying, which we as humans sometimes struggle with as well. For example, people might still get confused by sarcasm in texts.

One of the problems that NLP algorithms encounter is the ambiguity and imprecision of natural languages; sarcasm and idioms are excellent examples of this. Humans can detect sarcasm from the text’s context, but this task is not particularly easy for a machine. Giving an appropriate response to a sentence whose meaning is entirely different from its literal wording is not an easy task. Another problem is understanding the tense of a sentence (present, past, future), which plays an essential role in defining the intent of the person using it.

Text classification

Text classification is one of the most important applications of NLP nowadays. It can be used for sentiment analysis (binary text classification) or its big brother, emotion detection (multi-class classification). We will be using emotion detection as an example in this article.

A good dataset to use for this task is the "Emotion Intensity in Tweets" dataset from the WASSA 2017 shared task.

In this dataset, we have 4 different files representing 4 different emotions:

  • Fear
  • Anger
  • Joy
  • Sadness

So let’s start with our task. A good way to ensure that you have a good environment, with all dependency versions satisfied, is using conda environments. You can download Anaconda and the conda cli using this link (https://www.anaconda.com/products/individual). After you follow the installation instructions, you will be able to use the conda cli.

Now we can install a tensorflow environment using conda that contains all the libraries and files that we need to start using tensorflow and keras. There are two types of tensorflow environments:

  • tensorflow: runs tensorflow on CPU
  • tensorflow-gpu: runs tensorflow on GPU

We will create a tensorflow-gpu environment so we can have better performance using the following command:

conda create --name tf_gpu tensorflow-gpu

tf_gpu is the name of the created environment. We now need to activate it, so we run:

conda activate tf_gpu

Now we can start using tensorflow and keras on our GPU without worrying about version dependencies.

After installing the environment, we need to load our data from the files, and we will be using the pandas library to do that. We will install the library using pip:

pip install pandas

We will load a file to a dataframe, which makes it easier to manipulate columns and rows if we need to. This is done as follows:

import pandas as pd

# Names of the columns to load; adjust them to the header of your file
cols = ["text", "value"]
fear = pd.read_csv("fear.csv", sep=',', usecols=cols)
fear = fear.dropna()

We load the data from the csv file using the read_csv function in pandas. The first argument is the file path. The second argument identifies the separator between columns; since the file is a csv file, the columns are separated by the ',' character. The usecols argument is a list of the column names that you would like to load. So if you have the columns (id, name, text, value) and you only want to load the text and value columns, cols will be ["text", "value"]. fear.dropna() drops the rows that have missing entries. We will use the same block to load the three other files.
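The same loading block can be wrapped in a loop so that all four files end up in one dataframe with a numeric label per emotion. The following is a minimal sketch rather than the original code: the file contents are simulated with in-memory strings so that the example runs on its own, and the column names, the sample rows and the label order (fear=0, anger=1, joy=2, sadness=3) are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-ins for fear.csv, anger.csv, joy.csv and sadness.csv.
# With real files, pass the file path to read_csv instead of a StringIO object.
fake_files = {
    "fear": "id,text,value\n1,I am scared of the dark,0.8\n2,,0.5\n",
    "anger": "id,text,value\n3,This makes me furious,0.9\n",
    "joy": "id,text,value\n4,What a wonderful day,0.7\n",
    "sadness": "id,text,value\n5,I miss my old friends,0.6\n",
}
cols = ["text", "value"]  # assumed column names

frames = []
for label, (emotion, content) in enumerate(fake_files.items()):
    # Same steps as for a single file: load the columns, drop incomplete rows
    frame = pd.read_csv(io.StringIO(content), sep=",", usecols=cols).dropna()
    frame["emotion"] = emotion
    frame["label"] = label  # assumed order: fear=0, anger=1, joy=2, sadness=3
    frames.append(frame)

data = pd.concat(frames, ignore_index=True)
texts = data["text"].tolist()
labels = data["label"].tolist()
print(texts)
print(labels)
```

The row with the missing text is dropped by dropna(), so four sentences and four labels remain. The texts and labels lists are the kind of input the later steps expect.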

Now that we have loaded our data, we need to perform some text preprocessing to remove unwanted words and characters from our text. For this part, we will be using the nltk library, which is a great library for text manipulation. First we install nltk using pip:

pip install nltk

The basic steps of text preprocessing are as follows.

Tokenization

Tokenization is the process of cutting the sentence into tokens (words). So if you have the sentence "I love playing football", the result of the tokenization process will be the list ["I", "love", "playing", "football"]. This can be done using nltk as follows (word_tokenize needs the Punkt tokenizer data, which can be fetched with nltk.download):

import nltk

tokens = nltk.word_tokenize(text)
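Since our data consists of tweets, it is worth knowing that nltk also ships a rule-based TweetTokenizer that needs no downloaded data. The sketch below swaps it in for word_tokenize purely as an illustration; the option values are our own choice, not something the dataset requires.

```python
from nltk.tokenize import TweetTokenizer

# strip_handles removes @mentions; reduce_len shortens long character runs
# such as "sooooo"; preserve_case=True keeps the original capitalisation
tweet_tokenizer = TweetTokenizer(preserve_case=True, strip_handles=True, reduce_len=True)

tokens = tweet_tokenizer.tokenize("I love playing football")
print(tokens)

tweet_tokens = tweet_tokenizer.tokenize("@felix I saw your post")
print(tweet_tokens)
```

The first call reproduces the token list from the example sentence, and the second shows the mention being dropped during tokenization.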

Removing stop words

Stop words are a set of commonly used words in a language. For example, in English, "the", "is" and "and" would easily qualify as stop words. Because these words are so abundant, they contribute little useful information to an NLP task, so removing them decreases the size of the sentence, which in turn decreases the processing time needed for the task. This can also be done using nltk:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
# The stop word list is lowercase, so we compare lowercased tokens
clean_sentence = [w for w in tokens if w.lower() not in stop_words]
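One detail that is easy to miss is letter case: the English stop word list in nltk is all lowercase, so a capitalized "The" at the start of a sentence slips through a case-sensitive check. The sketch below shows the difference with a tiny hand-written stop word set (an assumption for illustration, so that the example runs without downloading anything).

```python
# Tiny hand-written stop word set for illustration only;
# stopwords.words('english') from nltk is much larger.
stop_words = {"the", "is", "and"}

tokens = ["The", "weather", "is", "cold", "and", "wet"]

# Case-sensitive filtering misses the capitalised "The"
case_sensitive = [w for w in tokens if w not in stop_words]

# Lowercasing each token before the lookup catches it
case_insensitive = [w for w in tokens if w.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```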

Stemming and Lemmatization

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

  • Stemming: reduces the word to its stem. For example, the word "cooking" becomes "cook" and the word "Played" becomes "play".
  • Lemmatization: returns the word to its lemma. For example, the word "is" becomes "be".

Lemmatization requires more computational power, since it needs the part-of-speech tag of each word to find the right lemma. Both processes can also be done using nltk:

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('gutenberg')

# Both tools work on single words, so we apply them token by token
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in clean_sentence]

wordnet_lemmatizer = WordNetLemmatizer()
# lemmatize() treats words as nouns by default; pass pos='v' for verbs
lemmatized = [wordnet_lemmatizer.lemmatize(w) for w in clean_sentence]
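The stemmer works on one word at a time, so it has to be applied to each token rather than to a whole sentence. As a quick check of the two stemming examples given above, here is a minimal sketch; it only uses the PorterStemmer, which needs no downloaded data, and the token list is made up for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical token list; in the pipeline this would be the cleaned tokens
tokens = ["cooking", "Played"]

# Stem word by word; the stemmer also lowercases its input by default
stems = [stemmer.stem(token) for token in tokens]
print(stems)
```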

Extra preprocessing

After all the previous preprocessing, there might be more to clean up or more words to remove. For example, if our data comes from twitter, we might run into a sentence like "@felix I saw your post" and want to remove the mention from it. We can do that with the re library in python, provided we know how to express what we want as a regular expression. The following is an example of unwanted text being removed:

import re

# Raw strings keep the backslashes intact for the regex engine
emails = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+'
websites = r'https?://\S+'
mentions = r'@[A-Za-z0-9_]+'

# Emails are removed before mentions so that the "@domain" part
# of an address is not mistaken for a mention
sentence = re.sub(emails, '', sentence)
sentence = re.sub(websites, '', sentence)
sentence = re.sub(mentions, '', sentence)
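To see this kind of cleanup work on a concrete tweet, the following sketch wraps the three substitutions in a helper and runs it on a made-up sentence. The helper name clean_tweet, the exact patterns and the extra whitespace cleanup are our own assumptions for illustration; real data usually needs patterns tuned to what actually appears in it.

```python
import re

# Illustrative patterns, compiled once so they can be reused for every tweet
EMAILS = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
WEBSITES = re.compile(r"https?://\S+")
MENTIONS = re.compile(r"@[A-Za-z0-9_]+")
WHITESPACE = re.compile(r"\s+")


def clean_tweet(sentence):
    """Remove emails, links and mentions, then collapse leftover spaces."""
    # Order matters: emails go first, otherwise the mention pattern would
    # eat the "@domain" part of an address and leave fragments behind
    for pattern in (EMAILS, WEBSITES, MENTIONS):
        sentence = pattern.sub("", sentence)
    return WHITESPACE.sub(" ", sentence).strip()


cleaned = clean_tweet(
    "@felix I saw your post https://t.co/abc123 contact me at felix@mail.com"
)
print(cleaned)
```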

Now that we have done our preprocessing and cleaned our sentences, we need to pass these sentences and their labels to our model and start training.

To build our model we will be using keras. Keras is an open-source neural network library written in Python. It can run on top of multiple frameworks like tensorflow and pytorch. We will be using tensorflow as our backend framework.

There are two types of neural networks that are mainly used in text classification tasks: CNNs and LSTMs. Most models consist of one of them or a combination of both. We will be using a combination of both here. The following diagrams show our model and its building block.

[Figure: Model]
[Figure: Building Block]

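This model consists of:

  1. Embedding layer: responsible for the word embedding (we will be using the spacy library for this).
  2. Spatial dropout: randomly drops entire feature channels during training, which reduces the number of features we train on at each step.
  3. CNN: a 1D convolution over the sequence.
  4. Leaky ReLU: the activation function, chosen so we do not have to deal with dead ReLUs.
  5. Max pooling: focuses on the important features only.
  6. BLSTM (bidirectional LSTM): a variant of the LSTM that uses two LSTMs, one reading the sequence forward and one backward.
  7. Softmax layer: the classification layer.

The word vectors for the embedding layer come from spacy, so we install it and download its large English model:

pip install spacy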

python -m spacy download en_core_web_lg

Now we need to build a vocabulary from our texts and convert the texts to numbers that represent them (embedding layers deal with numbers only, not text). A good tool for this is keras’s Tokenizer class, which has all the tools that you might need for this task.

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=100000)

tokenizer.fit_on_texts(texts)

word_index = tokenizer.word_index

num_words is the maximum number of words to keep in the vocabulary, which depends on your task and your exploratory data analysis. texts is a list containing all our sentences. word_index is a dictionary that maps each word to its integer index; indices start at 1, with the most frequent word first. Note that word_index still contains every word seen in the texts; num_words only takes effect when texts are converted to sequences. The tokenizer must not be re-created or fitted again from here on, because word_index and the tokenizer’s internal state need to stay consistent throughout the process. We can now create the embedding matrix.

import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")  # model with 300-dimensional word vectors
text_embedding = np.zeros((len(word_index) + 1, 300))
for word, i in word_index.items():
    text_embedding[i] = nlp(word).vector
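To make the tokenizer and embedding steps concrete, the following minimal sketch imitates them on three toy sentences without needing keras or spacy. It is an approximation for illustration only: build_word_index is a hypothetical helper that mimics how the Tokenizer ranks words by frequency, fake_vectors is a made-up lookup table standing in for nlp(word).vector, and the embedding size is shrunk from 300 to 4.

```python
import re
from collections import Counter

import numpy as np

WORD_PATTERN = re.compile(r"[a-z0-9']+")


def build_word_index(texts):
    """Approximate Tokenizer.fit_on_texts: most frequent word gets index 1."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_PATTERN.findall(text.lower()))
    # sorted() is stable, so words with equal counts keep first-seen order
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


texts = ["I am scared", "I am happy", "I love football"]
word_index = build_word_index(texts)
print(word_index)

# Hypothetical 4-dimensional vectors standing in for nlp(word).vector
EMBEDDING_DIM = 4
fake_vectors = {
    "i": np.array([0.1, 0.2, 0.3, 0.4]),
    "am": np.array([0.5, 0.6, 0.7, 0.8]),
}

# One row per word plus row 0, which is reserved for padding
text_embedding = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    if word in fake_vectors:  # words without a vector stay all zeros
        text_embedding[i] = fake_vectors[word]

print(text_embedding.shape)
```

The extra row explains the "+ 1" in the matrix shape: the word indices start at 1, and index 0 is the value later used to pad short sequences.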

Now that we have our embeddings we can start building our model using keras.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    Embedding, SpatialDropout1D, Conv1D, LeakyReLU,
    MaxPooling1D, Bidirectional, LSTM, Dense,
)
from tensorflow.keras import regularizers

# No values are given for these hyperparameters in this article;
# the numbers below are placeholders to tune for your data.
MAX_SEQUENCE_LENGTH = 50  # set to the length of your longest sequence
filters = 64
kernel_size = 3
lstm_units = 64

model = Sequential()
# Frozen embedding layer initialised with the pretrained word vectors
model.add(Embedding(input_dim=text_embedding.shape[0], output_dim=text_embedding.shape[1], weights=[text_embedding], input_length=MAX_SEQUENCE_LENGTH, trainable=False))

# Building block 1
model.add(SpatialDropout1D(0.5))
model.add(Conv1D(filters, kernel_size=kernel_size, kernel_regularizer=regularizers.l2(0.00001), padding='same'))
model.add(LeakyReLU(alpha=0.2))
model.add(MaxPooling1D(pool_size=2))
model.add(Bidirectional(LSTM(lstm_units, dropout=0.5, recurrent_dropout=0.5, return_sequences=True)))

# Building block 2
model.add(SpatialDropout1D(0.5))
model.add(Conv1D(filters, kernel_size=kernel_size, kernel_regularizer=regularizers.l2(0.00001), padding='same'))
model.add(LeakyReLU(alpha=0.2))
model.add(MaxPooling1D(pool_size=2))
model.add(Bidirectional(LSTM(lstm_units, dropout=0.5, recurrent_dropout=0.5, return_sequences=True)))

# Building block 3: the last BLSTM returns only its final output
model.add(SpatialDropout1D(0.5))
model.add(Conv1D(filters, kernel_size=kernel_size, kernel_regularizer=regularizers.l2(0.00001), padding='same'))
model.add(LeakyReLU(alpha=0.2))
model.add(MaxPooling1D(pool_size=2))
model.add(Bidirectional(LSTM(lstm_units, dropout=0.5, recurrent_dropout=0.5)))

# One output per emotion
model.add(Dense(4, activation='softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
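The model stacks the building block three times, and each block halves the sequence length because of MaxPooling1D(pool_size=2), while Conv1D with padding='same' and the LSTMs with return_sequences=True leave it unchanged. The small sketch below (plain Python, with a hypothetical helper name and example lengths picked for illustration) traces the length through the three blocks, which is a quick way to check that a chosen MAX_SEQUENCE_LENGTH is long enough.

```python
def length_after_pooling(sequence_length, pool_size=2, stages=3):
    """Sequence length after each Conv1D('same') + MaxPooling1D stage."""
    lengths = []
    for _ in range(stages):
        # The 'same' convolution keeps the length; pooling with the default
        # stride (equal to pool_size) divides it and drops any remainder
        sequence_length = sequence_length // pool_size
        lengths.append(sequence_length)
    return lengths


print(length_after_pooling(50))
print(length_after_pooling(6))
```

With a length of 50 the last LSTM still sees 6 time steps, whereas a length of 6 shrinks to 0 and would leave nothing for the last LSTM to read. With three poolings of size 2, the input length should therefore be at least 8.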

Now that we have built our model, it is time to fit it to our data and start training. Before training, we need to convert our labels to one-hot vectors and split our data into training and test sets. We can use the train_test_split function from the sklearn library to split the data and the to_categorical function from keras to convert the labels.

from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

categorical_labels = to_categorical(labels,num_classes=4)

X_train, X_test, Y_train, Y_test = train_test_split(texts, categorical_labels, test_size=0.2)
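The categorical_crossentropy loss expects each label as a one-hot vector, which is what to_categorical produces from integer class ids. As a sketch of what that conversion amounts to, the same result can be obtained with plain numpy; the label order used here (fear=0, anger=1, joy=2, sadness=3) is an assumption for illustration.

```python
import numpy as np

# Assumed label order, chosen for illustration: fear=0, anger=1, joy=2, sadness=3
labels = np.array([0, 2, 3, 1])

# Row i of the identity matrix is the one-hot vector for class i
categorical_labels = np.eye(4, dtype="float32")[labels]
print(categorical_labels)
```

Taking the argmax of each row recovers the original integer label, which is also how the predictions are decoded at the end.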

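Now we start fitting:

from tensorflow.keras.preprocessing.sequence import pad_sequences

# callbacks_list is not defined in this article; an empty list trains
# without callbacks (EarlyStopping or ModelCheckpoint could go here)
callbacks_list = []

model.fit(
  pad_sequences(
    tokenizer.texts_to_sequences(X_train), maxlen = MAX_SEQUENCE_LENGTH
  ),
  Y_train, batch_size = 512, epochs = 10,
  validation_data = (
    pad_sequences(
      tokenizer.texts_to_sequences(X_test), maxlen = MAX_SEQUENCE_LENGTH
    ),
    Y_test
  ),
  callbacks = callbacks_list, shuffle = True
)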
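The pad_sequences calls in the fit bring every sequence to the same length. With its default settings it pads with zeros at the front and, for sequences that are too long, keeps only the last maxlen items. The following pure-Python imitation (pad_pre is a hypothetical helper, and the sequences are made up) shows that behavior and one way to find the longest sequence.

```python
def pad_pre(sequences, maxlen, value=0):
    """Pure-Python imitation of pad_sequences with its default 'pre' settings."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep the last maxlen items (pre-truncation)
        padded.append([value] * (maxlen - len(seq)) + seq)  # pad on the left
    return padded


# Made-up integer sequences, as texts_to_sequences would return them
sequences = [[1, 2, 3], [4, 5, 6, 7, 8, 9], [10]]

longest = max(len(seq) for seq in sequences)
print(longest)

padded = pad_pre(sequences, maxlen=4)
print(padded)
```

Choosing maxlen smaller than the longest sequence, as in the second call, silently cuts off the beginning of long texts.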

We are using the same tokenizer as before to transform the texts from sequences of words to sequences of numbers, which is why we needed to keep it consistent throughout the process. We are using the pad_sequences function from tensorflow.keras.preprocessing.sequence because the model needs a constant sequence length. The value of MAX_SEQUENCE_LENGTH should therefore be the length of the longest sequence in your dataset. After the model finishes training, we can test it on some texts to see how well it performs.

result = model.predict_on_batch(pad_sequences(tokenizer.texts_to_sequences([
  "What happened 2 ur vegan food options?! At least say on ur site so i know I won't be able 2 eat anything for next 6 hrs #fail",
  "I am really scared of the future",
  "everything is great, I am doing awesome"
]), maxlen=MAX_SEQUENCE_LENGTH))

print("result: ", np.argmax(result, axis=-1), "\n")
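The printed result is a list of class indices, one per text. To turn them back into emotion names we need the same label order that was used when the labels were created. The sketch below assumes the order fear=0, anger=1, joy=2, sadness=3 and uses a made-up softmax output in place of a real prediction.

```python
import numpy as np

# Assumed label order; it must match the integers used to build the labels
EMOTIONS = ["fear", "anger", "joy", "sadness"]

# Hypothetical softmax output for three texts (each row sums to 1)
result = np.array([
    [0.10, 0.70, 0.05, 0.15],
    [0.80, 0.05, 0.05, 0.10],
    [0.05, 0.05, 0.85, 0.05],
])

# Index of the highest probability per row, mapped to its emotion name
predicted = [EMOTIONS[i] for i in np.argmax(result, axis=-1)]
# The probability of the winning class gives a rough confidence value
confidence = result.max(axis=-1)

print(predicted)
print(confidence)
```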

Congratulations!!! You have built a multi-class text classifier now and you can use it to predict whatever classes you want for your texts.


About the author

Omar Elbadrawi

AI Engineer at Design AI
Omar Elbadrawi is part of the AI Engineering team of Design AI, a start-up focusing on agile AI development and use case identification through Design Thinking. He holds a M.Sc. in Data Engineering and Analytics from the Technical University of Munich. Within Design AI, he is currently team lead for a research project on Graph Neural Networks.