Natural Language Processing - Emotion Detection with Multi-class, Multi-label Convolutional Neural Networks and Embedding

July 8, 2020

I came across a well-prepared dataset provided by Google: 58,000 'carefully curated' Reddit comments, each labeled with one or more of 27 emotions, e.g. anger, confusion, love. Google used it to train a BERT model, with varying success in emotion detection depending on the type of comment. I thought it would be a great example for learning how to apply Convolutional Neural Networks (CNNs) and embeddings to a Natural Language Processing (NLP) problem, and I obtained decent accuracy for some emotions, though of course not as good as Google's BERT.

This is not a tutorial - there are plenty of great tutorials on CNNs and embeddings out there. Rather, it is a complete code example tackling multi-class, multi-label classification, something it was surprisingly hard to find complete, free examples of.

Spoiler: my code doesn't do as well as Google's, whose code is also provided at the above link. Their well-oiled BERT solution obtains around a 46% F1 score, while I only obtain 42%. However, I didn't spend much time optimising my model, so there may easily be another 2 or 3 percent available through further tuning.

More on the GoEmotions dataset

Firstly, kudos to the Google research team (actual paper here) for such a well-prepared and cleaned dataset, which you can obtain from the link.

Each row of the dataset provides a Reddit comment ("text") and a corresponding set of emotion labels, along with some other metadata:

The data

Starting intuition

This is a complicated domain - different users display sentiment in different ways, and many of these comments are very short yet carry more than one emotion at once, e.g. "That's adorable asf". Other comments, such as "This video doesn't even show the shoes he was wearing", are completely neutral and labelled as such.

However, there are clearly words and phrases which are giveaways for certain emotions. Some, such as love or excitement, are probably quite easy to detect; subtle emotions such as pride or curiosity, less so. So I expect some success, but certainly don't expect to do better than Google. After all, these comments aren't responses to "How do you feel?" - they are seemingly arbitrary picks from Reddit.

The code, explained

The Python code obtaining a 42% F1 score on the dataset is here. Once you obtain the dataset from Google, you can run it out of the box just by changing the paths to the datasets, assuming you have all dependencies installed. Be aware, though, that training the neural network takes about 30 minutes for 15 epochs, depending on your computing power.

Below, I’ll describe each section, excluding the imports.

Fetching the data

This is standard. We fetch the data CSVs (which Google provides as three separate files), concatenate them, and choose our input vector, X, and binary output matrix, y. We then split the data into training and test sets. The dataset is large - tens of thousands of comments - so other train_test_split ratios would have worked too, but we go with the standard 80/20 split.

df1 = pd.read_csv('../../data/emotions/goemotions_1.csv')
df2 = pd.read_csv('../../data/emotions/goemotions_2.csv')
df3 = pd.read_csv('../../data/emotions/goemotions_3.csv')

frames = [df1, df2, df3]

df = pd.concat(frames)

X = df['text'].values
X = X.astype(str)
y = df.iloc[:, 9:].values  # the emotion label columns

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

After this, we have tens of thousands of mappings of comments ("I really love bitterballen!") to their associated labelled emotions (admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise), plus a neutral label.
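As a quick sanity check (hypothetical, not part of the original script), we can print which emotion columns are switched on for a single training example; example_index is just an arbitrary row to inspect:

emotion_columns = list(df.columns[9:])      # the label columns selected into y above
example_index = 0                           # an arbitrary row to inspect
active_labels = [emotion_columns[j] for j, flag in enumerate(y_train[example_index]) if flag == 1]
print(X_train[example_index], '->', active_labels)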

Tokenizing the data

In order to provide a valid input to the neural network, we need to tokenize each word in each row of the data. This means that rather than words, we'll work with a mapping to a giant dictionary of token:word pairs, e.g. the sentence "I really love bitterballen!" would tokenize to 1: "i", 2: "really", 3: "love", 4: "bitterballen".

Side note: right away, there's an improvement that could be made here which I miss out on: the exclamation mark, or the capitalisation of text, could be used to predict some strong emotions.

Rather than being represented as text, each sentence now becomes a vector: [1, 2, 3, 4]. A second sentence, "Bitterballen love really I!", would vectorize to [4, 3, 2, 1], and a sentence with new words would add new values [5, 6, …] to the token list.
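As a toy illustration of this idea (separate from the real pipeline below; the import path is assumed for tf.keras and may differ in your setup), fitting a tokenizer on just those two sentences produces exactly this mapping:

from tensorflow.keras.preprocessing.text import Tokenizer

toy_tokenizer = Tokenizer()
toy_tokenizer.fit_on_texts(["I really love bitterballen!", "Bitterballen love really I!"])
print(toy_tokenizer.word_index)
# {'i': 1, 'really': 2, 'love': 3, 'bitterballen': 4} - punctuation stripped, text lowercased
print(toy_tokenizer.texts_to_sequences(["I really love bitterballen!",
                                        "Bitterballen love really I!"]))
# [[1, 2, 3, 4], [4, 3, 2, 1]]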

tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)

# Keep only the 9,000 most frequent words, following
# https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/discussion/91240
num_words = 9000
tokenizer.word_index = {e: i for e, i in tokenizer.word_index.items() if i <= num_words}  # <= because the tokenizer is 1-indexed
# Note: since no oov_token was passed to the Tokenizer constructor, out-of-vocabulary
# words are simply dropped from the sequences; this line just reserves an index for them.
tokenizer.word_index[tokenizer.oov_token] = num_words + 1

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)
vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of the reserved 0 index

maxlen = 20
X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)
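To make the effect of maxlen concrete (the exact numbers will differ from run to run), the padded output now looks like this:

# Every comment is now a fixed-length row of 20 token ids,
# truncated or zero-padded at the end ('post')
print(X_train.shape)    # (number_of_training_comments, 20)
print(X_train[0])       # e.g. [ 12 431   7 ...   0   0   0]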

Embeddings

By tokenizing and padding, we have turned each row into a numeric vector of word tokens, e.g. [1 52 3 10 3 0 0 0 0 0]. Being a vector, this is something we can use as input to an ML model. However, there is an even better way to represent the data.

Embeddings are large pre-trained mappings of words into an n-dimensional space, in which related words end up close together. For example, as shown below, if we trained a word-association model on a large amount of text, one of the dimensions it would pick up is gender. Along this dimension, the relationship between "man" and "woman" looks just like the relationship between "king" and "queen". When using an embedding, you enable the model you train to make these distinctions immediately. This helps the model train faster and more effectively, eliminating the confusion that synonyms and more abstract text would otherwise create: the model can treat words the embedding considers similar in a similar way, and produce similar outputs. For a more intuitive, in-depth explanation, try Will's.

The gender dimension (source: https://nlp.stanford.edu/projects/glove/)

So imagine this image, but with 50 to 300 dimensions - gender, colour, mood, and so on. That's what Stanford's GloVe project provides, and it is the embedding source used in this example. Google's BERT uses pre-trained embeddings, too.
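To make the "man is to woman as king is to queen" idea concrete, here is a minimal sketch (not part of the main script) that reads just those four words from the same GloVe file used further below and checks the analogy with cosine similarity:

import numpy as np

def load_glove_subset(filepath, wanted_words):
    # Read only the GloVe vectors for the words we care about
    vectors = {}
    with open(filepath, encoding='utf-8') as f:
        for line in f:
            word, *values = line.split()
            if word in wanted_words:
                vectors[word] = np.array(values, dtype=np.float32)
    return vectors

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

glove = load_glove_subset('../../data/embedding/glove/glove.6B.300d.txt',
                          {'king', 'queen', 'man', 'woman'})
analogy = glove['king'] - glove['man'] + glove['woman']
print(cosine(analogy, glove['queen']))  # high similarity - 'queen' is where the analogy lands
print(cosine(analogy, glove['man']))    # noticeably lower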

In the code below, there is a function create_embedding_matrix to process the embedding file into a matrix. This is shamelessly obtained from an excellent tutorial on text classification with keras.

def create_embedding_matrix(filepath, word_index, embedding_dim):
    vocab_size = len(word_index) + 1
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    with open(filepath,encoding='utf-8') as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word]
                embedding_matrix[idx] = np.array(
                    vector, dtype=np.float32)[:embedding_dim]

    return embedding_matrix
    
embedding_dim = 300
embedding_matrix = create_embedding_matrix('../../data/embedding/glove/glove.6B.300d.txt', tokenizer.word_index, embedding_dim)

nonzero_elements = np.count_nonzero(np.count_nonzero(embedding_matrix, axis=1))
embedding_accuracy = nonzero_elements / vocab_size
print('embedding accuracy: ' + str(embedding_accuracy))

Here, the main choices to explain are the embedding source and its dimension: I use Stanford's glove.6B vectors at 300 dimensions (GloVe also ships 50-, 100- and 200-dimensional versions). The last three lines simply report what fraction of our vocabulary actually has a non-zero GloVe vector - the 'embedding accuracy', or really the embedding's coverage of our corpus.

So, as a result of embedding, in addition to the input arrays that map each sentence to token IDs,

[1 52 3 10 3 0 0 0 0 0]

we now have a huge matrix mapping each token ID to its dimensional values. Rather than just a list of words, we essentially have a matrix of meanings for every word in our corpus:

[0     dim_1 dim_2 ... dim_n
 1     dim_1 dim_2 ... dim_n
 2     dim_1 dim_2 ... dim_n
 ...
 9000  0     0     ... 0    ]

with each dim_ value being an actual representation of that word's value in a dimension, e.g. gender:

[0     0.1  0.8  ... 0.5
 1     0.3  0.2  ... 0.0
 2     0.0  0.4  ... 0.1
 ...
 9000  0    0    ... 0  ]
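In practice, row i of this matrix holds the GloVe vector for the word with token id i, so a word's 'meaning' is just a row lookup. A small illustration, assuming the chosen word made it into our 9,000-word vocabulary:

word = 'love'                          # assumed to be in tokenizer.word_index
token_id = tokenizer.word_index[word]
print(embedding_matrix[token_id][:5])  # the first few of its 300 dimension values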

The neural network

model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], input_length=maxlen, trainable=True))

Firstly, we add an embedding layer that accepts the embedding matrix we just made. trainable=True here because 1. the GloVe embedding has no vectors for the last ~5% of words in our corpus, which may be very useful for the model, and 2. Reddit language use may differ from the text GloVe was trained on, and the embedding can be fine-tuned during training to reflect that.

model.add(layers.Conv1D(256, 3, activation='relu'))

Secondly, a convolutional layer to turn this into a CNN. It essentially behaves as a window of 3 words sliding over the text, learning 256 different filters - in other words, it trains the model to look for 256 different 'meanings' as it slides over the text bit by bit. I did try different values for both, and these amounts worked well.
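To check the shape arithmetic (a standalone sanity check, not part of the model being built): each comment enters this layer as a (20, 300) matrix, the size-3 window fits in 20 - 3 + 1 = 18 positions, and 256 filters are learned, so the layer outputs (18, 256) per comment.

from tensorflow.keras import Sequential, layers

probe = Sequential([
    layers.Input(shape=(20, 300)),           # maxlen tokens, each a 300-d embedding
    layers.Conv1D(256, 3, activation='relu')
])
print(probe.output_shape)                    # (None, 18, 256)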

model.add(Dropout(0.2))

A dropout layer (a regularizer would have been an option here, too). I found that with such a large dataset the network was overfitting right after one epoch. To counter this, the dropout layer simply deactivates a random 20% of the nodes per batch, making sure we don't rely too heavily on any one node in the network. This decreases overfitting, and most runs now reach their best performance after around 10 epochs rather than 1. It works hand-in-hand with the learning rate and the number of epochs chosen.

model.add(layers.GlobalMaxPooling1D())

This is standard practice in CNNs: it reduces the incoming feature maps by keeping, for each of the 256 filters, only its maximum value across all positions in the text.
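As a tiny made-up illustration of that: if a comment produced the following feature map (positions as rows, filters as columns), global max pooling keeps only the column-wise maxima.

import numpy as np

feature_map = np.array([[0.1, 0.9, 0.0],
                        [0.7, 0.2, 0.3],
                        [0.4, 0.5, 0.8]])   # shape (positions, filters)
print(feature_map.max(axis=0))              # [0.7 0.9 0.8] - one value per filter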

model.add(layers.Dense(112, activation='relu'))
model.add(layers.Dense(28, activation='sigmoid'))
opt = optimizers.Adam(lr=0.0002)
model.compile(optimizer=opt, loss='binary_crossentropy')
model.summary()

Again, it is standard practice to have another small dense relu layer before a sigmoid output layer for classification. Finally, the model is compiled with an Adam optimizer and a binary_crossentropy loss function, both of which are your bread and butter for binary classification networks, and which also work for multi-label, multi-class classification (each of the 28 sigmoid outputs is effectively its own binary classifier). A learning rate of 0.0002 was chosen through trial and error. As mentioned above, you can play around with this, the number of epochs, and the dropout rate to affect the speed of the network's learning, its overfitting, and the overall end result.

callbacks = [EarlyStopping(monitor='val_loss', patience=2),
         ModelCheckpoint(filepath='best_model.h5', monitor='val_loss', save_best_only=True)]
res = model.fit(X_train, y_train, epochs=15, verbose=True, callbacks=callbacks, validation_data=(X_test, y_test), batch_size=100)

Now we add some callbacks that let us keep the best model and stop training early if the network starts getting worse on the validation data. The batch size of 100 was also found through trial and error.

Picking a sigmoid threshold

y_pred = model.predict(X_test)    # sigmoid outputs for the held-out test set

thresholds = [0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
for val in thresholds:
    pred = y_pred.copy()

    # Turn the sigmoid outputs into 0/1 labels at this threshold
    pred[pred >= val] = 1
    pred[pred < val] = 0

    precision = precision_score(y_test, pred, average='micro')
    recall = recall_score(y_test, pred, average='micro')
    f1 = f1_score(y_test, pred, average='micro')

    print("Threshold: {:.4f}, Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(val, precision, recall, f1))

This outputs the precision, recall, and resulting F1 score (the harmonic mean of precision and recall) for each threshold. We could be even fancier here and optimise a separate threshold for each emotion, depending on exactly which emotions we wanted to detect, but I don't - see the sketch below for what that could look like.
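For completeness, here is a minimal sketch of that per-emotion tuning (not something done in this writeup): for each emotion column, pick the candidate threshold that maximises that emotion's F1 on the test set.

import numpy as np
from sklearn.metrics import f1_score

candidate_thresholds = [0.1, 0.2, 0.25, 0.3, 0.4, 0.5]
per_emotion_thresholds = []
for i in range(y_test.shape[1]):
    # F1 for this emotion at each candidate threshold, keeping the best one
    scores = [f1_score(y_test[:, i], (y_pred[:, i] >= t).astype(int))
              for t in candidate_thresholds]
    per_emotion_thresholds.append(candidate_thresholds[int(np.argmax(scores))])
print(per_emotion_thresholds)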

Results

After a modest 12 epochs, the network reaches its best validation loss, and we obtain an F1 score of almost 42%.

training

Using the above threshold loop, it looks like at a threshold of 0.25, we get the optimal F1 score:

thresholds

Using this threshold of 0.25, we can determine the F1 score for each individual emotion:

column_names = list(df.columns[9:])
threshold = 0.25
f1_scores = []
for i in range(0, 27):
    emotion_prediction = y_pred[:, i].copy()   # copy so the raw predictions aren't overwritten
    emotion_prediction[emotion_prediction >= threshold] = 1
    emotion_prediction[emotion_prediction < threshold] = 0
    emotion_test = y_test[:, i]
    precision = precision_score(emotion_test, emotion_prediction)
    recall = recall_score(emotion_test, emotion_prediction)
    f1 = f1_score(emotion_test, emotion_prediction)
    f1_scores.append(f1)
    print("Emotion: {}, Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(column_names[i], precision, recall, f1))

import matplotlib.pyplot as plt
fig = plt.figure()
plt.bar(column_names[:27], f1_scores)   # the 27 emotions scored in the loop above
plt.xticks(rotation=90)
plt.show()

Some interesting results here, when graphing emotions’ F1 scores:

emotion_performance

As anticipated, some emotions (amusement, gratitude, love) are way easier to predict than subtle emotions like disappointment, realization, relief.

Taking this model further

As mentioned throughout the writeup, there are several ways we could improve on this model: using punctuation and capitalisation as additional signals, tuning a separate sigmoid threshold per emotion, and spending more time on hyperparameters such as the learning rate, dropout rate, number of filters, and number of epochs.
