{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "Ydg2bq1IE1P6" }, "source": [ "#
CS568: Deep Learning
Spring 2020
" ] }, { "cell_type": "markdown", "metadata": { "id": "h-ds3_L4E38m" }, "source": [ "## Sentiment Analysis Using LSTM\n", "In this recitation, we will use LSTM cell to classify the sentiment of a piece of text using Pytorch.\n", "\n", "The changes we make in the previous notebook are:\n", "+ Used packed padded sequences\n", "+ Pre-trained word embeddings\n", "+ LSTM cell \n", "+ Dropout\n", "+ Adam optimizer" ] }, { "cell_type": "markdown", "metadata": { "id": "a1m9lJwmE6q0" }, "source": [ "### Load dataset\n", "Pytorch uses torchtext to preprocess raw text data. NLP projects require these steps for preprocessing:\n", "\n", "+ Read the data from disk\n", "+ Tokenize the text\n", "+ Create a mapping from word to a unique integer\n", "+ Convert the text into lists of integers\n", "+ Load the data in whatever format your deep learning framework requires\n", "+ Pad the text so that all the sequences are the same length, so you can process them in batch\n", "\n", "**Torchtext** is a library that makes all of the above processing much easier. \n", "\n", "**[Spacy](https://spacy.io/)** is a library that has been specifically built to take sentences in various languages and split them into different tokens.\n", "\n", "how to tokenize the text using Torchtext and Spacy?\n", "\n", "![alt text](1.png)\n", "\n", "Packed padded sequences will make RNN or LSTM only process the non-padded elements of sequence, and for any padded element the output will be a zero tensor.\n", "\n", "To use packed padded sequences, need to record the length of the actual sentences. \n", "\n", "Use include_lengths = True for our TEXT field, it will return tuple with the first element being our sentence (a numericalized tensor that has been padded) and the second element being the actual lengths of our sentences." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "bVLTIlXUE7Gl" }, "outputs": [], "source": [ "import torch\n", "import random\n", "from torchtext import data\n", "seed = 1234\n", "\n", "torch.manual_seed(seed)\n", "text = data.Field(tokenize = 'spacy', include_lengths = True)\n", "label = data.LabelField(dtype = torch.float)" ] }, { "cell_type": "markdown", "metadata": { "id": "F2PS9oiiFGRT" }, "source": [ "Download the IMDb dataset using torchtext. The IMDb dataset consists of 50,000 movie reviews, each marked as being a positive or negative review. This code automatically downloads the dataset with train and test splits. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "8H-r4mgZFDFF" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "aclImdb_v1.tar.gz: 0%| | 0.00/84.1M [00:00 token and the other is a < pad > token.\n", " \n", "Torchtext has its own class called Vocab for handling the vocabulary. The Vocab class holds a mapping from word to id in its **stoi** attribute and a reverse mapping in its **itos** attribute. In addition to this, it can automatically build an embedding matrix for you using pretrained embeddings such as GloVe.\n", "\n", "Print the 5 most common words in the vocab. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "id": "A9RpEMQdGftu" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('the', 202874), (',', 192507), ('.', 166290), ('a', 109355), ('and', 109273)]\n" ] } ], "source": [ "print(text.vocab.freqs.most_common(5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To view the vocab, use the stoi (string to int) or itos (int to string) attributes. 
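
As a rough illustration of what these two mappings hold, here is a minimal plain-Python sketch. It does not use torchtext: the toy token list, the names `specials`, `ordered` and `numericalize`, and the tie-breaking rule are assumptions made for this example only.

```python
from collections import Counter

# Toy corpus; in the notebook the tokens come from the IMDb reviews.
tokens = ['the', 'film', 'was', 'great', 'the', 'end']

# Count word frequencies, similar in spirit to the vocab's freqs counter.
freqs = Counter(tokens)

# itos: list mapping index -> token. Special tokens come first, then
# words by descending frequency (ties broken alphabetically here).
specials = ['<unk>', '<pad>']
ordered = sorted(freqs, key=lambda w: (-freqs[w], w))
itos = specials + ordered

# stoi: dict mapping token -> index, the inverse of itos.
stoi = {word: idx for idx, word in enumerate(itos)}

def numericalize(words):
    # Words missing from the vocab fall back to the index of '<unk>'.
    return [stoi.get(w, stoi['<unk>']) for w in words]

ids = numericalize(['the', 'film', 'was', 'boring'])
print(itos)
print(ids)
```

The word `boring` never appears in the toy corpus, so it is mapped to the index of the unknown token.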
" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "HR1kBZnIHYlU" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['', '', 'the', ',', '.', 'a', 'and', 'of', 'to', 'is']\n", "defaultdict(None, {'neg': 0, 'pos': 1})\n" ] } ], "source": [ "print(text.vocab.itos[:10])\n", "print(label.vocab.stoi)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use **BuketIterator** to return batches of similar length sentences. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "id": "qkWM0tAUHaaC" }, "outputs": [], "source": [ "batch_size = 64\n", "\n", "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n", "\n", "train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(\n", " (training_data, validation_data, testing_data), batch_size = batch_size,\n", " sort_within_batch = True,\n", " device = device)" ] }, { "cell_type": "markdown", "metadata": { "id": "kmUgptIFH5vt" }, "source": [ "### Define model\n", "\n", "The LSTM looks like:\n", "\n", "![alt text](3.png)\n", "\n", "LSTM returns the output and a tuple of the final hidden state and then final cell state whereas the RNN returns the output and the final hidden state. \n", "\n", "To use packed padded sequences, use nn.utils.rnn.packed_padded_sequence which process the non-padded elements of sequence. The RNN will then return packed_output (a packed sequence) as well as the hidden and cell states tensors. 
\n", "\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "id": "CWMwUpwfH2Rs" }, "outputs": [], "source": [ "import torch.nn as nn\n", "\n", "class RNN(nn.Module):\n", "    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, dropout, pad_idx):\n", "        super().__init__()\n", "        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)\n", "        #note: with a single layer, the dropout argument of nn.LSTM has no effect\n", "        self.rnn = nn.LSTM(embedding_dim, hidden_dim, dropout=dropout)\n", "        self.fc = nn.Linear(hidden_dim, output_dim)\n", "        self.dropout = nn.Dropout(dropout)\n", "\n", "    def forward(self, text, text_lengths):\n", "        #text = [sent len, batch size]\n", "        embedded = self.dropout(self.embedding(text))\n", "        #embedded = [sent len, batch size, emb dim]\n", "\n", "        #pack sequence (recent PyTorch versions require the lengths on the CPU)\n", "        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu())\n", "        packed_output, (hidden, cell) = self.rnn(packed_embedded)\n", "\n", "        #unpack sequence\n", "        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)\n", "\n", "        #output = [sent len, batch size, hid dim]\n", "        #output over padding tokens are zero tensors\n", "        #hidden = [1, batch size, hid dim] (one layer, one direction)\n", "        #cell = [1, batch size, hid dim]\n", "\n", "        #take the final hidden state of the last (only) layer\n", "        hidden = self.dropout(hidden[-1])\n", "        #hidden = [batch size, hid dim]\n", "        out = self.fc(hidden)\n", "        #out = [batch size, output dim]\n", "        return out" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To ensure the pre-trained vectors can be loaded into the model, embedding_dim must equal the dimension of the pre-trained GloVe vectors loaded earlier."
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "id": "d4N4x38mIL3m" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/rnn.py:50: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.5 and num_layers=1\n", " \"num_layers={}\".format(dropout, num_layers))\n" ] } ], "source": [ "input_dim = len(text.vocab)\n", "embedding_dim = 100\n", "hidden_dim = 256\n", "output_dim = 1\n", "dropout = 0.5\n", "pad_idx = text.vocab.stoi[text.pad_token] # index of the pad token\n", "\n", "model = RNN(input_dim, embedding_dim, hidden_dim, output_dim, dropout, pad_idx)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "Ea9zgx_CI6Y5" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The model has 2,867,049 trainable parameters\n" ] } ], "source": [ "def count_parameters(model):\n", "    return sum(p.numel() for p in model.parameters() if p.requires_grad)\n", "\n", "print(f'The model has {count_parameters(model):,} trainable parameters')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Replace the initial weights of the embedding layer with the pre-trained embeddings."
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "id": "reIaiqQyI-p6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pretrained embedding shape: torch.Size([25002, 100])\n" ] }, { "data": { "text/plain": [ "tensor([[-0.1117, -0.4966, 0.1631, ..., 1.2647, -0.2753, -0.1325],\n", " [-0.8555, -0.7208, 1.3755, ..., 0.0825, -1.1314, 0.3997],\n", " [-0.0382, -0.2449, 0.7281, ..., -0.1459, 0.8278, 0.2706],\n", " ...,\n", " [ 0.4979, -0.8359, -0.3487, ..., 0.3010, 0.8162, -0.2722],\n", " [-0.5121, 0.2059, 0.5721, ..., -0.7374, -0.2360, 0.3210],\n", " [ 0.9147, 0.2111, -0.0431, ..., 0.3362, 0.3323, 0.0240]])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pretrained_embeddings = text.vocab.vectors\n", "print(\"pretrained embedding shape:\",pretrained_embeddings.shape)\n", "model.embedding.weight.data.copy_(pretrained_embeddings)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set the rows of the embedding matrix for the < unk > and < pad > tokens to zeros."
] }, { "cell_type": "code", "execution_count": 13, "metadata": { "id": "jt9QEdqaJBBK" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tensor([[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],\n", " [ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],\n", " [-0.0382, -0.2449, 0.7281, ..., -0.1459, 0.8278, 0.2706],\n", " ...,\n", " [ 0.4979, -0.8359, -0.3487, ..., 0.3010, 0.8162, -0.2722],\n", " [-0.5121, 0.2059, 0.5721, ..., -0.7374, -0.2360, 0.3210],\n", " [ 0.9147, 0.2111, -0.0431, ..., 0.3362, 0.3323, 0.0240]])\n" ] } ], "source": [ "unk_idx = text.vocab.stoi[text.unk_token]\n", "model.embedding.weight.data[unk_idx] = torch.zeros(embedding_dim)\n", "model.embedding.weight.data[pad_idx] = torch.zeros(embedding_dim)\n", "print(model.embedding.weight.data)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "id": "z8EoX8XgJG78" }, "outputs": [], "source": [ "import torch.optim as optim\n", "\n", "optimizer = optim.Adam(model.parameters())\n", "criterion = nn.BCEWithLogitsLoss()\n", "model = model.to(device)\n", "criterion = criterion.to(device)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "id": "qnkmb-0qJT-r" }, "outputs": [], "source": [ "def binary_accuracy(preds, y):\n", "    #apply the sigmoid to the logits, then round to the closest integer (0 or 1)\n", "    rounded_preds = torch.round(torch.sigmoid(preds))\n", "    correct = (rounded_preds == y).float() #convert into float for division\n", "    acc = correct.sum() / len(correct)\n", "    return acc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "batch.text is now a tuple whose first element is the numericalized tensor and whose second element holds the actual length of each sequence. 
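
The shape of such a tuple can be sketched by hand in plain Python. This only illustrates the idea and is not how torchtext builds its batches: `pad_batch` and `PAD_IDX` are hypothetical names, and the padding index of 1 is an assumption for this example.

```python
PAD_IDX = 1  # assumed index of the padding token

def pad_batch(sequences, pad_idx=PAD_IDX):
    # Sort longest first, mirroring sort_within_batch = True.
    ordered = sorted(sequences, key=len, reverse=True)
    # Record the true length of every sequence before padding.
    lengths = [len(seq) for seq in ordered]
    max_len = lengths[0]
    # Right-pad each sequence up to the longest one in the batch.
    padded = [seq + [pad_idx] * (max_len - len(seq)) for seq in ordered]
    return padded, lengths

# Unpack the pair the same way the training loop unpacks batch.text.
text, text_lengths = pad_batch([[5, 9], [7, 3, 8, 2], [4]])
print(text)
print(text_lengths)
```

Here each row is one sequence, whereas the tensors in this notebook are laid out as [sent len, batch size].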
" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "id": "izjDK96pJYVC" }, "outputs": [], "source": [ "def train(model, iterator, optimizer, criterion): \n", " epoch_loss = 0\n", " epoch_acc = 0 \n", " model.train() \n", " for batch in iterator: \n", " optimizer.zero_grad() \n", " text, text_lengths = batch.text \n", " predictions = model(text, text_lengths).squeeze() \n", " loss = criterion(predictions, batch.label) \n", " acc = binary_accuracy(predictions, batch.label) \n", " loss.backward() \n", " optimizer.step() \n", " epoch_loss += loss.item()\n", " epoch_acc += acc.item() \n", " return epoch_loss / len(iterator), epoch_acc / len(iterator)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "id": "B9xErym0Je38" }, "outputs": [], "source": [ "def evaluate(model, iterator, criterion): \n", " epoch_loss = 0\n", " epoch_acc = 0 \n", " model.eval() \n", " with torch.no_grad(): \n", " for batch in iterator:\n", " text, text_lengths = batch.text \n", " predictions = model(text, text_lengths).squeeze() \n", " loss = criterion(predictions, batch.label) \n", " acc = binary_accuracy(predictions, batch.label)\n", " epoch_loss += loss.item()\n", " epoch_acc += acc.item() \n", " return epoch_loss / len(iterator), epoch_acc / len(iterator)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "id": "P8H9TB6nJjb7" }, "outputs": [], "source": [ "import time\n", "\n", "def epoch_time(start_time, end_time):\n", " elapsed_time = end_time - start_time\n", " elapsed_mins = int(elapsed_time / 60)\n", " elapsed_secs = int(elapsed_time - (elapsed_mins * 60))\n", " return elapsed_mins, elapsed_secs" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "id": "zaqhMQqGJl59" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch: 01 | Epoch Time: 0m 11s\n", "\tTrain Loss: 0.689 | Train Acc: 55.08%\n", "\t Val. Loss: 0.672 | Val. 
Acc: 59.33%\n", "Epoch: 02 | Epoch Time: 0m 10s\n", "\tTrain Loss: 0.662 | Train Acc: 60.52%\n", "\t Val. Loss: 0.623 | Val. Acc: 66.60%\n", "Epoch: 03 | Epoch Time: 0m 10s\n", "\tTrain Loss: 0.571 | Train Acc: 71.95%\n", "\t Val. Loss: 0.500 | Val. Acc: 78.57%\n", "Epoch: 04 | Epoch Time: 0m 11s\n", "\tTrain Loss: 0.494 | Train Acc: 77.18%\n", "\t Val. Loss: 0.539 | Val. Acc: 75.42%\n", "Epoch: 05 | Epoch Time: 0m 10s\n", "\tTrain Loss: 0.377 | Train Acc: 83.82%\n", "\t Val. Loss: 0.303 | Val. Acc: 87.79%\n" ] } ], "source": [ "num_epochs = 5\n", "#start from infinity so that the first epoch always saves a checkpoint\n", "best_valid_loss = float('inf')\n", "for epoch in range(num_epochs):\n", "    start_time = time.time()\n", "    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)\n", "    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)\n", "    end_time = time.time()\n", "    epoch_mins, epoch_secs = epoch_time(start_time, end_time)\n", "    #keep the weights with the lowest validation loss seen so far\n", "    if valid_loss < best_valid_loss:\n", "        best_valid_loss = valid_loss\n", "        torch.save(model.state_dict(), 'sentiment-analyis-RNN-model.pt')\n", "\n", "    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')\n", "    print(f'\\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')\n", "    print(f'\\t Val. Loss: {valid_loss:.3f} | Val. 
Acc: {valid_acc*100:.2f}%')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "id": "T-tSHa48JuRG" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test Loss: 0.320 | Test Acc: 86.46%\n" ] } ], "source": [ "test_loss, test_acc = evaluate(model, test_iterator, criterion)\n", "print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 4 }