{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "GwyCgZ0NQdkD" }, "source": [ "# CS568: Deep Learning
Spring 2020
" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "AnhFhLUiQdvB" }, "source": [ "## Sentiment Analysis Using RNN\n", "In this recitation, we will use an RNN to classify the sentiment of a piece of text using PyTorch." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "c_d02nPIQcyg" }, "source": [ "### Load dataset\n", "PyTorch uses torchtext to preprocess raw text data. NLP projects typically require these preprocessing steps:\n", "\n", "+ Read the data from disk\n", "+ Tokenize the text\n", "+ Create a mapping from each word to a unique integer\n", "+ Convert the text into lists of integers\n", "+ Load the data in whatever format your deep learning framework requires\n", "+ Pad the text so that all the sequences are the same length, so you can process them in batches\n", "\n", "**Torchtext** is a library that makes all of the above processing much easier. \n", "\n", "**[spaCy](https://spacy.io/)** is a library built specifically to take sentences in various languages and split them into tokens.\n", "\n", "How do we tokenize the text using Torchtext and spaCy?\n", "\n", "![alt text](1.png)" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "wD-o0jvNQZYJ" }, "outputs": [], "source": [ "import torch\n", "import random\n", "from torchtext import data\n", "\n", "# fix the seed so that results are reproducible\n", "seed = 1234\n", "torch.manual_seed(seed)\n", "\n", "# text is tokenized with spaCy; labels are stored as floats for the loss function\n", "text = data.Field(tokenize = 'spacy')\n", "label = data.LabelField(dtype = torch.float)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "hY3LraDURqnf" }, "source": [ "Download the IMDb dataset using torchtext. The IMDb dataset consists of 50,000 movie reviews, each marked as being a positive or negative review. This code automatically downloads the dataset with train and test splits. 
" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 85 }, "colab_type": "code", "id": "e9AsRrm0QoNT", "outputId": "c344a3b0-5eb2-480d-c692-51703af52041" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "downloading aclImdb_v1.tar.gz\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:07<00:00, 10.8MB/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Number of training examples: 25000\n", "Number of testing examples: 25000\n" ] } ], "source": [ "from torchtext import datasets\n", "\n", "training_data, testing_data = datasets.IMDB.splits(text, label)\n", "print('Number of training examples: ',len(training_data))\n", "print('Number of testing examples: ', len(testing_data))" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 54 }, "colab_type": "code", "id": "5ap6-g0ZQ72X", "outputId": "9e17b3e0-e24a-42db-dcc4-bc14ad299611" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'text': ['Interesting', 'mix', 'of', 'comments', 'that', 'it', 'would', 'be', 'hard', 'to', 'add', 'anything', 'constructive', 'to', '.', 'However', ',', 'i', \"'ll\", 'try', '.', 'This', 'was', 'a', 'very', 'good', 'action', 'film', 'with', 'some', 'great', 'set', 'pieces', '.', 'You', \"'ll\", 'note', 'I', 'specified', 'the', 'genre', '.', 'I', 'did', \"n't\", 'snipe', 'about', 'the', 'lack', 'of', 'characterisation', ',', 'and', 'I', 'did', \"n't\", 'berate', 'the', 'acting', '.', 'Enjoy', 'if', 'for', 'what', 'it', 'is', 'people', ',', 'a', 'well', 'above', 'average', 'action', 'film', '.', 'I', 'could', 'go', 'on', 'but', 'I', \"'ve\", 'made', 'my', 'comment', '.'], 'label': 'pos'}\n" ] } ], "source": [ "print(vars(training_data.examples[0]))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": 
"iP08J_qzSBU2" }, "source": [ "Create a validation set using .split() method. " ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 68 }, "colab_type": "code", "id": "ecizUX9kRmxE", "outputId": "d8359cee-3b35-4825-e2b5-7732afbf0df6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of training examples: 17500\n", "Number of validation examples: 7500\n", "Number of testing examples: 25000\n" ] } ], "source": [ "training_data, validation_data = training_data.split(random_state = random.seed(SEED), split_ratio = 0.7)\n", "# split ratio to split training data into train and validation sets. By default, it splits them with 0.7 ratio. \n", "print('Number of training examples: ',len(training_data))\n", "print('Number of validation examples: ',len(validation_data))\n", "print('Number of testing examples: ',len(testing_data))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have to build a vocabulary. \n", "\n", "+ Computer cannot operate on strings, only numbers.\n", "+ The index of each word is used to construct a one-hot vector. \n", "\n", "| Word | Index | One-hot vectors |\n", "|--------|:-------------:|------------------:|\n", "| apple | 0 |[1, 0, 0, 0, 0, 0] |\n", "| orange | 1 |[0, 1, 0, 0, 0, 0] |\n", "| king | 2 |[0, 0, 1, 0, 0, 0] |\n", "| queen | 3 |[0, 0, 0, 1, 0, 0] |\n", "| cat | 4 |[0, 0, 0, 0, 1, 0] |\n", "| dog | 5 |[0, 0, 0, 0, 0, 1] |\n", "\n", "+ The vocabulary of our dataset which consists of words and their unique indexes $V = \\{apple, orange, king, queen, cat, dog\\}$\n", "+ If we have 100,000 words in our vocabulary then one-hot vectors has 100,000 dimensions. 
\n", "+ Vectors this large make training slow and may not even fit into the memory of your computer.\n", "+ The vocabulary can be cut down to the top $n$ most common words.\n", "+ Replace the words that do not appear in the vocabulary with the `<unk>` token.\n", "+ To ensure that each sentence in a batch has the same size, pad the shorter sentences with the `<pad>` token. " ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 51 }, "colab_type": "code", "id": "DcwKdoTBS294", "outputId": "b6275739-8d5c-49bc-a3af-358c90d2a32b" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Unique tokens in TEXT vocabulary: 25002\n", "Unique tokens in LABEL vocabulary: 2\n" ] } ], "source": [ "vocab_size = 25000\n", "text.build_vocab(training_data, max_size = vocab_size)\n", "label.build_vocab(training_data)\n", "print('Unique tokens in text vocab: ',len(text.vocab))\n", "print('Unique tokens in label vocab: ',len(label.vocab))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "SS4EhlJEUJ6o" }, "source": [ "Why is the vocab size 25002 and not 25000? One of the additional tokens is the `<unk>` token and the other is the `<pad>` token.\n", " \n", "Torchtext has its own class called `Vocab` for handling the vocabulary. The `Vocab` class holds a mapping from word to id in its **stoi** attribute and a reverse mapping in its **itos** attribute. In addition to this, it can automatically build an embedding matrix for you from various pretrained embeddings such as GloVe.\n", "\n", "Print the 5 most common words in the vocabulary. 
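
As an aside, the vocabulary logic described in this cell can be mimicked in a few lines of plain Python. This is only a minimal sketch of the idea, not torchtext's actual implementation: the toy `corpus` and the helper names `build_vocab` and `numericalize` are hypothetical and exist only for illustration. It shows why capping the vocabulary at `max_size` words yields `max_size + 2` entries and how unseen words map to the `<unk>` index.

```python
from collections import Counter

# Hypothetical toy corpus; the notebook itself uses the tokenized IMDb reviews.
corpus = [
    ['the', 'film', 'was', 'great'],
    ['the', 'plot', 'was', 'thin'],
    ['great', 'film'],
]

def build_vocab(tokenized_texts, max_size):
    # Count every token, then keep only the max_size most frequent ones.
    counts = Counter(tok for sent in tokenized_texts for tok in sent)
    # The two special tokens come first, so the size is max_size + 2.
    itos = ['<unk>', '<pad>'] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    # Out-of-vocabulary tokens fall back to the index of <unk>.
    return [stoi.get(tok, stoi['<unk>']) for tok in tokens]

itos, stoi = build_vocab(corpus, max_size=4)
print(len(itos))
print(numericalize(['the', 'film', 'was', 'awful'], stoi))
```

With `max_size=4` the sketch produces six entries (the four most frequent words plus the two special tokens), and the unseen word `awful` is mapped to index 0, the position of `<unk>`.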
" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "82_LTN7sTbmV", "outputId": "afc463b0-b1a0-4d10-9df8-f4a65d143206" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('the', 202286), (',', 192561), ('.', 165812), ('a', 109233), ('and', 109205)]\n" ] } ], "source": [ "print(text.vocab.freqs.most_common(5))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "vdcgpFuJUjjR" }, "source": [ "To view the vocab., use stoi (string to int) or itos (int to string) methods. \n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 51 }, "colab_type": "code", "id": "VaW-LNp_UaDZ", "outputId": "5e2ccdfa-79b6-421c-c638-7ccac9b4e994" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['', '', 'the', ',', '.', 'a', 'and', 'of', 'to', 'is']\n", "defaultdict(, {'neg': 0, 'pos': 1})\n" ] } ], "source": [ "print(text.vocab.itos[:10])\n", "print(label.vocab.stoi)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "yLrz4sqQVdus" }, "source": [ "Use **BuketIterator** to return batches of similar length sentences. 
" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "RIP-vxjqVaN5" }, "outputs": [], "source": [ "batch_size = 64\n", "\n", "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n", "\n", "train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(\n", " (training_data, validation_data, testing_data), \n", " batch_size = batch_size,\n", " device = device)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "6hl5RdsbXiex" }, "source": [ "### Define model\n", "**Word Embeddings**\n", "+ learned representaton of words\n", "+ similar meaning - similar representation\n", "+ dense\n", "+ low dimensional representation\n", "\n", "\n", "**Embedding layer** transforms sparse one-hot vector into a dense embedding vector.\n", "\n", "\n", "![alt text](2.png)\n", "\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "FrbiZmIaVzQI" }, "outputs": [], "source": [ "import torch.nn as nn\n", "\n", "class RNN(nn.Module):\n", " def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim): \n", " super().__init__() \n", " self.embedding = nn.Embedding(input_dim, embedding_dim) \n", " self.rnn = nn.RNN(embedding_dim, hidden_dim) \n", " self.fc = nn.Linear(hidden_dim, output_dim)\n", " \n", " def forward(self, text):\n", " #text = [sent len, batch size] \n", " # one-hot vector (indices of non zero values)\n", " embedded = self.embedding(text) \n", " #embedded = [sent len, batch size, emb dim], dense vectors\n", " # rnn takes first hidden state initialize with zeros (by default)\n", " output, hidden = self.rnn(embedded) \n", " #output = [sent len, batch size, hid dim]\n", " #hidden = [1, batch size, hid dim] \n", " assert torch.equal(output[-1,:,:], hidden.squeeze(0)) \n", " out = self.fc(hidden.squeeze(0))\n", " return out" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": 
"code", "id": "AYfUD8ULWn6w" }, "outputs": [], "source": [ "input_dim = len(text.vocab)\n", "embedding_dim = 100\n", "hidden_dim = 256\n", "output_dim = 1\n", "\n", "model = RNN(input_dim, embedding_dim, hidden_dim, output_dim)" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "Jp_uCwLoWz53", "outputId": "aeceea1b-a91c-433d-b59d-6b6b93f3fe71" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Trainable parameters: 2592105\n" ] } ], "source": [ "def count_parameters(model):\n", " return sum(p.numel() for p in model.parameters() if p.requires_grad)\n", "print('Trainable parameters: ', count_parameters(model))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# return accuracy per batch, 8/10 return 0.8\n", "def binary_accuracy(preds, y): \n", " rounded_preds = torch.round(torch.sigmoid(preds))\n", " correct = (rounded_preds == y).float() \n", " acc = correct.sum() / len(correct)\n", " return acc" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "EpBIr0BrXlc7" }, "source": [ "### Define optimizer" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "zIZeR2PkXUz-" }, "outputs": [], "source": [ "import torch.optim as optim\n", "\n", "optimizer = optim.SGD(model.parameters(), lr=1e-3)\n", "criterion = nn.BCEWithLogitsLoss()\n", "model = model.to(device)\n", "criterion = criterion.to(device)" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "Suca6MLkYuCp" }, "outputs": [], "source": [ "def train(model, iterator, optimizer, criterion):\n", " \n", " epoch_loss = 0\n", " epoch_acc = 0\n", " \n", " model.train()\n", " \n", " for batch in iterator: \n", " optimizer.zero_grad() \n", " predictions = model(batch.text).squeeze(1) \n", " loss = criterion(predictions, 
batch.label)\n", "        acc = binary_accuracy(predictions, batch.label)\n", "        loss.backward()\n", "        optimizer.step()\n", "        epoch_loss += loss.item()\n", "        epoch_acc += acc.item()\n", "\n", "    return epoch_loss / len(iterator), epoch_acc / len(iterator)" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "Vu63DL-8Y6xg" }, "outputs": [], "source": [ "def evaluate(model, iterator, criterion):\n", "\n", "    epoch_loss = 0\n", "    epoch_acc = 0\n", "\n", "    model.eval()\n", "\n", "    with torch.no_grad():\n", "        for batch in iterator:\n", "            predictions = model(batch.text).squeeze(1)\n", "            loss = criterion(predictions, batch.label)\n", "            acc = binary_accuracy(predictions, batch.label)\n", "            epoch_loss += loss.item()\n", "            epoch_acc += acc.item()\n", "\n", "    return epoch_loss / len(iterator), epoch_acc / len(iterator)" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "N7lEtjy0ZFVX" }, "outputs": [], "source": [ "import time\n", "\n", "def epoch_time(start_time, end_time):\n", "    elapsed_time = end_time - start_time\n", "    elapsed_mins = int(elapsed_time / 60)\n", "    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))\n", "    return elapsed_mins, elapsed_secs" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 272 }, "colab_type": "code", "id": "_5s6MDCOZJx4", "outputId": "a7d96918-dbad-4652-c3c6-6ba616aa940a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch: 01 | Epoch Time: 0m 14s\n", "\tTrain Loss: 0.694 | Train Acc: 50.33%\n", "\t Val. Loss: 0.698 | Val. Acc: 49.19%\n", "Epoch: 02 | Epoch Time: 0m 13s\n", "\tTrain Loss: 0.693 | Train Acc: 49.62%\n", "\t Val. Loss: 0.698 | Val. Acc: 49.13%\n", "Epoch: 03 | Epoch Time: 0m 14s\n", "\tTrain Loss: 0.693 | Train Acc: 50.09%\n", "\t Val. Loss: 0.698 | Val. 
Acc: 50.36%\n", "Epoch: 04 | Epoch Time: 0m 14s\n", "\tTrain Loss: 0.693 | Train Acc: 49.56%\n", "\t Val. Loss: 0.698 | Val. Acc: 49.02%\n", "Epoch: 05 | Epoch Time: 0m 13s\n", "\tTrain Loss: 0.693 | Train Acc: 50.00%\n", "\t Val. Loss: 0.698 | Val. Acc: 50.55%\n" ] } ], "source": [ "epochs = 5\n", "\n", "# start from infinity so that the first epoch always counts as an improvement\n", "best_valid_loss = float('inf')\n", "for epoch in range(epochs):\n", "    start_time = time.time()\n", "    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)\n", "    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)\n", "    end_time = time.time()\n", "    epoch_mins, epoch_secs = epoch_time(start_time, end_time)\n", "\n", "    if valid_loss < best_valid_loss:\n", "        best_valid_loss = valid_loss\n", "        #torch.save(model.state_dict(), 'model.pt')\n", "\n", "    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')\n", "    print(f'\\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')\n", "    print(f'\\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "j9WYY3hTl4H_", "outputId": "a5c0302d-ab14-46da-f98b-d041ff4d76a8" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test Loss: 0.715 | Test Acc: 45.90%\n" ] } ], "source": [ "#model.load_state_dict(torch.load('model.pt'))\n", "test_loss, test_acc = evaluate(model, test_iterator, criterion)\n", "print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')" ] } ], "metadata": { "accelerator": "GPU", "colab": { "name": "RNN_pytorch.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, 
"nbformat": 4, "nbformat_minor": 1 }