{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "GwyCgZ0NQdkD"
   },
   "source": [
    "# <center>CS568:Deep Learning</center>  <center>Spring 2020</center> "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "AnhFhLUiQdvB"
   },
   "source": [
    "## Sentiment Analysis Using RNN\n",
    "In this recitation, we will use RNN to classify the sentiment of a piece of text using Pytorch."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "c_d02nPIQcyg"
   },
   "source": [
    "### Load dataset\n",
    "Pytorch uses torchtext to preprocess raw text data. NLP projects require these steps for preprocessing:\n",
    "\n",
    "+ Read the data from disk\n",
    "+ Tokenize the text\n",
    "+ Create a mapping from word to a unique integer\n",
    "+ Convert the text into lists of integers\n",
    "+ Load the data in whatever format your deep learning framework requires\n",
    "+ Pad the text so that all the sequences are the same length, so you can process them in batch\n",
    "\n",
    "**Torchtext** is a library that makes all the above processing much easier. \n",
    "\n",
    "**[Spacy](https://spacy.io/)** is a library that has been specifically built to take sentences in various languages and split them into different tokens.\n",
    "\n",
    "how to tokenize the text using Torchtext and Spacy?\n",
    "\n",
    "![alt text](1.png)"
   ]
  },
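  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The preprocessing steps listed above can be sketched in plain Python before handing them over to a library. This is only an illustration under simplifying assumptions: the three reviews are made up, a whitespace split stands in for a real tokenizer such as spaCy, and the names `word_to_index` and `padded` are hypothetical, not part of torchtext.\n",
    "\n",
    "```python\n",
    "# Toy corpus (made up); a real pipeline would read the reviews from disk.\n",
    "reviews = ['this film was great', 'terrible plot', 'great acting great plot']\n",
    "\n",
    "# Tokenize: a whitespace split stands in for a real tokenizer.\n",
    "tokenized = [review.split() for review in reviews]\n",
    "\n",
    "# Map each distinct word to a unique integer; index 0 is reserved for padding.\n",
    "word_to_index = {'<pad>': 0}\n",
    "for tokens in tokenized:\n",
    "    for token in tokens:\n",
    "        word_to_index.setdefault(token, len(word_to_index))\n",
    "\n",
    "# Convert each review into a list of integers.\n",
    "numericalized = [[word_to_index[t] for t in tokens] for tokens in tokenized]\n",
    "\n",
    "# Pad every sequence to the length of the longest one.\n",
    "max_len = max(len(seq) for seq in numericalized)\n",
    "padded = [seq + [word_to_index['<pad>']] * (max_len - len(seq)) for seq in numericalized]\n",
    "\n",
    "for row in padded:\n",
    "    print(row)\n",
    "```\n",
    "\n",
    "After padding, every row has the same length, so the rows can be stacked into a single tensor and processed as one batch."
   ]
  },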
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "wD-o0jvNQZYJ"
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "import random\n",
    "from torchtext import data\n",
    "seed = 1234\n",
    "torch.manual_seed(seed)\n",
    "\n",
    "text = data.Field(tokenize = 'spacy')\n",
    "label = data.LabelField(dtype = torch.float)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "hY3LraDURqnf"
   },
   "source": [
    "Download IMDB dataset with using torchtext. The IMDb dataset consists of 50,000 movie reviews, each marked as being a positive or negative review. This code automatically download dataset with train and test splits. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 85
    },
    "colab_type": "code",
    "id": "e9AsRrm0QoNT",
    "outputId": "c344a3b0-5eb2-480d-c692-51703af52041"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "downloading aclImdb_v1.tar.gz\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:07<00:00, 10.8MB/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of training examples:  25000\n",
      "Number of testing examples:  25000\n"
     ]
    }
   ],
   "source": [
    "from torchtext import datasets\n",
    "\n",
    "training_data, testing_data = datasets.IMDB.splits(text, label)\n",
    "print('Number of training examples: ',len(training_data))\n",
    "print('Number of testing examples: ', len(testing_data))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 54
    },
    "colab_type": "code",
    "id": "5ap6-g0ZQ72X",
    "outputId": "9e17b3e0-e24a-42db-dcc4-bc14ad299611"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'text': ['Interesting', 'mix', 'of', 'comments', 'that', 'it', 'would', 'be', 'hard', 'to', 'add', 'anything', 'constructive', 'to', '.', 'However', ',', 'i', \"'ll\", 'try', '.', 'This', 'was', 'a', 'very', 'good', 'action', 'film', 'with', 'some', 'great', 'set', 'pieces', '.', 'You', \"'ll\", 'note', 'I', 'specified', 'the', 'genre', '.', 'I', 'did', \"n't\", 'snipe', 'about', 'the', 'lack', 'of', 'characterisation', ',', 'and', 'I', 'did', \"n't\", 'berate', 'the', 'acting', '.', 'Enjoy', 'if', 'for', 'what', 'it', 'is', 'people', ',', 'a', 'well', 'above', 'average', 'action', 'film', '.', 'I', 'could', 'go', 'on', 'but', 'I', \"'ve\", 'made', 'my', 'comment', '.'], 'label': 'pos'}\n"
     ]
    }
   ],
   "source": [
    "print(vars(training_data.examples[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "iP08J_qzSBU2"
   },
   "source": [
    "Create a validation set using .split() method. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 68
    },
    "colab_type": "code",
    "id": "ecizUX9kRmxE",
    "outputId": "d8359cee-3b35-4825-e2b5-7732afbf0df6"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of training examples:  17500\n",
      "Number of validation examples:  7500\n",
      "Number of testing examples:  25000\n"
     ]
    }
   ],
   "source": [
    "training_data, validation_data = training_data.split(random_state = random.seed(SEED), split_ratio = 0.7)\n",
    "# split ratio to split training data into train and validation sets. By default, it splits them with 0.7 ratio. \n",
    "print('Number of training examples: ',len(training_data))\n",
    "print('Number of validation examples: ',len(validation_data))\n",
    "print('Number of testing examples: ',len(testing_data))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have to build a vocabulary. \n",
    "\n",
    "+ Computer cannot operate on strings, only numbers.\n",
    "+ The index of each word is used to construct a one-hot vector. \n",
    "\n",
    "| Word   |      Index    |  One-hot vectors  |\n",
    "|--------|:-------------:|------------------:|\n",
    "| apple  |  0            |[1, 0, 0, 0, 0, 0] |\n",
    "| orange |  1            |[0, 1, 0, 0, 0, 0] |\n",
    "| king   |  2            |[0, 0, 1, 0, 0, 0] |\n",
    "| queen  |  3            |[0, 0, 0, 1, 0, 0] |\n",
    "| cat    |  4            |[0, 0, 0, 0, 1, 0] |\n",
    "| dog    |  5            |[0, 0, 0, 0, 0, 1] |\n",
    "\n",
    "+ The vocabulary of our dataset which consists of words and  their unique indexes $V = \\{apple, orange, king, queen, cat, dog\\}$\n",
    "+ If we have 100,000 words in our vocabulary then one-hot vectors has 100,000 dimensions. \n",
    "+ This will make training slow and won't even fit into the memory of your computer.\n",
    "+ The vocabulary can be cut down to take the top $n$  most common words.\n",
    "+ Replace the words that do not appear in vocabulary with $<unk>$ token.\n",
    "+ To ensure that each sentence in the batch has same size, pad the short sentences with $<pad>$ token. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 51
    },
    "colab_type": "code",
    "id": "DcwKdoTBS294",
    "outputId": "b6275739-8d5c-49bc-a3af-358c90d2a32b"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Unique tokens in TEXT vocabulary:  25002\n",
      "Unique tokens in LABEL vocabulary:  2\n"
     ]
    }
   ],
   "source": [
    "vocab_size = 25000\n",
    "text.build_vocab(train_data, max_size = vocab_size)\n",
    "label.build_vocab(train_data)\n",
    "print('Unique tokens in text vocab: ',len(text.vocab))\n",
    "print('Unique tokens in label vocab: ',len(label.vocab))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "SS4EhlJEUJ6o"
   },
   "source": [
    "Why is the vocab size 25002 and not 25000? One of the addition tokens is the < unk > token and the other is a < pad > token.\n",
    "    \n",
    "Torchtext has its own class called vocab for handling the vocabulary. The vocab class holds a mapping from word to id in its **stoi** attribute and a reverse mapping in its **itos** attribute. In addition to this, it can automatically build an embedding matrix for you using various pretrained embeddings like word2vec.\n",
    "\n",
    "Print most common 10 words from the vocab. "
   ]
  },
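  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see why a size cap of `max_size` yields `max_size + 2` entries, here is a minimal plain-Python sketch of a capped vocabulary. It is an illustration only and not torchtext's implementation; the sentence is made up, and `stoi` and `itos` are ordinary Python objects named after the attributes described above.\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "tokens = 'the cat sat on the mat and the dog sat too'.split()\n",
    "max_size = 2  # keep only the 2 most frequent tokens\n",
    "\n",
    "freqs = Counter(tokens)\n",
    "specials = ['<unk>', '<pad>']\n",
    "\n",
    "# itos: index -> word, with the special tokens placed first.\n",
    "itos = specials + [word for word, _ in freqs.most_common(max_size)]\n",
    "# stoi: word -> index, the reverse mapping.\n",
    "stoi = {word: index for index, word in enumerate(itos)}\n",
    "\n",
    "# Words outside the vocabulary fall back to the index of <unk>.\n",
    "ids = [stoi.get(word, stoi['<unk>']) for word in ['the', 'dog', 'sat']]\n",
    "\n",
    "print(len(itos))\n",
    "print(itos)\n",
    "print(ids)\n",
    "```\n",
    "\n",
    "The two special tokens are added on top of the cap, which is the same reason a cap of 25,000 produces 25,002 entries."
   ]
  },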
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 34
    },
    "colab_type": "code",
    "id": "82_LTN7sTbmV",
    "outputId": "afc463b0-b1a0-4d10-9df8-f4a65d143206"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('the', 202286), (',', 192561), ('.', 165812), ('a', 109233), ('and', 109205)]\n"
     ]
    }
   ],
   "source": [
    "print(text.vocab.freqs.most_common(5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "vdcgpFuJUjjR"
   },
   "source": [
    "To view the vocab., use stoi (string to int) or itos (int to string) methods. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 51
    },
    "colab_type": "code",
    "id": "VaW-LNp_UaDZ",
    "outputId": "5e2ccdfa-79b6-421c-c638-7ccac9b4e994"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['<unk>', '<pad>', 'the', ',', '.', 'a', 'and', 'of', 'to', 'is']\n",
      "defaultdict(<function _default_unk_index at 0x7fca87925c80>, {'neg': 0, 'pos': 1})\n"
     ]
    }
   ],
   "source": [
    "print(text.vocab.itos[:10])\n",
    "print(label.vocab.stoi)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "yLrz4sqQVdus"
   },
   "source": [
    "Use **BuketIterator** to return batches of similar length sentences. "
   ]
  },
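  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The benefit of grouping sentences of similar length can be shown with a small calculation. This is a simplified sketch of the idea only, not how `BucketIterator` is implemented; the sentence lengths are made up and `padded_slots` is a hypothetical helper.\n",
    "\n",
    "```python\n",
    "def padded_slots(lengths, batch_size):\n",
    "    # Total tensor slots needed when each batch is padded to its longest sentence.\n",
    "    total = 0\n",
    "    for start in range(0, len(lengths), batch_size):\n",
    "        batch = lengths[start:start + batch_size]\n",
    "        total += len(batch) * max(batch)\n",
    "    return total\n",
    "\n",
    "lengths = [2, 6, 1, 5, 3, 2]  # hypothetical sentence lengths\n",
    "batch_size = 2\n",
    "real_tokens = sum(lengths)\n",
    "\n",
    "# Batches formed in arrival order versus batches formed after sorting by length.\n",
    "arrival_order = padded_slots(lengths, batch_size)\n",
    "bucketed = padded_slots(sorted(lengths), batch_size)\n",
    "\n",
    "print('padding tokens in arrival order:', arrival_order - real_tokens)\n",
    "print('padding tokens after sorting by length:', bucketed - real_tokens)\n",
    "```\n",
    "\n",
    "Fewer padding tokens means less wasted computation per batch."
   ]
  },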
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "RIP-vxjqVaN5"
   },
   "outputs": [],
   "source": [
    "batch_size = 64\n",
    "\n",
    "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
    "\n",
    "train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(\n",
    "    (training_data, validation_data, testing_data), \n",
    "    batch_size = batch_size,\n",
    "    device = device)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "6hl5RdsbXiex"
   },
   "source": [
    "### Define model\n",
    "**Word Embeddings**\n",
    "+ learned representaton of words\n",
    "+ similar meaning - similar representation\n",
    "+ dense\n",
    "+ low dimensional representation\n",
    "\n",
    "\n",
    "**Embedding layer** transforms sparse one-hot vector into a dense embedding vector.\n",
    "\n",
    "\n",
    "![alt text](2.png)\n",
    "\n"
   ]
  },
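  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to see what the embedding layer does is to compare it with an explicit one-hot multiplication. The sketch below assumes a six-word vocabulary like the toy table earlier and an arbitrary embedding size of 3; looking up a row of `nn.Embedding` gives the same result as multiplying the one-hot vector by the layer's weight matrix.\n",
    "\n",
    "```python\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.nn.functional as F\n",
    "\n",
    "torch.manual_seed(0)\n",
    "vocab_size, embedding_dim = 6, 3  # toy sizes chosen for illustration\n",
    "embedding = nn.Embedding(vocab_size, embedding_dim)\n",
    "\n",
    "indices = torch.tensor([2, 3])  # e.g. 'king' and 'queen' in the toy table\n",
    "one_hot = F.one_hot(indices, num_classes=vocab_size).float()\n",
    "\n",
    "via_lookup = embedding(indices)          # direct row lookup\n",
    "via_matmul = one_hot @ embedding.weight  # one-hot vectors times the weight matrix\n",
    "\n",
    "print(via_lookup.shape)\n",
    "print(torch.allclose(via_lookup, via_matmul))\n",
    "```\n",
    "\n",
    "The lookup avoids ever materializing the one-hot vectors, which is why the layer takes integer indices as input."
   ]
  },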
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "FrbiZmIaVzQI"
   },
   "outputs": [],
   "source": [
    "import torch.nn as nn\n",
    "\n",
    "class RNN(nn.Module):\n",
    "    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):        \n",
    "        super().__init__()        \n",
    "        self.embedding = nn.Embedding(input_dim, embedding_dim)        \n",
    "        self.rnn = nn.RNN(embedding_dim, hidden_dim)        \n",
    "        self.fc = nn.Linear(hidden_dim, output_dim)\n",
    "        \n",
    "    def forward(self, text):\n",
    "        #text = [sent len, batch size]   \n",
    "        # one-hot vector (indices of non zero values)\n",
    "        embedded = self.embedding(text)        \n",
    "        #embedded = [sent len, batch size, emb dim], dense vectors\n",
    "        # rnn takes first hidden state initialize with zeros (by default)\n",
    "        output, hidden = self.rnn(embedded)        \n",
    "        #output = [sent len, batch size, hid dim]\n",
    "        #hidden = [1, batch size, hid dim]   \n",
    "        assert torch.equal(output[-1,:,:], hidden.squeeze(0)) \n",
    "        out = self.fc(hidden.squeeze(0))\n",
    "        return out"
   ]
  },
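  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The recurrence that `nn.RNN` applies inside the model above can be written out by hand. With the default `tanh` nonlinearity, each step computes $h_t = \\tanh(W_{ih} x_t + b_{ih} + W_{hh} h_{t-1} + b_{hh})$. The sketch below uses small, arbitrary sizes and random inputs (assumptions for illustration) and checks that a manual loop reproduces the final hidden state.\n",
    "\n",
    "```python\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "\n",
    "torch.manual_seed(0)\n",
    "sent_len, batch_size, emb_dim, hid_dim = 4, 2, 3, 5  # toy sizes\n",
    "rnn = nn.RNN(emb_dim, hid_dim)\n",
    "embedded = torch.randn(sent_len, batch_size, emb_dim)\n",
    "\n",
    "with torch.no_grad():\n",
    "    output, hidden = rnn(embedded)\n",
    "\n",
    "    # Manual recurrence, starting from a zero hidden state.\n",
    "    h = torch.zeros(batch_size, hid_dim)\n",
    "    for x_t in embedded:  # iterate over time steps\n",
    "        h = torch.tanh(x_t @ rnn.weight_ih_l0.T + rnn.bias_ih_l0\n",
    "                       + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0)\n",
    "\n",
    "matches = torch.allclose(h, hidden.squeeze(0), atol=1e-6)\n",
    "print(matches)\n",
    "```\n",
    "\n",
    "Only the final hidden state is passed to the linear layer, so the prediction depends on the whole sentence through this chain of updates."
   ]
  },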
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "AYfUD8ULWn6w"
   },
   "outputs": [],
   "source": [
    "input_dim = len(text.vocab)\n",
    "embedding_dim = 100\n",
    "hidden_dim = 256\n",
    "output_dim = 1\n",
    "\n",
    "model = RNN(input_dim, embedding_dim, hidden_dim, output_dim)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 34
    },
    "colab_type": "code",
    "id": "Jp_uCwLoWz53",
    "outputId": "aeceea1b-a91c-433d-b59d-6b6b93f3fe71"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Trainable parameters:  2592105\n"
     ]
    }
   ],
   "source": [
    "def count_parameters(model):\n",
    "    return sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
    "print('Trainable parameters: ', count_parameters(model))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# return accuracy per batch, 8/10 return 0.8\n",
    "def binary_accuracy(preds, y):  \n",
    "    rounded_preds = torch.round(torch.sigmoid(preds))\n",
    "    correct = (rounded_preds == y).float() \n",
    "    acc = correct.sum() / len(correct)\n",
    "    return acc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "EpBIr0BrXlc7"
   },
   "source": [
    "### Define optimizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "zIZeR2PkXUz-"
   },
   "outputs": [],
   "source": [
    "import torch.optim as optim\n",
    "\n",
    "optimizer = optim.SGD(model.parameters(), lr=1e-3)\n",
    "criterion = nn.BCEWithLogitsLoss()\n",
    "model = model.to(device)\n",
    "criterion = criterion.to(device)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Suca6MLkYuCp"
   },
   "outputs": [],
   "source": [
    "def train(model, iterator, optimizer, criterion):\n",
    "    \n",
    "    epoch_loss = 0\n",
    "    epoch_acc = 0\n",
    "    \n",
    "    model.train()\n",
    "    \n",
    "    for batch in iterator:           \n",
    "        optimizer.zero_grad()            \n",
    "        predictions = model(batch.text).squeeze(1)        \n",
    "        loss = criterion(predictions, batch.label)        \n",
    "        acc = binary_accuracy(predictions, batch.label)        \n",
    "        loss.backward()        \n",
    "        optimizer.step()        \n",
    "        epoch_loss += loss.item()\n",
    "        epoch_acc += acc.item()\n",
    "        \n",
    "    return epoch_loss / len(iterator), epoch_acc / len(iterator)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Vu63DL-8Y6xg"
   },
   "outputs": [],
   "source": [
    "def evaluate(model, iterator, criterion):\n",
    "    \n",
    "    epoch_loss = 0\n",
    "    epoch_acc = 0\n",
    "    \n",
    "    model.eval()\n",
    "    \n",
    "    with torch.no_grad():    \n",
    "        for batch in iterator:\n",
    "            predictions = model(batch.text).squeeze(1)            \n",
    "            loss = criterion(predictions, batch.label)            \n",
    "            acc = binary_accuracy(predictions, batch.label)\n",
    "            epoch_loss += loss.item()\n",
    "            epoch_acc += acc.item()\n",
    "        \n",
    "    return epoch_loss / len(iterator), epoch_acc / len(iterator)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "N7lEtjy0ZFVX"
   },
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "def epoch_time(start_time, end_time):\n",
    "    elapsed_time = end_time - start_time\n",
    "    elapsed_mins = int(elapsed_time / 60)\n",
    "    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))\n",
    "    return elapsed_mins, elapsed_secs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 272
    },
    "colab_type": "code",
    "id": "_5s6MDCOZJx4",
    "outputId": "a7d96918-dbad-4652-c3c6-6ba616aa940a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch: 01 | Epoch Time: 0m 14s\n",
      "\tTrain Loss: 0.694 | Train Acc: 50.33%\n",
      "\t Val. Loss: 0.698 |  Val. Acc: 49.19%\n",
      "Epoch: 02 | Epoch Time: 0m 13s\n",
      "\tTrain Loss: 0.693 | Train Acc: 49.62%\n",
      "\t Val. Loss: 0.698 |  Val. Acc: 49.13%\n",
      "Epoch: 03 | Epoch Time: 0m 14s\n",
      "\tTrain Loss: 0.693 | Train Acc: 50.09%\n",
      "\t Val. Loss: 0.698 |  Val. Acc: 50.36%\n",
      "Epoch: 04 | Epoch Time: 0m 14s\n",
      "\tTrain Loss: 0.693 | Train Acc: 49.56%\n",
      "\t Val. Loss: 0.698 |  Val. Acc: 49.02%\n",
      "Epoch: 05 | Epoch Time: 0m 13s\n",
      "\tTrain Loss: 0.693 | Train Acc: 50.00%\n",
      "\t Val. Loss: 0.698 |  Val. Acc: 50.55%\n"
     ]
    }
   ],
   "source": [
    "epochs = 5\n",
    "\n",
    "best_valid_loss = 0\n",
    "for epoch in range(epochs):    \n",
    "    start_time = time.time()     \n",
    "    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)\n",
    "    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)    \n",
    "    end_time = time.time()\n",
    "    epoch_mins, epoch_secs = epoch_time(start_time, end_time)\n",
    "    \n",
    "    if valid_loss < best_valid_loss:\n",
    "        best_valid_loss = valid_loss\n",
    "        #torch.save(model.state_dict(), 'model.pt')\n",
    "    \n",
    "    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')\n",
    "    print(f'\\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')\n",
    "    print(f'\\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 34
    },
    "colab_type": "code",
    "id": "j9WYY3hTl4H_",
    "outputId": "a5c0302d-ab14-46da-f98b-d041ff4d76a8"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Test Loss: 0.715 | Test Acc: 45.90%\n"
     ]
    }
   ],
   "source": [
    "#model.load_state_dict(torch.load('model.pt'))\n",
    "test_loss, test_acc = evaluate(model, test_iterator, criterion)\n",
    "print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "name": "RNN_pytorch.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}