# <center>CS568:Deep Learning</center>  <center>Spring 2020</center> 

## Sentiment Analysis Using LSTM
In this recitation, we will use an LSTM to classify the sentiment of a piece of text using PyTorch.

Compared with the previous notebook, this one makes the following changes:
+ Packed padded sequences
+ Pre-trained word embeddings
+ LSTM cell 
+ Dropout
+ Adam optimizer

### Load dataset
PyTorch uses torchtext to preprocess raw text data. NLP projects typically require these preprocessing steps:

+ Read the data from disk
+ Tokenize the text
+ Create a mapping from word to a unique integer
+ Convert the text into lists of integers
+ Load the data in whatever format your deep learning framework requires
+ Pad the text so that all the sequences are the same length, so you can process them in batch
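The steps above can be sketched in plain Python, without torchtext. This is only an illustration: the toy sentences are made up, whitespace splitting stands in for a real tokenizer, and the choice of indices 0 and 1 for the special tokens is an assumption that mirrors the vocabulary shown later in this notebook.

```python
from collections import Counter

# Toy corpus; the sentences are made up for illustration.
corpus = ["this movie was great", "terrible film", "great film , great cast"]

# 1. Tokenize (whitespace split stands in for a real tokenizer such as spaCy).
tokenized = [sentence.split() for sentence in corpus]

# 2. Map each word to a unique integer, reserving 0 and 1 for special tokens.
counts = Counter(token for sentence in tokenized for token in sentence)
itos = ["<unk>", "<pad>"] + [word for word, _ in counts.most_common()]
stoi = {word: index for index, word in enumerate(itos)}

# 3. Convert the text into lists of integers.
numericalized = [[stoi.get(token, stoi["<unk>"]) for token in sentence]
                 for sentence in tokenized]

# 4. Pad every sequence to the length of the longest one.
lengths = [len(sequence) for sequence in numericalized]
max_len = max(lengths)
padded = [sequence + [stoi["<pad>"]] * (max_len - len(sequence))
          for sequence in numericalized]

print(padded)   # three rows, each of length 5
print(lengths)  # [4, 2, 5]
```

Keeping the original lengths next to the padded batch is what later makes packed padded sequences possible.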

**Torchtext** is a library that makes all of the above processing much easier. 

**[spaCy](https://spacy.io/)** is an NLP library that, among other things, splits sentences in various languages into tokens.

How do we tokenize the text using Torchtext and spaCy?

![alt text](1.png)

Packed padded sequences make the RNN or LSTM process only the non-padded elements of each sequence; for any padded element, the output is a zero tensor.

To use packed padded sequences, we need to record the actual length of each sentence.

Setting include_lengths = True on our TEXT field makes it return a tuple: the first element is the sentence (a numericalized tensor that has been padded) and the second element is the actual length of each sentence.
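A minimal sketch of what packing does to a padded batch, using PyTorch's `pack_padded_sequence` and `pad_packed_sequence`. The tensor values are made up and stand in for already embedded tokens, with 0.0 marking the padded slot.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Toy batch in [sent len, batch size] layout (made-up values).
# Column 0 is a 3-step sequence, column 1 a 2-step sequence padded with 0.0.
padded = torch.tensor([[4.0, 7.0],
                       [5.0, 8.0],
                       [6.0, 0.0]])
lengths = torch.tensor([3, 2])  # true lengths, sorted in decreasing order

packed = pack_padded_sequence(padded, lengths)

# packed.data holds only the real elements, time step by time step.
print(packed.data)         # tensor([4., 7., 5., 8., 6.])
# batch_sizes records how many sequences are still active at each time step.
print(packed.batch_sizes)  # tensor([2, 2, 1])

# Unpacking restores the padded layout, filling padded slots with zeros.
unpacked, unpacked_lengths = pad_packed_sequence(packed)
print(unpacked)
print(unpacked_lengths)    # tensor([3, 2])
```

Because the padded slot never enters `packed.data`, a recurrent layer fed with the packed sequence never sees it.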

In [1]:
import torch
import random
from torchtext import data
seed = 1234

torch.manual_seed(seed)
text = data.Field(tokenize = 'spacy', include_lengths = True)
label = data.LabelField(dtype = torch.float)

Download the IMDb dataset using torchtext. The IMDb dataset consists of 50,000 movie reviews, each labeled as positive or negative. This code automatically downloads the dataset with its train and test splits.

In [2]:
from torchtext import datasets

training_data, testing_data = datasets.IMDB.splits(text, label)
print('Number of training examples: ',len(training_data))
print('Number of testing examples: ', len(testing_data))


Number of training examples:  25000
Number of testing examples:  25000


In [3]:
print(vars(training_data.examples[0]))

{'text': ['As', 'a', 'lover', 'of', 'the', 'surreal', '(', 'in', 'art', 'and', 'film', ')', 'I', 'was', 'pleased', 'to', 'discover', 'this', 'film', 'on', 'IFC', '.', 'It', 'is', 'definitely', 'a', 'keeper', '.', 'Most', 'of', 'the', 'other', 'reviews', 'tell', 'the', 'general', 'plot', '(', 'not', 'all', 'correct', ')', 'so', 'I', 'wo', "n't", 'bother', 'to', 'bore', 'anyone', 'with', 'that', '.', 'The', 'main', 'thing', 'is', 'the', 'alternate', 'worlds', 'concept', 'which', 'is', 'brought', 'on', 'by', 'Ana', "'s", 'impending', 'illness', ',', 'and', 'the', 'way', 'she', 'manages', 'to', 'link', 'with', 'someone', 'else', 'after', 'being', 'so', '"', 'alone', '"', ',', 'and', 'finally', 'with', 'her', 'family', ',', 'which', 'I', 'believe', 'is', 'still', 'at', 'least', 'a', 'little', 'troubled', '.', 'It', 'only', 'can', 'be', 'called', 'a', 'horror', 'movie', 'in', 'that', 'it', 'has', 'frightening', 'scenes', 'but', 'is', 'a', 'fantasy', '(', 'with', 'a', 'little', 'hint', 'of', 

Create a validation set using the .split() method.

In [4]:
# random.seed() returns None, so seed first and pass the generator state instead.
random.seed(seed)
training_data, validation_data = training_data.split(split_ratio = 0.7, random_state = random.getstate())
# split_ratio controls the train/validation split; 0.7 (also the default) keeps 70% for training.
print('Number of training examples: ',len(training_data))
print('Number of validation examples: ',len(validation_data))
print('Number of testing examples: ',len(testing_data))

Number of training examples:  17500
Number of validation examples:  7500
Number of testing examples:  25000
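What a seeded random split does can be sketched in plain Python. This is not torchtext's implementation; `split_indices` is a hypothetical helper that only illustrates why fixing the seed makes the 70/30 split reproducible.

```python
import random

def split_indices(n_examples, split_ratio=0.7, seed=1234):
    """Shuffle indices with a fixed seed and cut them at split_ratio."""
    rng = random.Random(seed)  # private generator, so the global state is untouched
    indices = list(range(n_examples))
    rng.shuffle(indices)
    cut = int(round(split_ratio * n_examples))
    return indices[:cut], indices[cut:]

train_idx, valid_idx = split_indices(25_000)
print(len(train_idx), len(valid_idx))  # 17500 7500

# The same seed always yields the same split.
print(split_indices(25_000) == (train_idx, valid_idx))  # True
```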


**Word Embeddings**
+ learned representation of words
+ similar meaning - similar representation
+ dense
+ low dimensional representation
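To make "similar meaning, similar representation" concrete, here is a small sketch with cosine similarity. The 3-dimensional vectors are hand-made for illustration and are not real learned embeddings.

```python
import math

# Hand-made 3-dimensional vectors, for illustration only (real embeddings are learned).
toy_embeddings = {
    "good":  [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "awful": [-0.7, 0.6, 0.1],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

similar = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
dissimilar = cosine_similarity(toy_embeddings["good"], toy_embeddings["awful"])

print(similar)     # close to 1: the vectors point in nearly the same direction
print(dissimilar)  # negative: the vectors point in roughly opposite directions
```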

**Pre-trained word embeddings**

Use pre-trained embeddings instead of initializing the word embeddings randomly. TorchText handles downloading the vectors and associating them with the correct words in our vocabulary.

We will use the **"glove.6B.100d"** vectors. GloVe is the algorithm used to calculate the vectors (see [here](https://nlp.stanford.edu/projects/glove/)). 6B indicates these vectors were trained on 6 billion tokens and 100d indicates these vectors are 100-dimensional.


In [5]:
vocab_size = 25_000
text.build_vocab(training_data, max_size = vocab_size, vectors = "glove.6B.100d", unk_init = torch.Tensor.normal_)
label.build_vocab(training_data)

print('Unique tokens in TEXT vocabulary: ',len(text.vocab))
print('Unique tokens in LABEL vocabulary: ',len(label.vocab))


Unique tokens in TEXT vocabulary:  25002
Unique tokens in LABEL vocabulary:  2


Why is the vocab size 25002 and not 25000? One of the additional tokens is the < unk > token and the other is the < pad > token.

Torchtext has its own class called Vocab for handling the vocabulary. The Vocab class holds a mapping from word to id in its **stoi** attribute and a reverse mapping in its **itos** attribute. In addition to this, it can automatically build an embedding matrix for you using various pretrained embeddings like word2vec.
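The following plain-Python sketch shows why capping the vocabulary at `max_size` words yields `max_size + 2` entries, and how the two mappings relate. `build_vocab` is a hypothetical helper, not torchtext's Vocab class, and the toy documents are made up.

```python
from collections import Counter, defaultdict

def build_vocab(token_lists, max_size):
    """Keep the max_size most frequent tokens and prepend <unk> and <pad>."""
    freqs = Counter(token for tokens in token_lists for token in tokens)
    itos = ["<unk>", "<pad>"] + [tok for tok, _ in freqs.most_common(max_size)]
    # Unknown words fall back to index 0, the position of <unk>.
    stoi = defaultdict(int, {tok: idx for idx, tok in enumerate(itos)})
    return itos, stoi

docs = [["the", "film", "was", "good"], ["the", "plot", "was", "thin"], ["the", "end"]]
itos, stoi = build_vocab(docs, max_size=3)

print(len(itos))     # 5: three kept words plus the two special tokens
print(itos[:4])      # ['<unk>', '<pad>', 'the', 'was']
print(stoi["the"])   # 2
print(stoi["zebra"]) # 0, because "zebra" was never seen
```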

Print the 5 most common words in the vocab.

In [6]:
print(text.vocab.freqs.most_common(5))

[('the', 202874), (',', 192507), ('.', 166290), ('a', 109355), ('and', 109273)]


To view the vocab, use the stoi (string to int) or itos (int to string) attributes.

In [7]:
print(text.vocab.itos[:10])
print(label.vocab.stoi)

['<unk>', '<pad>', 'the', ',', '.', 'a', 'and', 'of', 'to', 'is']
defaultdict(None, {'neg': 0, 'pos': 1})


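Use **BucketIterator** to return batches of similar-length sentences. Packed padded sequences require the sentences in each batch to be sorted by length, which is why sort_within_batch = True is set.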

In [8]:
batch_size = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (training_data, validation_data, testing_data),  batch_size = batch_size,
    sort_within_batch = True,
    device = device)
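The benefit of batching similar-length sentences can be sketched in plain Python. This is a simplification, not how BucketIterator is implemented: `bucket_batches` and `padding_needed` are hypothetical helpers, and the sentence lengths are made up.

```python
def bucket_batches(lengths, batch_size):
    """Group example indices so each batch holds sequences of similar length."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def padding_needed(lengths, batches):
    """Count the pad tokens required to bring every batch to its longest sequence."""
    return sum(max(lengths[i] for i in batch) - lengths[i]
               for batch in batches for i in batch)

lengths = [12, 3, 11, 4, 10, 5]           # made-up sentence lengths
naive = [[0, 1], [2, 3], [4, 5]]          # batches in arrival order
bucketed = bucket_batches(lengths, batch_size=2)

print(padding_needed(lengths, naive))     # 21
print(padding_needed(lengths, bucketed))  # 7
```

Less padding means less wasted computation per batch, and sorting in decreasing length inside each batch matches what packing expects by default.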

### Define model

The LSTM looks like:

![alt text](3.png)

The LSTM returns the output and a tuple of the final hidden state and the final cell state, whereas the RNN returns the output and the final hidden state only.

To use packed padded sequences, pack the embedded batch with nn.utils.rnn.pack_padded_sequence so that the LSTM processes only the non-padded elements of each sequence. The LSTM then returns packed_output (a packed sequence) as well as the hidden and cell state tensors.
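A small sketch of the return values and their shapes, using random tensors in place of real embedded text. The sizes are arbitrary assumptions chosen to keep the output readable.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
sent_len, batch_size, emb_dim, hid_dim = 4, 2, 3, 5

embedded = torch.randn(sent_len, batch_size, emb_dim)  # stand-in for embedded text
lengths = torch.tensor([4, 2])                         # second sequence has 2 padded steps

lstm = nn.LSTM(emb_dim, hid_dim)
rnn = nn.RNN(emb_dim, hid_dim)

packed = pack_padded_sequence(embedded, lengths)

packed_output, (hidden, cell) = lstm(packed)  # LSTM: output plus a (hidden, cell) tuple
_, rnn_hidden = rnn(packed)                   # RNN: output plus the hidden state only

output, output_lengths = pad_packed_sequence(packed_output)

print(output.shape)      # torch.Size([4, 2, 5]) -> [sent len, batch size, hid dim]
print(hidden.shape)      # torch.Size([1, 2, 5]) -> [num layers, batch size, hid dim]
print(cell.shape)        # torch.Size([1, 2, 5])
print(rnn_hidden.shape)  # torch.Size([1, 2, 5])

# Outputs at the padded positions of the shorter sequence are all zeros.
print(output[2:, 1])
# The final hidden state is the output at each sequence's last real time step.
print(torch.allclose(hidden[0, 1], output[1, 1]))  # True
```

Note the leading dimension of size 1 on the hidden state: it has to be indexed or squeezed away before the state is passed to a linear layer that expects [batch size, hid dim].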



In [9]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, dropout, pad_idx):        
        super().__init__()        
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)        
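        # nn.LSTM applies its dropout argument only between stacked layers,
        # so it is omitted for this single-layer LSTM (passing it only triggers a warning)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim)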
        self.fc = nn.Linear(hidden_dim, output_dim)        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text, text_lengths):        
        #text = [sent len, batch size]        
        embedded = self.dropout(self.embedding(text))        
        #embedded = [sent len, batch size, emb dim]       
        
        #pack sequence (the lengths tensor must live on the CPU in recent PyTorch versions)
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu())
        packed_output, (hidden, cell) = self.rnn(packed_embedded)

        #unpack sequence
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)

        #output = [sent len, batch size, hid dim]
        #output over padding tokens are zero tensors
        #hidden = [num layers, batch size, hid dim]
        #cell = [num layers, batch size, hid dim]

        #keep the final hidden state of the last (here: only) layer
        hidden = self.dropout(hidden[-1])
        #hidden = [batch size, hid dim]
        out = self.fc(hidden)
        #out = [batch size, output dim]
        return out

To ensure the pre-trained vectors can be loaded into the model, embedding_dim must equal the dimensionality of the pre-trained GloVe vectors loaded earlier (100).

In [10]:
input_dim = len(text.vocab)
embedding_dim = 100
hidden_dim = 256
output_dim = 1
dropout = 0.5
pad_idx = text.vocab.stoi[text.pad_token] # indices of pad token

model = RNN(input_dim, embedding_dim, hidden_dim, output_dim, dropout, pad_idx)

In [11]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 2,867,049 trainable parameters


Replace the initial weights of the embedding layer with the pre-trained embeddings.

In [12]:
pretrained_embeddings = text.vocab.vectors
print("pretrained embedding shape:",pretrained_embeddings.shape)
model.embedding.weight.data.copy_(pretrained_embeddings)

pretrained embedding shape: torch.Size([25002, 100])


tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.2647, -0.2753, -0.1325],
        [-0.8555, -0.7208,  1.3755,  ...,  0.0825, -1.1314,  0.3997],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.4979, -0.8359, -0.3487,  ...,  0.3010,  0.8162, -0.2722],
        [-0.5121,  0.2059,  0.5721,  ..., -0.7374, -0.2360,  0.3210],
        [ 0.9147,  0.2111, -0.0431,  ...,  0.3362,  0.3323,  0.0240]])

Set the rows of the embedding matrix that correspond to the < unk > and < pad > tokens to zeros.

In [13]:
unk_idx = text.vocab.stoi[text.unk_token]
model.embedding.weight.data[unk_idx] = torch.zeros(embedding_dim)
model.embedding.weight.data[pad_idx] = torch.zeros(embedding_dim)
print(model.embedding.weight.data)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.4979, -0.8359, -0.3487,  ...,  0.3010,  0.8162, -0.2722],
        [-0.5121,  0.2059,  0.5721,  ..., -0.7374, -0.2360,  0.3210],
        [ 0.9147,  0.2111, -0.0431,  ...,  0.3362,  0.3323,  0.0240]])
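The two special rows behave differently during training, which a toy embedding layer can illustrate. The sizes and the random `pretrained` matrix are assumptions standing in for the real vocabulary and GloVe vectors.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embedding_dim = 6, 4
unk_idx, pad_idx = 0, 1

# Random stand-in for text.vocab.vectors (hypothetical values, not real GloVe vectors).
pretrained = torch.randn(vocab_size, embedding_dim)

embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
embedding.weight.data.copy_(pretrained)  # shapes must match exactly
embedding.weight.data[unk_idx] = torch.zeros(embedding_dim)
embedding.weight.data[pad_idx] = torch.zeros(embedding_dim)

# One backward pass over a batch that contains both special tokens.
tokens = torch.tensor([[unk_idx, 3, pad_idx]])
embedding(tokens).sum().backward()

print(embedding.weight.grad[pad_idx])  # all zeros: the pad row receives no gradient
print(embedding.weight.grad[unk_idx])  # non-zero: the unk row is still trainable
```

Because of padding_idx, the pad row stays at zero throughout training, while the unk row starts at zero but is free to be learned.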


In [14]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()
model = model.to(device)
criterion = criterion.to(device)

In [15]:
def binary_accuracy(preds, y):
    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc
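A worked example of the same computation in plain Python, with made-up logits and labels. The model outputs raw logits (BCEWithLogitsLoss applies the sigmoid internally), so the accuracy function has to apply the sigmoid itself before rounding.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

logits = [2.3, -0.7, 0.4, -1.9]  # made-up raw model outputs
labels = [1.0, 0.0, 0.0, 0.0]    # made-up targets

probabilities = [sigmoid(z) for z in logits]
predictions = [float(round(p)) for p in probabilities]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

print(predictions)  # [1.0, 0.0, 1.0, 0.0]
print(accuracy)     # 0.75
```

Rounding the sigmoid output is equivalent to checking the sign of the logit: a positive logit maps to class 1 and a negative one to class 0.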

batch.text is now a tuple with the first element being the numericalized tensor and the second element being the actual lengths of each sequence. 

In [16]:
def train(model, iterator, optimizer, criterion):    
    epoch_loss = 0
    epoch_acc = 0    
    model.train()    
    for batch in iterator:        
        optimizer.zero_grad()        
        text, text_lengths = batch.text        
        predictions = model(text, text_lengths).squeeze()        
        loss = criterion(predictions, batch.label)        
        acc = binary_accuracy(predictions, batch.label)        
        loss.backward()        
        optimizer.step()        
        epoch_loss += loss.item()
        epoch_acc += acc.item()        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [17]:
def evaluate(model, iterator, criterion):    
    epoch_loss = 0
    epoch_acc = 0    
    model.eval()    
    with torch.no_grad():   
        for batch in iterator:
            text, text_lengths = batch.text            
            predictions = model(text, text_lengths).squeeze()            
            loss = criterion(predictions, batch.label)            
            acc = binary_accuracy(predictions, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [18]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [19]:
num_epochs = 5
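# start at infinity so that the first epoch always counts as an improvement;
# starting at 0.0 would never save a checkpoint, since the loss is never below 0
best_valid_loss = float('inf')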
for epoch in range(num_epochs):
    start_time = time.time()    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)    
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'sentiment-analyis-RNN-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 11s
	Train Loss: 0.689 | Train Acc: 55.08%
	 Val. Loss: 0.672 |  Val. Acc: 59.33%
Epoch: 02 | Epoch Time: 0m 10s
	Train Loss: 0.662 | Train Acc: 60.52%
	 Val. Loss: 0.623 |  Val. Acc: 66.60%
Epoch: 03 | Epoch Time: 0m 10s
	Train Loss: 0.571 | Train Acc: 71.95%
	 Val. Loss: 0.500 |  Val. Acc: 78.57%
Epoch: 04 | Epoch Time: 0m 11s
	Train Loss: 0.494 | Train Acc: 77.18%
	 Val. Loss: 0.539 |  Val. Acc: 75.42%
Epoch: 05 | Epoch Time: 0m 10s
	Train Loss: 0.377 | Train Acc: 83.82%
	 Val. Loss: 0.303 |  Val. Acc: 87.79%
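The checkpointing rule in the loop can be replayed on the validation losses from this log. This sketch assumes the running best starts at infinity, so that the first epoch always counts as an improvement; it only records which epochs would trigger a save.

```python
# Validation losses copied from the training log above, one per epoch.
valid_losses = [0.672, 0.623, 0.500, 0.539, 0.303]

best_valid_loss = float("inf")  # assumption: start above any achievable loss
saved_epochs = []

for epoch, valid_loss in enumerate(valid_losses, start=1):
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        saved_epochs.append(epoch)  # the real loop would call torch.save here

print(saved_epochs)     # [1, 2, 3, 5]
print(best_valid_loss)  # 0.303
```

Epoch 4 is skipped because its validation loss (0.539) is worse than the best seen so far (0.500).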


In [20]:
test_loss, test_acc = evaluate(model, test_iterator, criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.320 | Test Acc: 86.46%
