### **Reinforcement Learning in PyTorch: A Tutorial from Scratch**

This tutorial introduces reinforcement learning (RL) using PyTorch. We'll focus on Q-Learning with Deep Q-Networks (DQN) to teach an agent how to navigate a simple environment. For demonstration, we'll use the `CartPole-v1` environment from OpenAI's Gym.

---

### **What is Reinforcement Learning?**
Reinforcement learning is a framework in which an agent interacts with an environment and learns a policy \( \pi(s) \), a mapping from states (\(s\)) to actions (\(a\)), that maximizes the expected cumulative reward.

Key terms:
- **State**: The current representation of the environment.
- **Action**: The agent's decision.
- **Reward**: Feedback from the environment.
- **Policy**: Strategy for choosing actions.
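
"Cumulative reward" is usually made precise as a discounted return, where later rewards count for less. The sketch below is only an illustration: the helper name `discounted_return` and the example rewards are assumptions, and `gamma` plays the role of the discount factor used later in the training function.

```python
def discounted_return(rewards, gamma=0.99):
    """Return G = r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one episode."""
    g = 0.0
    # Accumulate from the last step backwards so each step is one multiply-add.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# CartPole gives a reward of 1 per step, so a 3-step episode looks like this.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

A `gamma` close to 1 makes the agent care about long-term outcomes, while a small `gamma` makes it short-sighted.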

---

### **Step-by-Step Implementation**
#### **1. Setup**
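Install the required libraries. Gym 0.26 changed the return values of `env.reset()` and `env.step()`, so pin the version you are developing against:
```bash
pip install "gym>=0.26" torch matplotlib
```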



#### **2. Imports**

```python
import gym
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np
from collections import deque
import random
import matplotlib.pyplot as plt
```


---

#### **3. Create the Neural Network**
The DQN approximates the Q-value function \( Q(s, a) \).
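
For reference, the function being approximated is usually characterised by the Bellman optimality equation. The symbols \(r\) (reward), \(s'\) (next state), \(\gamma\) (discount factor), \(d\) (1 if the episode terminated, else 0), \(\theta\) (online network weights) and \(\theta^-\) (target network weights) are notation introduced here for illustration:

\[
Q^*(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^*(s', a') \right]
\]

A DQN with weights \(\theta\) is then typically fit by minimising the squared one-step error against a slowly updated target network:

\[
L(\theta) = \mathbb{E}\left[ \left( r + \gamma (1 - d) \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a) \right)^2 \right]
\]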



```python
class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        # Two hidden layers of 128 units, one output per action
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, action_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)  # Q-values for all actions
        return x
```



---

#### **4. Replay Buffer**
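Experience replay stores past transitions so that each one can be reused in several updates, which improves sample efficiency. Sampling uniformly from the buffer also breaks the correlation between consecutive transitions.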

```python
class ReplayBuffer:
    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once it is full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sample without replacement, regrouped field by field
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        # Cast done flags to float so they can be used in arithmetic later
        return (np.array(states), np.array(actions), np.array(rewards),
                np.array(next_states), np.array(dones, dtype=np.float32))

    def __len__(self):
        return len(self.buffer)
```
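
To see the two behaviours the buffer relies on (eviction of old entries and regrouping a sampled batch by field), here is a minimal standalone sketch. It uses a bare `deque` with a made-up capacity of 3 and placeholder transitions instead of real environment data.

```python
import random
from collections import deque

# Hypothetical tiny buffer: transitions are (state, action, reward, next_state, done).
buffer = deque(maxlen=3)
for step in range(5):
    buffer.append((step, 0, 1.0, step + 1, False))

# Only the 3 most recent transitions survive; steps 0 and 1 were evicted.
stored_states = [t[0] for t in buffer]
print(stored_states)  # [2, 3, 4]

# Uniform sampling without replacement, then regrouping by field with zip(*batch).
random.seed(0)
batch = random.sample(list(buffer), 2)
states, actions, rewards, next_states, dones = zip(*batch)
print(states, rewards)
```

With a capacity of 10000, as in the training loop, the agent always learns from its most recent 10000 transitions.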


---

#### **5. Epsilon-Greedy Policy**
With probability \( \epsilon \) the agent takes a random action (exploration); otherwise it takes the action with the highest predicted Q-value (exploitation).

```python
def epsilon_greedy_policy(state, epsilon, model, action_dim):
    if random.random() < epsilon:
        return random.randint(0, action_dim - 1)  # Explore
    # Add a batch dimension: the network expects shape (batch, state_dim)
    state = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        q_values = model(state)
    return q_values.argmax(dim=1).item()  # Exploit
```
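
The same selection rule can be checked without a network. The sketch below is a standalone illustration: the helper `epsilon_greedy` and the Q-values are made up, and it only measures how often the greedy action is chosen.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q-values (illustrative helper)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    # exploit: index of the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q_values = [0.2, 1.5]  # made-up Q-values for CartPole's two actions

# Fraction of greedy picks (action 1) over many draws at epsilon = 0.1.
picks = [epsilon_greedy(q_values, 0.1, rng) for _ in range(10_000)]
greedy_fraction = picks.count(1) / len(picks)
print(f"{greedy_fraction:.3f}")  # close to 1 - epsilon / 2 = 0.95
```

The greedy action is picked slightly more often than `1 - epsilon` because a random draw can also land on it.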

---

#### **6. Train the Agent**
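The training loop collects transitions with the epsilon-greedy policy, samples mini-batches from the replay buffer, and regresses the online network towards one-step targets computed with a target network. The target network is synced with the online network every 10 episodes.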

```python
def train_dqn(env, num_episodes=500, batch_size=64, gamma=0.99,
              epsilon_decay=0.995, min_epsilon=0.01):
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    # Initialize the online network and a target network with identical weights
    dqn = DQN(state_dim, action_dim)
    target_dqn = DQN(state_dim, action_dim)
    target_dqn.load_state_dict(dqn.state_dict())
    optimizer = optim.Adam(dqn.parameters(), lr=0.001)

    replay_buffer = ReplayBuffer(capacity=10000)

    epsilon = 1.0
    rewards_per_episode = []

    for episode in range(num_episodes):
        # Gym >= 0.26: reset() returns (observation, info)
        state, _ = env.reset()
        episode_reward = 0

        while True:
            # Choose an action using the epsilon-greedy policy
            action = epsilon_greedy_policy(state, epsilon, dqn, action_dim)
            # Gym >= 0.26: step() returns separate terminated/truncated flags
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode_reward += reward

            # Store the transition; only true termination should stop bootstrapping
            replay_buffer.push(state, action, reward, next_state, terminated)
            state = next_state

            # Train the model once enough data is available
            if len(replay_buffer) >= batch_size:
                states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size)

                states = torch.as_tensor(states, dtype=torch.float32)
                actions = torch.as_tensor(actions, dtype=torch.int64).unsqueeze(1)
                rewards = torch.as_tensor(rewards, dtype=torch.float32)
                next_states = torch.as_tensor(next_states, dtype=torch.float32)
                dones = torch.as_tensor(dones, dtype=torch.float32)

                # Compute target Q-values with the frozen target network
                with torch.no_grad():
                    next_q = target_dqn(next_states).max(1)[0]
                    target_q_values = rewards + gamma * (1 - dones) * next_q

                # Q-values of the actions that were actually taken
                current_q_values = dqn(states).gather(1, actions).squeeze(1)

                # Loss: MSE between predicted and target Q-values
                loss = F.mse_loss(current_q_values, target_q_values)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            if done:
                break

        # Update the target network every 10 episodes
        if episode % 10 == 0:
            target_dqn.load_state_dict(dqn.state_dict())

        # Decay epsilon
        epsilon = max(min_epsilon, epsilon * epsilon_decay)

        rewards_per_episode.append(episode_reward)
        print(f"Episode {episode}, Reward: {episode_reward}, Epsilon: {epsilon:.2f}")

    return rewards_per_episode, dqn
```
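
The target computation inside the loop is easier to follow on a single transition with plain numbers. This is a minimal sketch: the helper `td_target` and all values are made up for illustration, and it mirrors only the `rewards + gamma * (1 - dones) * max(...)` expression.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step target: r + gamma * max Q(s', a'), with no bootstrap at terminal states."""
    return reward + gamma * (1.0 - float(done)) * max(next_q_values)


# Made-up numbers for a single transition.
non_terminal = td_target(1.0, [2.0, 3.0], done=False)  # 1 + 0.99 * 3.0, about 3.97
terminal = td_target(1.0, [2.0, 3.0], done=True)       # bootstrap term is zeroed: 1.0
squared_error = (non_terminal - 3.5) ** 2              # loss contribution if Q(s, a) = 3.5
print(non_terminal, terminal)
```

The `(1 - done)` factor is what stops the agent from crediting a terminal state with future reward it can never collect.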

---

#### **7. Visualize Training**

```python
# Train the model
env = gym.make('CartPole-v1')
rewards, trained_dqn = train_dqn(env)

# Plot rewards
plt.plot(rewards)
plt.xlabel("Episode")
plt.ylabel("Reward")
plt.title("Training Rewards")
plt.show()
```
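
Per-episode rewards in DQN training tend to be noisy, so a smoothed curve is often easier to read. The helper below is a hedged sketch rather than part of the original tutorial: `moving_average`, its `window` parameter, and the example numbers are assumptions.

```python
def moving_average(values, window=20):
    """Trailing mean over the last `window` values (shorter at the start)."""
    smoothed = []
    running_sum = 0.0
    for i, v in enumerate(values):
        running_sum += v
        if i >= window:
            running_sum -= values[i - window]  # drop the value that left the window
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


# Made-up episode rewards standing in for the list returned by training.
example_rewards = [10, 20, 30, 40, 50]
print(moving_average(example_rewards, window=2))  # [10.0, 15.0, 25.0, 35.0, 45.0]
```

Passing the smoothed list to a second `plt.plot` call overlays the trend on the raw curve.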


---

### **Results**
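Run the code to observe:
- A reward curve that trends upward over episodes, although DQN training is noisy and individual runs vary.
- The agent balancing the pole for longer, up to the 500-step cap that `CartPole-v1` places on an episode.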

---

### **Exercises**
1. Experiment with different network architectures.
2. Change hyperparameters like learning rate or replay buffer size.
3. Test the trained model in the environment using a greedy policy.
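
For the hyperparameter exercise, it can help to check how fast epsilon shrinks before changing it. The sketch below assumes the multiplicative decay rule and the default values from the training function; the helper names `epsilon_after` and `episodes_to_floor` are hypothetical.

```python
import math


def epsilon_after(episodes, start=1.0, decay=0.995, floor=0.01):
    """Epsilon after `episodes` multiplicative decay steps, clipped at `floor`."""
    return max(floor, start * decay ** episodes)


def episodes_to_floor(start=1.0, decay=0.995, floor=0.01):
    """Smallest number of decay steps after which epsilon has reached `floor`."""
    return math.ceil(math.log(floor / start) / math.log(decay))


print(f"{epsilon_after(500):.3f}")  # about 0.082 with the tutorial's defaults
print(episodes_to_floor())          # 919
```

With the default 500 episodes, epsilon therefore never reaches `min_epsilon`, so a faster decay or a longer run are both reasonable things to try.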

---

This tutorial introduces RL with a simple DQN implementation in PyTorch, laying the foundation for more complex algorithms like Double DQN or Policy Gradient methods.