{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h1 style=\"text-align:center;\">Clustering via DBSCAN Algorithm</h1>\n",
    "<p style=\"text-align:center;\">\n",
    "Nazar Khan\n",
    "<br>CVML Lab\n",
    "<br>University of The Punjab\n",
    "</p>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This tutorial explains the **Density-Based Spatial Clustering of Applications with Noise (DBSCAN)** algorithm. We’ll implement it from scratch and demonstrate its behavior visually to make the concepts clear.\n",
    "\n",
    "---\n",
    "\n",
    "### **1. What is DBSCAN?**\n",
    "\n",
    "**DBSCAN** is a clustering algorithm that groups points based on their density. Unlike K-means, it does not require the number of clusters beforehand.\n",
    "\n",
    "Key concepts:\n",
    "- **Core Points**: Points with at least $minPts$ neighbors within a given distance $\\varepsilon$.\n",
    "- **Border Points**: Points that are not core points but fall within the $\\varepsilon$-neighborhood of a core point.\n",
    "- **Noise Points**: Points that are neither core nor border points.\n",
    "\n",
    "---\n",
    "\n",
    "### **2. Install Required Libraries**\n",
    "Install necessary libraries for plotting and computation:\n",
    "```bash\n",
    "pip install numpy matplotlib\n",
    "```\n",
    "\n",
    "---\n",
    "\n",
    "### **3. Imports**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from collections import deque\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### **4. Visual Helper Functions**\n",
    "We’ll write helper functions to visualize points, clusters, and steps of DBSCAN."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_points(data, labels=None, title=\"DBSCAN Clustering\", core_samples=None):\n",
    "    plt.figure(figsize=(8, 6))\n",
    "    unique_labels = set(labels) if labels is not None else [-1]\n",
    "    \n",
    "    colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))\n",
    "    \n",
    "    for label, color in zip(unique_labels, colors):\n",
    "        if label == -1:\n",
    "            color = [0, 0, 0, 0.5]  # Black for noise\n",
    "        \n",
    "        # Select points with the current label\n",
    "        points = data[labels == label]\n",
    "        plt.scatter(points[:, 0], points[:, 1], c=[color], s=50, label=f\"Cluster {label}\" if label != -1 else \"Noise\")\n",
    "    \n",
    "    if core_samples is not None:\n",
    "        plt.scatter(core_samples[:, 0], core_samples[:, 1], c='yellow', edgecolors='k', s=100, label=\"Core Points\")\n",
    "    \n",
    "    plt.title(title)\n",
    "    plt.legend()\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### **5. Distance Calculation**\n",
    "We need a function to compute the Euclidean distance between two points."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def euclidean_distance(p1, p2):\n",
    "    return np.sqrt(np.sum((p1 - p2) ** 2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### **6. DBSCAN Implementation**\n",
    "\n",
    "Here’s the full implementation of DBSCAN."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def dbscan(data, eps, min_pts):\n",
    "    \"\"\"\n",
    "    DBSCAN implementation.\n",
    "    Parameters:\n",
    "        data: ndarray - Input data points (NxM).\n",
    "        eps: float - Radius of neighborhood.\n",
    "        min_pts: int - Minimum number of points to form a dense region.\n",
    "    Returns:\n",
    "        labels: list - Cluster labels for each point (-1 means noise).\n",
    "    \"\"\"\n",
    "    labels = np.full(len(data), -1)  # Initialize all points as noise (-1)\n",
    "    cluster_id = 0  # Start cluster IDs\n",
    "    \n",
    "    def region_query(point_idx):\n",
    "        \"\"\"Find all neighbors of a point within radius eps.\"\"\"\n",
    "        neighbors = []\n",
    "        for i in range(len(data)):\n",
    "            if euclidean_distance(data[point_idx], data[i]) <= eps:\n",
    "                neighbors.append(i)\n",
    "        return neighbors\n",
    "    \n",
    "    def expand_cluster(point_idx, neighbors):\n",
    "        \"\"\"Expand cluster for a core point.\"\"\"\n",
    "        labels[point_idx] = cluster_id\n",
    "        queue = deque(neighbors)  # Use a queue to process all neighbors\n",
    "        \n",
    "        while queue:\n",
    "            neighbor_idx = queue.popleft()\n",
    "            \n",
    "            if labels[neighbor_idx] == -1:  # Previously labeled as noise, now a border point\n",
    "                labels[neighbor_idx] = cluster_id\n",
    "                \n",
    "            if labels[neighbor_idx] == -1 or labels[neighbor_idx] == 0:\n",
    "                labels[neighbor_idx] = cluster_id\n",
    "                new_neighbors = region_query(neighbor_idx)\n",
    "                if len(new_neighbors) >= min_pts:  # If it's a core point, expand further\n",
    "                    queue.extend(new_neighbors)\n",
    "    \n",
    "    for i in range(len(data)):\n",
    "        if labels[i] != -1:  # Skip already processed points\n",
    "            continue\n",
    "        \n",
    "        neighbors = region_query(i)  # Find neighbors\n",
    "        if len(neighbors) < min_pts:\n",
    "            labels[i] = -1  # Noise\n",
    "        else:\n",
    "            cluster_id += 1\n",
    "            expand_cluster(i, neighbors)\n",
    "    \n",
    "    return labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### **7. Visual Demonstration**\n",
    "\n",
    "We’ll generate synthetic 2D data to demonstrate DBSCAN.\n",
    "\n",
    "#### Generate Data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import make_moons, make_blobs\n",
    "\n",
    "# Generate synthetic data: moons + random noise\n",
    "data, _ = make_moons(n_samples=300, noise=0.07, random_state=42)\n",
    "plt.scatter(data[:, 0], data[:, 1], s=50)\n",
    "plt.title(\"Raw Data\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "#### Run DBSCAN and Visualize:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# DBSCAN parameters\n",
    "eps = 0.2\n",
    "min_pts = 5\n",
    "\n",
    "# Run DBSCAN\n",
    "labels = dbscan(data, eps, min_pts)\n",
    "\n",
    "# Find core points (optional, for visualization)\n",
    "unique_labels = set(labels)\n",
    "core_samples = np.array([data[i] for i in range(len(data)) if len([j for j in range(len(data)) if euclidean_distance(data[i], data[j]) <= eps]) >= min_pts])\n",
    "\n",
    "# Visualize clusters\n",
    "plot_points(data, labels=np.array(labels), title=\"DBSCAN Clustering Results\", core_samples=core_samples)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### **8. Step-by-Step Understanding with Visualization**\n",
    "\n",
    "To show **intermediate results**, adjust `plot_points` calls inside the `expand_cluster` function to visualize:\n",
    "- Points being processed.\n",
    "- Clusters expanding step-by-step.\n",
    "\n",
    "---\n",
    "\n",
    "### **9. Key Parameters**\n",
    "- **$\\varepsilon$ (eps)**: Radius of the neighborhood.\n",
    "- **$minPts$**: Minimum points required to form a dense region.\n",
    "\n",
    "---\n",
    "\n",
    "### **10. Advantages of DBSCAN**\n",
    "1. Does not require the number of clusters $K$ as input.\n",
    "2. Handles arbitrary cluster shapes.\n",
    "3. Identifies noise points.\n",
    "\n",
    "### **11. Disadvantages of DBSCAN**\n",
    "1. Performance may suffer in high dimensions.\n",
    "2. Sensitive to $\\varepsilon$ and $minPts$.\n",
    "\n",
    "---\n",
    "\n",
    "### **Output**\n",
    "1. **Raw Data Visualization**:\n",
    "   - Randomly scattered points (e.g., moons).\n",
    "2. **Clustered Output**:\n",
    "   - Points assigned to clusters.\n",
    "   - Noise points marked in black.\n",
    "\n",
    "---\n",
    "\n",
    "### **Exercises**\n",
    "1. Experiment with $\\varepsilon$ and $minPts$ values to observe changes in clustering.\n",
    "2. Use other synthetic datasets like `make_blobs`.\n",
    "3. Visualize step-by-step expansion of clusters for better understanding.\n",
    "\n",
    "This tutorial provides an intuitive understanding of DBSCAN with clear code and visual examples."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "cvml",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.12.7"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}