{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "

Clustering via DBSCAN Algorithm

\n", "

\n", "Nazar Khan\n", "
CVML Lab\n", "
University of The Punjab\n", "

" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial explains the **Density-Based Spatial Clustering of Applications with Noise (DBSCAN)** algorithm. We’ll implement it from scratch and demonstrate its behavior visually to make the concepts clear.\n", "\n", "---\n", "\n", "### **1. What is DBSCAN?**\n", "\n", "**DBSCAN** is a clustering algorithm that groups points based on their density. Unlike K-means, it does not require the number of clusters beforehand.\n", "\n", "Key concepts:\n", "- **Core Points**: Points with at least $minPts$ neighbors within a given distance $\\varepsilon$.\n", "- **Border Points**: Points that are not core points but fall within the $\\varepsilon$-neighborhood of a core point.\n", "- **Noise Points**: Points that are neither core nor border points.\n", "\n", "---\n", "\n", "### **2. Install Required Libraries**\n", "Install necessary libraries for plotting and computation:\n", "```bash\n", "pip install numpy matplotlib\n", "```\n", "\n", "---\n", "\n", "### **3. Imports**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from collections import deque\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "### **4. Visual Helper Functions**\n", "We’ll write helper functions to visualize points, clusters, and steps of DBSCAN." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_points(data, labels=None, title=\"DBSCAN Clustering\", core_samples=None):\n", " plt.figure(figsize=(8, 6))\n", " unique_labels = set(labels) if labels is not None else [-1]\n", " \n", " colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))\n", " \n", " for label, color in zip(unique_labels, colors):\n", " if label == -1:\n", " color = [0, 0, 0, 0.5] # Black for noise\n", " \n", " # Select points with the current label\n", " points = data[labels == label]\n", " plt.scatter(points[:, 0], points[:, 1], c=[color], s=50, label=f\"Cluster {label}\" if label != -1 else \"Noise\")\n", " \n", " if core_samples is not None:\n", " plt.scatter(core_samples[:, 0], core_samples[:, 1], c='yellow', edgecolors='k', s=100, label=\"Core Points\")\n", " \n", " plt.title(title)\n", " plt.legend()\n", " plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "### **5. Distance Calculation**\n", "We need a function to compute the Euclidean distance between two points." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def euclidean_distance(p1, p2):\n", " return np.sqrt(np.sum((p1 - p2) ** 2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "### **6. DBSCAN Implementation**\n", "\n", "Here’s the full implementation of DBSCAN." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def dbscan(data, eps, min_pts):\n", " \"\"\"\n", " DBSCAN implementation.\n", " Parameters:\n", " data: ndarray - Input data points (NxM).\n", " eps: float - Radius of neighborhood.\n", " min_pts: int - Minimum number of points to form a dense region.\n", " Returns:\n", " labels: list - Cluster labels for each point (-1 means noise).\n", " \"\"\"\n", " labels = np.full(len(data), -1) # Initialize all points as noise (-1)\n", " cluster_id = 0 # Start cluster IDs\n", " \n", " def region_query(point_idx):\n", " \"\"\"Find all neighbors of a point within radius eps.\"\"\"\n", " neighbors = []\n", " for i in range(len(data)):\n", " if euclidean_distance(data[point_idx], data[i]) <= eps:\n", " neighbors.append(i)\n", " return neighbors\n", " \n", " def expand_cluster(point_idx, neighbors):\n", " \"\"\"Expand cluster for a core point.\"\"\"\n", " labels[point_idx] = cluster_id\n", " queue = deque(neighbors) # Use a queue to process all neighbors\n", " \n", " while queue:\n", " neighbor_idx = queue.popleft()\n", " \n", " if labels[neighbor_idx] == -1: # Previously labeled as noise, now a border point\n", " labels[neighbor_idx] = cluster_id\n", " \n", " if labels[neighbor_idx] == -1 or labels[neighbor_idx] == 0:\n", " labels[neighbor_idx] = cluster_id\n", " new_neighbors = region_query(neighbor_idx)\n", " if len(new_neighbors) >= min_pts: # If it's a core point, expand further\n", " queue.extend(new_neighbors)\n", " \n", " for i in range(len(data)):\n", " if labels[i] != -1: # Skip already processed points\n", " continue\n", " \n", " neighbors = region_query(i) # Find neighbors\n", " if len(neighbors) < min_pts:\n", " labels[i] = -1 # Noise\n", " else:\n", " cluster_id += 1\n", " expand_cluster(i, neighbors)\n", " \n", " return labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "### **7. Visual Demonstration**\n", "\n", "We’ll generate synthetic 2D data to demonstrate DBSCAN.\n", "\n", "#### Generate Data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import make_moons, make_blobs\n", "\n", "# Generate synthetic data: moons + random noise\n", "data, _ = make_moons(n_samples=300, noise=0.07, random_state=42)\n", "plt.scatter(data[:, 0], data[:, 1], s=50)\n", "plt.title(\"Raw Data\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "#### Run DBSCAN and Visualize:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# DBSCAN parameters\n", "eps = 0.2\n", "min_pts = 5\n", "\n", "# Run DBSCAN\n", "labels = dbscan(data, eps, min_pts)\n", "\n", "# Find core points (optional, for visualization)\n", "unique_labels = set(labels)\n", "core_samples = np.array([data[i] for i in range(len(data)) if len([j for j in range(len(data)) if euclidean_distance(data[i], data[j]) <= eps]) >= min_pts])\n", "\n", "# Visualize clusters\n", "plot_points(data, labels=np.array(labels), title=\"DBSCAN Clustering Results\", core_samples=core_samples)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "### **8. Step-by-Step Understanding with Visualization**\n", "\n", "To show **intermediate results**, adjust `plot_points` calls inside the `expand_cluster` function to visualize:\n", "- Points being processed.\n", "- Clusters expanding step-by-step.\n", "\n", "---\n", "\n", "### **9. Key Parameters**\n", "- **$\\varepsilon$ (eps)**: Radius of the neighborhood.\n", "- **$minPts$**: Minimum points required to form a dense region.\n", "\n", "---\n", "\n", "### **10. Advantages of DBSCAN**\n", "1. Does not require the number of clusters $K$ as input.\n", "2. Handles arbitrary cluster shapes.\n", "3. Identifies noise points.\n", "\n", "### **11. Disadvantages of DBSCAN**\n", "1. Performance may suffer in high dimensions.\n", "2. Sensitive to $\\varepsilon$ and $minPts$.\n", "\n", "---\n", "\n", "### **Output**\n", "1. **Raw Data Visualization**:\n", " - Randomly scattered points (e.g., moons).\n", "2. **Clustered Output**:\n", " - Points assigned to clusters.\n", " - Noise points marked in black.\n", "\n", "---\n", "\n", "### **Exercises**\n", "1. Experiment with $\\varepsilon$ and $minPts$ values to observe changes in clustering.\n", "2. Use other synthetic datasets like `make_blobs`.\n", "3. Visualize step-by-step expansion of clusters for better understanding.\n", "\n", "This tutorial provides an intuitive understanding of DBSCAN with clear code and visual examples." ] } ], "metadata": { "kernelspec": { "display_name": "cvml", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.12.7" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }