{ "cells": [ { "cell_type": "markdown", "id": "b2588024", "metadata": {}, "source": [ "

Machine Learning: Linear Regression

\n", "

\n", "Nazar Khan\n", "
CVML Lab\n", "
University of The Punjab\n", "

" ] }, { "attachments": {}, "cell_type": "markdown", "id": "37bdfa41", "metadata": {}, "source": [ "This is a tutorial on linear regression using synthetic data from a sinusoidal curve. We will use three different types of regression models:\n", "1. Linear Regression\n", "2. Polynomial Regression, and\n", "3. Ridge Regression.\n", "\n", "Each regression model uses a different form of the loss function.\n", "\n", "This Python notebook will use common libraries like numPy, matplotlib, and scikit-learn. It will demonstrate:\n", "\n", "- Generating data with noise from a sinusoidal function.\n", "- Linear regression on noisy data.\n", "- Overfitting using higher-degree polynomial features.\n", "- Generalization by increasing the amount of data.\n", "- Regularization using Ridge regression to avoid overfitting." ] }, { "attachments": {}, "cell_type": "markdown", "id": "d8997a28", "metadata": {}, "source": [ "### Step 1: Generate Sinusoidal Data" ] }, { "cell_type": "code", "execution_count": null, "id": "8ef206aa", "metadata": {}, "outputs": [], "source": [ "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from sklearn.preprocessing import PolynomialFeatures\n", "from sklearn.linear_model import LinearRegression, Ridge\n", "from sklearn.metrics import mean_squared_error\n", "\n", "# Set random seed for reproducibility\n", "np.random.seed(42)\n", "\n", "# Generate true sine wave\n", "X_true = np.linspace(-np.pi, np.pi, 100).reshape(-1, 1)\n", "y_true = np.sin(X_true)\n", "\n", "# Add noise to the data for training\n", "def generate_noisy_sinusoidal_data(n_points, noise_std):\n", " X = np.linspace(-np.pi, np.pi, n_points).reshape(-1, 1)\n", " y = np.sin(X) + np.random.normal(0, noise_std, size=X.shape)\n", " return X, y\n", "\n", "# Training data (10 noisy points)\n", "X_train, y_train = generate_noisy_sinusoidal_data(10, noise_std=0.2)\n", "\n", "# Testing data (90 noisy points)\n", "X_test, y_test = generate_noisy_sinusoidal_data(90, noise_std=0.2)\n", "\n", "# Plot the true curve, training data, and testing data\n", "plt.figure(figsize=(10, 6))\n", "plt.plot(X_true, y_true, label='True Sine Wave', color='blue')\n", "plt.scatter(X_train, y_train, color='red', label='Training Data (10 points)')\n", "plt.scatter(X_test, y_test, color='green', label='Testing Data (90 points)', alpha=0.5)\n", "plt.legend()\n", "plt.title('Training and Testing Data with Noise')\n", "plt.show()\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "c77bf915", "metadata": {}, "source": [ "### Step 2: Linear Regression and Overfitting Example\n", "\n", "Linear regression tries to minimize the Mean Squared Error (MSE) between the predicted values $\\hat{y} and the actual values $y$.\n", "\n", "Loss Function:\n", "\n", "$\\text{MSE} = \\frac{1}{n} \\sum_{i=1}^{n} \\left( y_i - \\hat{y}_i \\right)^2$\n", "\n", "where:\n", "\n", "- $y_i$ are the true target values.\n", "- $\\hat{y}_i$ are the predicted values.\n", "- $n$ is the number of data points.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "79b2bac8", "metadata": {}, "outputs": [], "source": [ "# Perform linear regression (degree=1)\n", "poly_features = PolynomialFeatures(degree=1)\n", "X_train_poly = poly_features.fit_transform(X_train)\n", "X_test_poly = poly_features.fit_transform(X_test)\n", "X_true_poly = poly_features.fit_transform(X_true)\n", "\n", "linear_regressor = LinearRegression()\n", "linear_regressor.fit(X_train_poly, y_train)\n", "\n", "# Predict for both train and test data\n", "y_train_pred = linear_regressor.predict(X_train_poly)\n", "y_test_pred = linear_regressor.predict(X_test_poly)\n", "y_true_pred = linear_regressor.predict(X_true_poly)\n", "\n", "# Plot the true curve, training points, and linear fit\n", "plt.figure(figsize=(10, 6))\n", "plt.plot(X_true, y_true, label='True Sine Wave', color='blue')\n", "plt.scatter(X_train, y_train, color='red', label='Training Data (10 points)')\n", "plt.plot(X_true, y_true_pred, label='Linear Fit (Degree 1)', color='black')\n", "plt.legend()\n", "plt.title('Linear Regression (Degree 1)')\n", "plt.show()\n", "\n", "# Evaluate the error on both train and test sets\n", "train_error = mean_squared_error(y_train, y_train_pred)\n", "test_error = mean_squared_error(y_test, y_test_pred)\n", "print(f\"Linear Regression (Degree 1) - Training Error: {train_error:.4f}, Test Error: {test_error:.4f}\")\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "a923f350", "metadata": {}, "source": [ "### Step 3: Overfitting using Higher Degrees\n", "\n", "Polynomial regression is essentially linear regression with polynomial features. The model still minimizes the MSE, but on a transformed feature space (polynomial basis).\n", "\n", "Loss Function:\n", "\n", "$\\text{MSE} = \\frac{1}{n} \\sum_{i=1}^{n} \\left( y_i - \\hat{y}_i \\right)^2$\n", " \n", "The difference is that $\\hat{y}_i$ comes from a polynomial model (e.g., degree 9 in the example). The MSE is still calculated the same way, but the model is more complex." ] }, { "cell_type": "code", "execution_count": null, "id": "82a1c989", "metadata": {}, "outputs": [], "source": [ "# Try polynomial regression with a higher degree (degree=9)\n", "poly_features = PolynomialFeatures(degree=9)\n", "X_train_poly = poly_features.fit_transform(X_train)\n", "X_test_poly = poly_features.fit_transform(X_test)\n", "X_true_poly = poly_features.fit_transform(X_true)\n", "\n", "poly_regressor = LinearRegression()\n", "poly_regressor.fit(X_train_poly, y_train)\n", "\n", "# Predict for both train and test data\n", "y_train_pred_poly = poly_regressor.predict(X_train_poly)\n", "y_test_pred_poly = poly_regressor.predict(X_test_poly)\n", "y_true_pred_poly = poly_regressor.predict(X_true_poly)\n", "\n", "# Plot the true curve, training points, and high-degree polynomial fit\n", "plt.figure(figsize=(10, 6))\n", "plt.plot(X_true, y_true, label='True Sine Wave', color='blue')\n", "plt.scatter(X_train, y_train, color='red', label='Training Data (10 points)')\n", "plt.plot(X_true, y_true_pred_poly, label='Polynomial Fit (Degree 9)', color='black')\n", "plt.legend()\n", "plt.title('Overfitting: Polynomial Regression (Degree 9)')\n", "plt.show()\n", "\n", "# Evaluate the error on both train and test sets\n", "train_error_poly = mean_squared_error(y_train, y_train_pred_poly)\n", "test_error_poly = mean_squared_error(y_test, y_test_pred_poly)\n", "print(f\"Polynomial Regression (Degree 9) - Training Error: {train_error_poly:.4f}, Test Error: {test_error_poly:.4f}\")\n", "print(\"Learned parameters:\\n\", poly_regressor.coef_)\n", "print(\"Magnitude of learned parameters vector: \", np.linalg.norm(poly_regressor.coef_))\n" ] }, { "cell_type": "markdown", "id": "79d92e01", "metadata": {}, "source": [ "### Step 4: Generalization by Using More Data" ] }, { "cell_type": "code", "execution_count": null, "id": "47c18330", "metadata": {}, "outputs": [], "source": [ "# Use both training and testing data (100 points) to prevent overfitting\n", "X_all = np.concatenate([X_train, X_test])\n", "y_all = np.concatenate([y_train, y_test])\n", "\n", "poly_features = PolynomialFeatures(degree=9)\n", "X_all_poly = poly_features.fit_transform(X_all)\n", "X_true_poly = poly_features.fit_transform(X_true)\n", "\n", "poly_regressor.fit(X_all_poly, y_all)\n", "\n", "# Predict for the full dataset\n", "y_true_pred_all = poly_regressor.predict(X_true_poly)\n", "\n", "# Plot the true curve and new polynomial fit (using 100 points)\n", "plt.figure(figsize=(10, 6))\n", "plt.plot(X_true, y_true, label='True Sine Wave', color='blue')\n", "plt.scatter(X_all, y_all, color='orange', label='All Data (100 points)')\n", "plt.plot(X_true, y_true_pred_all, label='Polynomial Fit (Degree 9, All Data)', color='black')\n", "plt.legend()\n", "plt.title('Generalization with More Data: Polynomial Regression (Degree 9)')\n", "plt.show()\n", "\n", "# Evaluate the error on the larger dataset\n", "train_error_all = mean_squared_error(y_all, poly_regressor.predict(X_all_poly))\n", "print(f\"Polynomial Regression (Degree 9, All Data) - Error: {train_error_all:.4f}\")\n", "print(\"Learned parameters:\\n\", poly_regressor.coef_)\n", "print(\"Magnitude of learned parameters vector: \", np.linalg.norm(poly_regressor.coef_))" ] }, { "attachments": {}, "cell_type": "markdown", "id": "9217ac87", "metadata": {}, "source": [ "### Step 5: Generalization Using Ridge Regularization\n", "\n", "Ridge regression modifies the linear (or polynomial) regression loss by adding a regularization term to penalize large weights. This helps prevent overfitting.\n", "\n", "Loss Function:\n", "\n", "$\\text{Ridge Loss} = \\frac{1}{n} \\sum_{i=1}^{n} \\left( y_i - \\hat{y}_i \\right)^2 + \\alpha \\sum_{j=1}^{p} w_j^2$\n", " \n", "where:\n", "\n", "- $\\frac{1}{n} \\sum_{i=1}^{n} \\left( y_i - \\hat{y}_i \\right)^2$ is the MSE.\n", "- $\\sum_{j=1}^{p} w_j^2$ is the L2 regularization term.\n", "- $\\alpha$ is a regularization hyperparameter controlling the strength of the penalty.\n", "- $w_j$ are the model parameters (weights).\n", "\n", "The regularization term penalizes large values of weights, effectively discouraging overfitting by smoothing the model's parameters.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "c6e06e18", "metadata": {}, "outputs": [], "source": [ "# Ridge regression with regularization (L2 penalty)\n", "ridge_regressor = Ridge(alpha=1.0)\n", "ridge_regressor.fit(X_train_poly, y_train)\n", "\n", "# Predict for train, test, and true data\n", "y_train_pred_ridge = ridge_regressor.predict(X_train_poly)\n", "y_test_pred_ridge = ridge_regressor.predict(X_test_poly)\n", "y_true_pred_ridge = ridge_regressor.predict(X_true_poly)\n", "\n", "# Plot the true curve, training points, and Ridge regression fit\n", "plt.figure(figsize=(10, 6))\n", "plt.plot(X_true, y_true, label='True Sine Wave', color='blue')\n", "plt.scatter(X_train, y_train, color='red', label='Training Data (10 points)')\n", "plt.plot(X_true, y_true_pred_ridge, label='Ridge Fit (Degree 9)', color='black')\n", "plt.legend()\n", "plt.title('Ridge Regularization: Polynomial Regression (Degree 9)')\n", "plt.show()\n", "\n", "# Evaluate the error with Ridge regularization\n", "train_error_ridge = mean_squared_error(y_train, y_train_pred_ridge)\n", "test_error_ridge = mean_squared_error(y_test, y_test_pred_ridge)\n", "print(f\"Ridge Regression (Degree 9) - Training Error: {train_error_ridge:.4f}, Test Error: {test_error_ridge:.4f}\")\n", "print(\"Learned parameters:\\n\", ridge_regressor.coef_)\n", "print(\"Magnitude of learned parameters vector: \", np.linalg.norm(ridge_regressor.coef_))\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "47801f22", "metadata": {}, "source": [ "### Summary of Loss Functions\n", "- Linear Regression: Minimizes the Mean Squared Error (MSE).\n", "- Polynomial Regression: Same as linear regression but applied to polynomial features.\n", "- Ridge Regression: Minimizes MSE with an additional L2 regularization term to prevent overfitting." ] }, { "cell_type": "markdown", "id": "7b40c1cb", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "dl_pt", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 5 }