Machine Learning (CS 567)
Fall 2017
Dr. Nazar Khan
The ability of biological brains to sense, perceive, analyse and recognise patterns can only be described as stunning. Furthermore, they have the ability to learn from new examples. Mankind's understanding of exactly how biological brains operate is embarrassingly limited.
However, there do exist numerous 'practical' techniques that give machines the 'appearance' of being intelligent. This is the domain of Statistical Pattern Recognition and Machine Learning. Instead of attempting to mimic the complex workings of a biological brain, this course aims to explain mathematically well-founded and empirically successful techniques for analysing patterns and learning from them.
Accordingly, this course is a mathematically involved introduction to the field of pattern recognition and machine learning. It will prepare students for further study/research in Pattern Recognition, Machine Learning, Computer Vision, Data Analysis and other areas that attempt to solve Artificial Intelligence (AI)-type problems.
Passing this course is necessary for students planning to undertake research with Dr. Nazar Khan.
Course Outline
Prerequisites:
The course is designed to be self-contained, so the required mathematical details will be covered in the lectures. However, this is a math-heavy course. Students are encouraged to brush up on their knowledge of
 calculus (differentiation, partial derivatives)
 linear algebra (vectors, matrices, dot product, orthogonality, eigenvectors, SVD)
 probability and statistics
Students should know that the only way to benefit from this course is to be prepared to spend many hours reading the textbook and attempting its exercises, preferably alone or with a class fellow.
Text:
 (Required) Pattern Recognition and Machine Learning by Christopher M. Bishop (2006)
 (Recommended) Pattern Classification by Duda, Hart and Stork (2001)
Lectures:
Monday, 8:15 am – 9:40 am, Al Khwarizmi Lecture Theater
Wednesday, 8:15 am – 9:40 am, Al Khwarizmi Lecture Theater
Office Hours:
Monday, 10:00 am – 12:00 pm
Teaching Assistant:
Programming Environment: MATLAB
 MATLAB Resources (by Aykut Erdem)
Grading:
Assignments: 20%
Quizzes: 5%
Mid-Term: 35%
Final: 40%
 In determining the course grade, graduate students will be evaluated more rigorously.
 Theoretical assignments have to be submitted before the lecture on the due date.
 There will be no makeup for any missed quiz.
 Makeup for a midterm or final exam will be allowed only under exceptional circumstances, provided that the instructor has been notified beforehand.
 The instructor reserves the right to deny requests for any makeup quiz or exam.
 The worst quiz score will be dropped.
 The worst assignment score will be dropped.
Assignments:
Grades:
Grading sheet (Accessible only through your PUCIT email account)
Content:
 Lecture 1: Introduction
 Lectures 2 and 3: Curve Fitting
 Curve Fitting (Overfitting vs. Generalization)
 Regularized Curve Fitting
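A minimal MATLAB sketch of regularized polynomial curve fitting in the spirit of these lectures; the sample size, polynomial order and regularization weight below are illustrative, not values used in class:

    % Fit an M-th order polynomial to noisy samples of sin(2*pi*x)
    N = 10; M = 9; lambda = 1e-3;                  % sample count, order, regularizer (illustrative)
    x = linspace(0, 1, N)';                        % inputs
    t = sin(2*pi*x) + 0.1*randn(N, 1);             % noisy targets
    Phi = x .^ (0:M);                              % N x (M+1) design matrix (implicit expansion, R2016b+)
    w = (Phi'*Phi + lambda*eye(M+1)) \ (Phi'*t);   % regularized least-squares weights
    xg = linspace(0, 1, 100)';
    plot(x, t, 'o', xg, (xg .^ (0:M)) * w, '-');   % data vs. fitted curve

Setting lambda = 0 reproduces the unregularized fit, which overfits badly for M = 9 and N = 10; increasing lambda trades training error for generalization.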
 Lecture 4: Probability
 Frequentist view
 Sum rule
 Product rule
 Bayes' theorem
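The sum and product rules combine into Bayes' theorem, p(A|B) = p(B|A)p(A)/p(B). A toy MATLAB computation with made-up diagnostic-test numbers:

    prior = 0.01;                             % p(disease)
    sens  = 0.95;                             % p(positive | disease)
    fpr   = 0.05;                             % p(positive | healthy)
    evidence  = sens*prior + fpr*(1 - prior)  % p(positive), by the sum and product rules
    posterior = sens*prior / evidence         % p(disease | positive), roughly 0.16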
 Lecture 5: Maximum Likelihood (ML) Estimation
 Gaussian Distribution
 Fitting a Gaussian Distribution to Data
 Probabilistic Curve Fitting
 Likelihood
 Log-Likelihood
 Maximum Likelihood (ML) Estimation
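A minimal MATLAB sketch of fitting a Gaussian by maximum likelihood; the ML estimates are the sample mean and the (biased) sample variance:

    x = 5 + 2*randn(1000, 1);              % synthetic data, true mu = 5, sigma = 2
    mu_ml     = mean(x);                   % ML estimate of the mean
    sigma2_ml = mean((x - mu_ml).^2);      % ML estimate of the variance (divides by N, not N-1)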
 Lecture 6: Maximum A Posteriori (MAP) Estimation
 Posterior ∝ Likelihood * Prior
 Bayesian Curve Fitting
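For curve fitting with a Gaussian noise model of precision beta and a zero-mean Gaussian prior of precision alpha on the weights, MAP estimation reduces to regularized least squares with lambda = alpha/beta. A sketch with illustrative alpha and beta values:

    x = linspace(0, 1, 10)'; t = sin(2*pi*x) + 0.1*randn(10, 1);
    Phi = x .^ (0:9);                             % polynomial design matrix, as above
    alpha = 5e-3; beta = 11.1;                    % assumed prior and noise precisions
    lambda = alpha / beta;                        % effective regularization weight
    w_map = (Phi'*Phi + lambda*eye(size(Phi,2))) \ (Phi'*t);  % MAP = regularized least squares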
 Lecture 7: Optimization
 Model Selection (Cross Validation)
 Calculus of variations
 Lagrange Multipliers
 Lecture 8: Decision Theory
 Minimising the number of misclassifications
 Minimising the expected loss
 Benefits of knowing posterior distributions
 Generative models vs. discriminative models vs. discriminant functions
 Loss functions for regression problems
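Minimising expected loss, sketched in MATLAB with a made-up loss matrix and posterior: the optimal decision need not be the most probable class.

    L = [0 100; 1 0];             % L(k,j): loss for deciding class j when the truth is class k
    post = [0.3; 0.7];            % posterior p(C_k | x)
    expected_loss = L' * post;    % expected loss of each possible decision
    [~, decision] = min(expected_loss)   % decides class 1 despite its lower posterior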
 Lecture 9: Information Theory
 Information h(x) = -log p(x): less probable events convey more information
 Entropy = expected information (measure of uncertainty)
 Maximum Entropy Discrete Distribution (Uniform)
 Maximum Entropy Continuous Distribution (Gaussian)
 Jensen's Inequality
 Relative Entropy (KL divergence)
 Mutual Information
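A small MATLAB illustration of entropy and KL divergence for discrete distributions (the probability vectors are arbitrary examples):

    p = [0.5 0.25 0.125 0.125];
    q = [0.25 0.25 0.25 0.25];            % uniform: maximum entropy over 4 outcomes
    H_p   = -sum(p .* log2(p))            % entropy of p = 1.75 bits
    H_q   = -sum(q .* log2(q))            % entropy of q = 2 bits
    KL_pq = sum(p .* log2(p ./ q))        % relative entropy KL(p||q) = 0.25 bits, always >= 0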
 Lecture 10: Probability Distributions and Parametric Density Estimation
 Density Estimation is fundamentally ill-posed
 Parametric Density Estimation
 Probability Distributions
 Bernoulli
 Binomial
 Beta
 Multinomial
 Dirichlet
 Sequential Learning via Conjugate Priors
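Sequential learning with a conjugate prior, sketched in MATLAB for the Beta-Bernoulli pair (the prior hyperparameters and data are invented for illustration): each observation simply increments a hyperparameter, and the posterior is the same whether the data arrive one at a time or in a batch.

    a = 2; b = 2;                      % Beta(a, b) prior on the coin bias mu
    flips = [1 0 1 1 0 1 1 1];         % observed data (1 = heads)
    for x = flips                      % sequential updates
        a = a + x;                     % heads increment a
        b = b + (1 - x);               % tails increment b
    end
    posterior_mean = a / (a + b)       % E[mu | data] = 8/12 for this prior and data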
 Lecture 11: Gaussian Distribution – A Second Look
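Evaluating a multivariate Gaussian density in MATLAB (the mean, covariance and query point are illustrative):

    mu = [0; 0]; Sigma = [2 0.5; 0.5 1];           % mean and covariance
    x  = [1; 1];                                   % query point
    D  = numel(mu);
    d  = x - mu;
    p  = exp(-0.5 * d' * (Sigma \ d)) / sqrt((2*pi)^D * det(Sigma))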
 Lecture 12: Non-Parametric Density Estimation
 Non-Parametric Density Estimation
 Histogram-based
 Kernel estimators
 Nearest neighbours
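A minimal Parzen-window (Gaussian kernel) density estimate in MATLAB; the bandwidth h is the key smoothing parameter and the value below is illustrative:

    x = [randn(100,1); 4 + 0.5*randn(50,1)];        % bimodal sample
    h = 0.3;                                        % too small -> spiky, too large -> oversmoothed
    g = linspace(min(x)-1, max(x)+1, 200);          % evaluation grid
    K = exp(-((g - x)/h).^2 / 2) / sqrt(2*pi);      % kernel at each (data, grid) pair (R2016b+)
    p = mean(K, 1) / h;                             % average kernel, normalised by bandwidth
    plot(g, p);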
 Lecture 13: Linear Models for Regression
 Equivalence of likelihood maximisation (ML) and sum-of-squared-error (SSE) minimisation (Least Squares)
 Design matrix
 Pseudoinverse
 Regularized least-squares estimation
 Linear regression for multivariate targets
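A MATLAB sketch of linear regression with Gaussian basis functions, showing the design matrix, the pseudoinverse solution and its regularized variant (the basis centres, width and lambda are illustrative):

    N = 50; x = rand(N, 1); t = sin(2*pi*x) + 0.1*randn(N, 1);
    c = linspace(0, 1, 9); s = 0.1;                      % basis centres and width
    Phi = [ones(N,1), exp(-(x - c).^2 / (2*s^2))];       % N x 10 design matrix (R2016b+)
    w_ml  = pinv(Phi) * t;                               % pseudoinverse (maximum likelihood) solution
    lambda = 1e-3;
    w_reg = (Phi'*Phi + lambda*eye(size(Phi,2))) \ (Phi'*t);  % regularized least squares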
 Lecture 14: Linear Models for Classification
 Generalized linear model
 Orthogonality of w and decision surface
 Discriminant Functions: model the decision surface directly
 Least Squares Classification: y(x) = f(w'x)
 Fisher's Linear Discriminant: J(w) = (w'*S_B*w) / (w'*S_W*w)
 Perceptron: y(x) = step(w'φ(x))
 Gradient Descent
 Batch
 Sequential
 Stochastic: better avoidance of local minima
 Minibatch
 Probabilistic Generative Models: model the posterior p(C_k|x) via the class-conditional p(x|C_k) and the prior p(C_k)
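A sketch of the perceptron algorithm with sequential updates in MATLAB (the data are synthetic, linearly separable clusters; eta and the epoch count are illustrative):

    X = [randn(20,2) + 2; randn(20,2) - 2];    % two separable clusters
    t = [ones(20,1); -ones(20,1)];             % targets in {-1, +1}
    Phi = [ones(40,1), X];                     % fixed basis: bias feature plus raw inputs
    w = zeros(3,1); eta = 1;
    for epoch = 1:100
        for n = 1:40                           % sequential (pattern-by-pattern) updates
            if sign(Phi(n,:) * w) ~= t(n)      % only misclassified points cause an update
                w = w + eta * t(n) * Phi(n,:)';    % w <- w + eta * phi_n * t_n
            end
        end
    end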
 Lecture 15: Probabilistic Discriminative Models – model the posterior p(C_k|x) directly
 Logistic Sigmoid function and its derivative
 Softmax function and its derivative
 Positive definite matrix
 Logistic Regression via Logistic Sigmoid
 A positive definite Hessian implies convexity, which in turn implies a unique global minimum
 Newton-Raphson updates constitute the Iteratively Reweighted Least Squares (IRLS) algorithm
 Multiclass Logistic Regression via Softmax
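A minimal MATLAB sketch of two-class logistic regression trained by Newton-Raphson (IRLS); the data, iteration count and jitter term are illustrative:

    sigm = @(a) 1 ./ (1 + exp(-a));                 % logistic sigmoid
    X = [randn(20,2) + 1.5; randn(20,2) - 1.5];
    t = [ones(20,1); zeros(20,1)];                  % targets in {0, 1}
    Phi = [ones(40,1), X];
    w = zeros(3,1);
    for iter = 1:10
        y = sigm(Phi * w);                          % current class-1 probabilities
        R = diag(y .* (1 - y));                     % weighting matrix from the Hessian
        H = Phi' * R * Phi + 1e-8*eye(3);           % positive definite Hessian (jitter for safety)
        w = w - H \ (Phi' * (y - t));               % Newton-Raphson / IRLS step
    end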
