Basicly, this course focuses on supervised learning all through the semester. I'm writing this note in order to give myself a deeper understanding of all these stuffs.

Linear Predictors#


\[ h_w(x) = \langle w,x \rangle + b \]

b is bias. We can write simpler if we append a 1 at the end of each $ x $ and append b to the end of $ w $:

\[ h_w(x) = sign(\langle w,x \rangle) \]

Perceptron Algorithm#

It's an iterative algorithm that goes through all samples repeatly until convergence. Everytime it meets a falsely classified data $ x $, update $ w $ as:

\[ w^{t+1} = w^{t} + y_ix_i \]

$ y_i $ is the true lable for $ x_i $.

Linear Regression#

\[ h_w(x) = \langle w,x \rangle \]

Bias included. The output of $ h_w(x) $ is our prediction, it's a real number in most cases.

Loss function#

Usually we use mean squared loss in linear regression problems:

\[ L_S(h_w) = \frac{1}{m}\sum_{i=1}^{m}\left( h_w(x_i) - y_i \right)^2 \]

How to find the $ w $ that has a minimum loss with the previous formula?

  • Least Squares

Set the derivative of $ L_S(h_w) $ on $ w $ to zero:

\[ \frac{\partial{L_S(h_w)}}{\partial{w}} = \frac{2}{m}\sum_{i=1}^{m}\left( \langle w,x \rangle - y_i \right)x_i = 0 \]

Denote $ A $ and $ b $ as:

\[ A = \left( \sum_{i=1}^{m}x_i x_i^T \right) \] \[ b = \left( \sum_{i=1}^{m}y_i x_i \right) \]

We have:

\[ Aw = b \]

If $ A $ is invertible, $ w = A^{-1}b $ If not, $ w = A^{+}b $. So, how to compute $ A^+ $?

  • Gradient Decent

An iterative algorithm pretty like the perceptron algorithm.

\[ w^{'} = w - \alpha \frac{\partial{L_S(h_w)}}{\partial{w}} \]

Polynomial Regression#

Assume it's a $ m $ degree polynomial regression problem:

If we only have one feature:

\[ y = b_0 + b_1x + b_2x^2 + ... + b_mx^m \]

If we have two features:

\[ y = b_0 + b_1x_1 + b_2x_2 + b_3x_1^2 + b_4x_2^2 + b_5x_1x_2 + ... \]

If we have more than two features, I don't want to write it any more!

Logistic Regression#

It's used for classification.

binary class#

\[ h_w(x) = \frac{1}{1+e^{-\langle w,x \rangle}} \]

This is the form of sigmoid function. When $ w,x $ is greater than zero, the output is greater than $ 0.5 $, it means we should classify $ x $ as a positive sample. Otherwise, we classify it negative.

As the output of $ h_w(x) $ is a real number between 0 and 1 (it always represents the probability of $ x_i $ is positve), it's better to use negative log loss to represent to optimization problem.

Here we need to consider two cases:

    1. labels are {0, 1}

\[ L_S(h_w) = -\frac{1}{m}\sum_{i=1}^{m}\left(y_i\log(h_w(x_i)) + (1-y_i)\log(1-h_w(x_i))\right)\]

    1. labels are {-1, 1}

\[ L_S(h_w) = \frac{1}{m}\sum_{i=1}^{m}\log(1+\exp(-y_i\langle w,x \rangle)) \]

We can use gradient decent algorithm to solve the problem. But I don't want to write their derivatives!


\[ h_w(x) = \frac{e^{\langle w,x \rangle}}{\sum_{c=1}^{k}e^{\langle w,x \rangle}} \]

Here $ k $ is the number of classes.

loss function#

\[ L_S(h_w) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{c=1}^{k}1[y_i = c]\log(h_w(x_i)_c) \]

Here $ h_w(x_i)_c $ is the probability of $ x_i $ belonging to category $ c $. $ 1[y_i = c] $ means this part is 1 if $ y_i = c $, otherwise it's 0.

Other models#

Naive Bayes#

It's so simple, do we need to talk about it any more?

The biggest sometime unreasonable presumption when using Naive Bayes is all features are independent.

Support Vector Machines#

Hard-SVM (realizable)#

First, SVM is algorithm that solves binary classification problem. It returns a halfspace. The difference between SVM and the general halfspace is that SVM tries to find a halfspace that seperates samples with the largest possible margin. So what is margin? Margin is the minimum distance between a point in the training set and the hyperplane. Let's talk about the distance between a point and the hyperplane first. Say the hyperplane is represented as:

\[ \langle w,x \rangle + b = 0 \]

Here $ w $ is a vector with length 1, and b is bias.

Now for any point $ x $, the distance from $ x $ to this hyperplane is:

\[ | \langle w,x \rangle + b | = |w||x|\cos(\theta) + b \]

The inner product of $ w $ and $ x $ can be treated as a projection of $ x $ to $ w $.

Then the problem becomes to find a hyperplane that those points with the minimum distance to the hyperplane have the maximum distance.

\[ arg\max_w \min_{i \in m} y_i(\langle w,x_i \rangle + b) \]

It's a quadratic optimization problem now. We have sufficient tool that can solve quadratic problems for us. What we need to do is generate those variables that are needed to be send into the QP function.

TODO: Math problem...

We Cannot Use SGD to Solve Hard-SVM Problem

Soft-SVM (unrealizable)#

The contraints in Hard-SVM is:

\[ y_i(\langle w,x \rangle + b) \ge 1 \]

We introduce $ m $ nonnegative slack variables \[ \xi_1 , ..., \xi_m \]

and relax the contraints as

\[ y_i(\langle w,x \rangle + b) \ge 1 - \xi_i \]

We Can Use SGD to Solve Soft-SVM Problem (minimize hinge loss)


Different definitions of distance between two points.

kernel gallery:

  • Polynomial Kernel

\[ K(x, y) = (c_1 \langle x,y \rangle + c_2)^{c_3} \]

If $ c_3 = 1$, it is a linear kernel.

  • Gaussian Kernel

\[ K(x, y) = \exp(\frac{-||x-y||^2}{c^2}) \]

  • Sigmoind Kernel

\[ K(x, y) = tanh(\alpha x^Ty + c) \]

Decision Trees#

The concept of decision tree is simple, the tricky part is how to build the decision tree. In class we talked about the greedy algorithm which minimize the gain in each step.

  • build tree (ID3)

First, the tree is empty, and we get a set of samples with size $ m $, each sample has $ d $ features. The current working node is root. We are going to build the subtree with the same principles until there is no more samples or no features to split on, or all samples are in the same label.

If we can split on this node, we will check all features to find the one that we have the maximum information gain.

Assume we are doing binary classification. On the current process, data is the set of samples we are going to split, and i is a feature index. If we split on this feature, the information gain is:

\[ gain(p(y_i = 1)) - p(x_i = True) * gain(p(y_i = 1 | x_i = True) + p(x_i = False) * gain(p(y_i = 1 | x_i = False)) \]

$ gain $ is pre-defined gain function such as train error, entropy, gini index.

  • prune tree

We need to prune a decision tree because it may overfit. The rule for deleting a node is whether we can reduce validation error.

The pruning algorithm we implemented does only prune those nodes that both of its children are leaf nodes or one of them is a leaf node and the other is empty.

Other Decision Tree Algorithms#

  • CART
  • C4.5

The difference between these two algorithms and ID3 is how they choose the target attribute to split on. ID3 chooses the biggest information gain, while CART chooses the minimum gini index, and C4.5 chooses information gain ratio as an indicator. Both CART and C4.5 can be used to do a regression problem.

Learning theories#

PAC Learning#

What's PAC Learning? PAC learnability is a property of a hypothesis class on a specific data distribution. We say a hypothesis $ H $ is PAC learnable means if we have $ m $ training samples, it would be sufficient to guarantee that the overall loss is smaller than epislon with probability greater than 1-delta.

So how can we use PAC Learnability to analysis a learning problem?

  • If we have a set of training sample, we can estimate the accuracy with the given hypothesis class and a give confidence value delta. Or we can estimate the confidence of getting the given accuracy in the hypothesis class.
  • If we are given a requirement about accuracy and confidence, we can use PAC learnability to tell us how many samples we need to achieve the goal.

realizable case#

The presumpthion is that we can get a zero loss on this data distribution, and the hypothesis class is finite. Then the relation between epsilob (accuracy), delta (confidence), size of hypothesis class, and sample complexity is:

\[ m_H(\epsilon, \delta) \le \lceil \frac{log(|H|/\delta)}{\epsilon} \rceil \]

uniform convergence#

How can we eveluate the effectiveness of the ERM solution we get from the data set $ S $? If $ S $ is a good representation of distribution $ D $, then the ERM is good. How can we say a sample is representive?

If for any hypothesis $ h $ in $ H $:

\[ |L_S(h) - L_D(h)| \le \epsilon \]

We say data set $ S $ is epsilon-representative. Then if a data set $ S $ is epsilon/2-representative:

\[ L_D(h_S) \le \min_{h\in H}L_D(h) + \epsilon \]

Now we come to the uniform convergence property. It's a property of a hypothesis class which represents the confidence of data set $ S $ is epsilon-representative.

\[ m_H^{UC}(\epsilon, \delta) \le \lceil \frac{log(2|H|/\delta)}{2\epsilon^2} \rceil \]

It has almost the same meaning forms with PAC learnability.

unrealizable case#

Here we do not assume we can find a hypothesis that perfectly works on the data set. We need to consider the representativeness of a data set.

\[ m_H(\epsilon, \delta) \le m_H^{UC}(\epsilon/2, \delta) \le \lceil \frac{log(2|H|/\delta)}{\epsilon^2} \rceil \]


Pretty simple concept.


Suppose we get a ERM solution $ h_S $. We use this hypothesis to test our testing data and get a loss value. The result can be divided into two parts:

\[ L_D(h_S) = \epsilon_{app} + \epsilon_{est} \]

  • approximation error: it's bias because we choose the set of hypothesis. Even if we choose the best hypothesis in $ H $, this error cannot be eliminated.

\[ \epsilon_{app} = \min_{h\in H}L_D(h) \]

How to improve: increase the size of hypothesis class (it's more possible to get a good hypothesis). Or choose some other features for the data set.

  • estimation error: it's normal to get an estimation error. But if it's pretty high, it means our model is overfitting.

How to improve: get more training samples, or reduce the size of hypothesis class.


A mechanism used to prevent overfitting. When we are doing optimization, we do only care about the training loss before. Now we add another component into the optimization function which represents the complexity of hypothesis class.

How to set the value?

  • Tikhonov Regularization

\[ R(w) = \lambda||w||_2^2 \]


We want to use boosting because in regularization we already lambda to zero while the approximation error is still very high. Then we combine several hypotheses together, and consider each vote to determine the final label. We say the single hypothesis used in boosting is weak learner, and the combined hypothesis is strong learner.

  • AdaBoost

We have a set of samples, each sample has an initial weight. Let's denote the initial weight list as $ D^0 $. We have a weak learner, which will return a hypothesis with $ D $ and $ S $. Then we will go several iterations to update $ D $ and get a list of hypothesis. Finally we will use the set of hypotheses to predict testing samples.

The meaning of $ D $ in each iteration is that if we correctly classified the sample, the weight of sample would decrease. Otherwise the weight increases. It makes sure that we will focus more on the falsely classified samples. Each hypothesis also has a weight which is represented by the training error in that iteration. So the final predication is an average sum of all predictions from all hypotheses.

Clustering Algorithms#

K-Nearest Neighbors#

For every sample in the training set, the 1-Nearest Neighbor is the sample itself. So the VC Dimension of 1-Nearest Neighbor Algorithm is infinite.

The algorithm is so simple that I cannot find anything to write.

If we have $ m $ training samples, the time complexity of classifying one test sample is $ O(mlogk) $


First we need to choose k centroids randomly from m samples.

Then iteratively go through all samples, assign the label of its nearest centroid to the sample, and after that, update centroids by averaging all samples in the same cluster.

The stopping condition is if the different between the current centroids and the new centroids is smaller than a constant value.

We have several choices for distance functions:

  • Euclidean Distance
  • Manhattan Distance
  • Chebychev Distance

Expectation Maximization#

The problem that EM algorithm tries to solve is that we get some data samples, assuming they are belonging to k different Gaussians distributions. We do not know the labels of each data, but we want to prediction the means and variances of these Gaussians distributions such that we get the highest probability to get the data samples we have.

First, we should generate random values for all parameters for those distritbutions. Then this algorithm can be divided into two steps:

  • Expectation step:

For each sample, compute the probabilities of the sample belonging to each distributions. We need the prior probabilities $ p(c) $, and then for sample $ x $, the probability of $ x $ belonging to $ c $ is:

\[ p(c\ |\ x) \approx p(c)*p(x\ |\ c) \]

When computing $ p(x | c) $, we are using the parameters of $ c $ in the current iteration. We will update the parameters in M step.

  • Maximization step:

Now we have labeled all samples. We can use the samples in each cluster to compute its mean and variance.

\[ \mu_c = \frac{\sum_{i=1}^{m}x_ip(c\ |\ x_i)}{\sum_{i=1}^{m}p(c\ |\ x_i)} \]

Here $ p(c | x_i) $ is the value from E step after normalization.

\[ \sigma_c^2 = \frac{\sum_{i=1}^{m}p(c\ |\ x_i)(x_i - \mu_c)(x_i - \mu_c)^T}{\sum_{i=1}^{m}p(c\ |\ x_i)}\]

In addition to updating mean and variance, we should also update the prior probabilities:

\[ p(c) = \frac{\sum_{i=1}^{m}p(c\ |\ x_i)}{m}\]

A detailed explanation can be found here:

Principal Component Analysis#

PCA is an algorithm projecting $ d $ dimension samples into a lower dimension subspace with a minimum information loss.

The definition of information loss:

\[ loss = \sum_{i=1}^{m}||x_i - UU^Tx_i||_2^2 \]

Assume we want to reduce the dimension from $ d $ to $ n $, then here $ U $ is a $ d*n $ matrix, and we have $ U^TU = I $.

The problem is to find $ n $ principal components $ u_1 $, ..., $ u_n $

Let's assume we have $ m $ training samples:

  • If $ m > d $

\[ A = X^TX \]

Let $ u_1 $, ..., $ u_n $ be the eigenvectors of A with largest eigenvalues.

  • If $ m <= d $

\[ B = XX^T \]

Let $ v_1 $, ..., $ v_n $ be the eigenvectors of B with largest eigenvalues. For each $ v_i $, we get a corresponding $ u_i $:

\[ u_i = \frac{1}{||X^Tv_i||}X^Tv_i\]