Optimization: Loss Function Under the Hood (Part III)

Continuing this journey, I discussed the loss function and optimization process of linear regression in Part I and of logistic regression in Part II. This time, we are heading to the Support Vector Machine.

Let’s start with Linear SVM, also known as SVM without kernels. Look at the scatter plot of two features x1 and x2 below. We can separate the two classes in many different ways; the pink line and the green line are two of them. SVM ends up choosing the green line as the decision boundary, because SVM classifies samples by finding the decision boundary with the largest margin, that is, the largest distance to the samples closest to it. That’s why Linear SVM is also called a Large Margin Classifier.

[Figure: two classes plotted by features x1 and x2, with the pink and green candidate decision boundaries]

What are the support vectors? A support vector is a sample that is either incorrectly classified or close to the decision boundary. Look at the plot below: the samples with red circles are exactly the support vectors. In SVM, only the support vectors have an effective impact on model training; in other words, removing the non-support vectors has no effect on the model at all. Why? We will figure it out from its cost function.

[Figure: support vectors circled in red, closest to the decision boundary]

The loss function of SVM is very similar to that of Logistic Regression. Looking at the cases y = 1 and y = 0 separately in the plot below, the black line is the cost function of Logistic Regression, and the red line is the cost function of SVM. Please note that the x axis here is the raw model output, θᵀx. Remember that putting the raw model output into the Sigmoid Function gives us Logistic Regression’s hypothesis. What is the hypothesis for SVM? It’s simple and straightforward: when θᵀx ≥ 0, predict 1; otherwise, predict 0.

[Figure: SVM cost (red) vs. Logistic Regression cost (black), plotted against the raw model output θᵀx]

Then back to the loss function plot, a.k.a. the Hinge Loss. When the actual label is 1 (left plot below), if θᵀx ≥ 1, there is no cost at all; if θᵀx < 1, the cost increases as the value of θᵀx decreases. Wait: when θᵀx ≥ 0, we already predict 1, which is the correct prediction. Why does the cost start to increase at 1 instead of 0? Because SVM punishes both incorrect predictions and correct predictions that fall close to the decision boundary (0 < θᵀx < 1); that is exactly what makes those samples support vectors. When a data point sits right on the margin, θᵀx = 1; when a data point falls between the decision boundary and the margin, 0 < θᵀx < 1. I will explain later why some data points appear inside the margin.

As for why removing non-support vectors won’t affect model performance, we can answer that now. Remember that fitting the model means minimizing the cost function. Since non-support vectors incur no cost at all, the total value of the cost function is not changed by adding or removing them.

[Figure: hinge loss for y = 1 (left) and y = 0 (right)]
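To make the shape of those curves concrete, here is a minimal NumPy sketch of the two hinge pieces, with the logistic loss included for comparison (my own illustration, using the same θᵀx-as-input convention as the plots):

```python
import numpy as np

def svm_cost1(z):
    """SVM cost when y = 1: zero once z = theta^T x >= 1, linear below that (hinge)."""
    return np.maximum(0.0, 1.0 - z)

def svm_cost0(z):
    """SVM cost when y = 0: zero once z = theta^T x <= -1, linear above that (hinge)."""
    return np.maximum(0.0, 1.0 + z)

def logistic_cost(z, y):
    """Logistic Regression cost on the same raw output z, for comparison."""
    h = 1.0 / (1.0 + np.exp(-z))          # sigmoid(z)
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

z = 0.5  # raw model output inside the margin (0 < theta^T x < 1)
print(svm_cost1(z))             # 0.5  -> penalized even though predicting 1 is correct
print(logistic_cost(z, y=1))    # ~0.47 -> logistic loss is smooth and never exactly 0
```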

Let’s write the formula for SVM’s cost function:

[Formula: SVM cost function]
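Written out in the hinge-cost notation used throughout this article (my reconstruction of the formula in the image, so treat the exact presentation as an assumption):

$$
J(\theta) = \sum_{i=1}^{m}\Big[\, y^{(i)}\,\mathrm{cost}_1\big(\theta^T x^{(i)}\big) + \big(1 - y^{(i)}\big)\,\mathrm{cost}_0\big(\theta^T x^{(i)}\big)\Big],
$$

where cost₁(z) = max(0, 1 − z) and cost₀(z) = max(0, 1 + z) are the two red hinge curves plotted above.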

We can also add regularization to SVM. For example, adding an L2 regularization term, the cost function changes to:

[Formula: SVM cost function with L2 regularization]
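In the same notation, the regularized version (again my rendering of the standard form, with C multiplying the fit term and an L2 penalty on the coefficients):

$$
J(\theta) = C\sum_{i=1}^{m}\Big[\, y^{(i)}\,\mathrm{cost}_1\big(\theta^T x^{(i)}\big) + \big(1 - y^{(i)}\big)\,\mathrm{cost}_0\big(\theta^T x^{(i)}\big)\Big] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2
$$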

Unlike Logistic Regression, which puts the parameter λ in front of the regularization term to control its weight, SVM puts C in front of the fit term. Intuitively, the fit term pushes the model to fit the training data well by finding optimal coefficients, while the regularization term controls the complexity of the model by penalizing large coefficient values. There is a trade-off between fitting the training dataset well and keeping the model simple enough to avoid overfitting, and it is adjusted by tweaking the value of λ or C. Both λ and C set how much we prioritize the fit term relative to the regularization term; placed on different sides of the cost function, C plays a role similar to 1/λ.

With a very large value of C (similar to no regularization), this large margin classifier becomes very sensitive to outliers. For example, in the left plot below, the ideal decision boundary should look like the green line, but after adding the orange triangle (an outlier), a very big C shifts the decision boundary to the orange line in order to satisfy the large margin rule. On the other hand, C also adjusts the width of the margin, which allows margin violations. See the plot below on the right: when C is small, the margin is wider, shown as the green line, and the pink data points violate the margin. This is especially useful when dealing with a non-separable dataset. So this is how regularization affects the choice of decision boundary and makes the algorithm work for datasets that are not linearly separable, tolerating data points that are misclassified or violate the margin.

[Figure: with a large C, an outlier shifts the decision boundary (left); with a small C, the margin is wider and some points violate it (right)]
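To see this effect of C in code, here is a small Scikit-learn sketch; the data and the C values are made up purely for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Two toy blobs plus one outlier that sits deep inside the other class.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2],    # class 1
               rng.randn(20, 2) - [2, 2],    # class 0
               [[1.5, 1.5]]])                # outlier labeled as class 0
y = np.array([1] * 20 + [0] * 21)

for C in (100.0, 0.01):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # A large C penalizes the misclassified outlier heavily and pulls the boundary
    # toward it; a small C widens the margin and simply tolerates the violation.
    print(C, clf.coef_, clf.intercept_, len(clf.support_))
```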

When the decision boundary is not linear, the structure of the hypothesis and the cost function stays the same. First, let’s take a look:

[Formula: non-linear SVM hypothesis and cost function]

You may have noticed that the non-linear SVM’s hypothesis and cost function are almost the same as the linear SVM’s, except that ‘x’ is replaced by ‘f’ here. f is a function of x, and I will discuss how to find f next. Let’s start from the very beginning. Assume that we have one sample (see the plot below) with two features x1, x2. I randomly put a few points (l⁽¹⁾, l⁽²⁾, l⁽³⁾) around x and call them landmarks. I would like to see how close x is to each of these landmarks, which is denoted as f1 = Similarity(x, l⁽¹⁾) or k(x, l⁽¹⁾), f2 = Similarity(x, l⁽²⁾) or k(x, l⁽²⁾), and f3 = Similarity(x, l⁽³⁾) or k(x, l⁽³⁾).

[Figure: a sample x with features x1, x2 and three landmarks l⁽¹⁾, l⁽²⁾, l⁽³⁾]

This is called a Kernel Function, and it is exactly the ‘f’ you have seen in the formula above. What is inside the Kernel Function? In other words, how should we describe x’s proximity to the landmarks? There are different types, and the Gaussian Kernel is one of the most popular. It is calculated from the Euclidean distance between the two vectors and a parameter σ that describes the smoothness of the function. The Gaussian kernel provides a good intuition: if x ≈ l⁽¹⁾, then f1 ≈ 1; if x is far from l⁽¹⁾, then f1 ≈ 0. In the Scikit-learn SVM package, the Gaussian Kernel corresponds to ‘rbf’, the Radial Basis Function kernel; the only difference is that ‘rbf’ uses γ to represent the Gaussian’s 1/(2σ²).

[Formula: Gaussian kernel]
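A minimal sketch of that kernel in code (the points and σ are arbitrary), including the γ = 1/(2σ²) correspondence with ‘rbf’:

```python
import numpy as np

def gaussian_kernel(x, l, sigma=1.0):
    """f = exp(-||x - l||^2 / (2 * sigma^2)): equals 1 at the landmark, decays to 0 far away."""
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

l1 = np.array([3.0, 2.0])                           # a landmark
print(gaussian_kernel(np.array([3.0, 2.0]), l1))    # 1.0  -> x is exactly at l1
print(gaussian_kernel(np.array([9.0, 8.0]), l1))    # ~0.0 -> x is far from l1

sigma = 1.0
gamma = 1.0 / (2.0 * sigma ** 2)   # this is the gamma that scikit-learn's 'rbf' kernel expects
```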

We can say that the position of sample x has been re-defined by those three kernels. In other words, instead of using x1 and x2 as features, Non-Linear SVM computes new features f1, f2, f3 that depend on the proximity to the landmarks, and those features are determined by the chosen landmarks. This is where the raw model output θᵀf comes from. Let’s try a simple example: θᵀf = θ0 + θ1·f1 + θ2·f2 + θ3·f3. Assign θ0 = -0.5, θ1 = θ2 = 1, θ3 = 0, so θᵀf becomes -0.5 + f1 + f2. Looking at the first sample (S1), which is very close to l⁽¹⁾ and far from l⁽²⁾ and l⁽³⁾, the Gaussian kernel gives f1 ≈ 1, f2 ≈ 0, f3 ≈ 0, so θᵀf ≈ 0.5. According to the hypothesis mentioned before, we predict 1. Sample 2 (S2) is far from all of the landmarks, so f1 = f2 = f3 ≈ 0, θᵀf ≈ -0.5 < 0, and we predict 0. With these θs, it is easy to see that any point near l⁽¹⁾ or l⁽²⁾ will be predicted as 1, and everything else as 0. The green line below shows an approximate decision boundary.

[Figure: approximate decision boundary (green) around l⁽¹⁾ and l⁽²⁾, with samples S1 and S2]
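Here is the same worked example in code; the θ values come from the example above, while the landmark and sample coordinates are hypothetical, chosen only to mimic the plot:

```python
import numpy as np

def gaussian_kernel(x, l, sigma=1.0):
    """Same Gaussian similarity as above: exp(-||x - l||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

# Hypothetical landmark positions and the thetas from the example.
l1, l2, l3 = np.array([1.0, 1.0]), np.array([4.0, 4.0]), np.array([4.0, 0.0])
theta = np.array([-0.5, 1.0, 1.0, 0.0])   # theta0, theta1, theta2, theta3

def predict(x):
    f = np.array([1.0,                     # bias feature for theta0
                  gaussian_kernel(x, l1),
                  gaussian_kernel(x, l2),
                  gaussian_kernel(x, l3)])
    return 1 if theta @ f >= 0 else 0

print(predict(np.array([1.1, 0.9])))   # S1, near l1: theta^T f ~ 0.5  -> predict 1
print(predict(np.array([-3.0, 8.0])))  # S2, far away: theta^T f ~ -0.5 -> predict 0
```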

We have just gone through the prediction part with features and coefficients that I chose manually. So, where do these landmarks come from? How many landmarks do we need? It might surprise you that, given m training samples, the locations of the landmarks are exactly the locations of your m training samples.

[Figure: landmarks placed at the locations of the training samples]

That is to say, Non-Linear SVM recreates the features by comparing each training sample with all the other training samples. Thus the number of features created by the landmarks equals the number of training samples. For a given sample, the updated features are as below:

[Formula: kernel features f1, …, fm computed against the m landmarks]
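In code, that feature construction amounts to comparing every sample against every landmark; a vectorized sketch (my own illustration, with a tiny random training set):

```python
import numpy as np

def gaussian_feature_matrix(X, landmarks, sigma=1.0):
    """F[i, j] = k(X[i], landmarks[j]); with landmarks = X this is an m x m matrix."""
    sq_dists = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2 * sigma ** 2))

X_train = np.random.randn(5, 2)                    # m = 5 samples, 2 raw features x1, x2
F = gaussian_feature_matrix(X_train, X_train)      # landmarks are the training samples themselves
print(F.shape)     # (5, 5): each sample now has m kernel features f1, ..., fm
print(np.diag(F))  # all 1s, since k(x, x) = 1
```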

This idea of recreating features is similar to polynomial regression: to achieve a non-linear effect, we add new features by transforming the existing ones, for example by squaring them. Say you have two features x1 and x2. To build a polynomial regression you might use θ0 + θ1x1 + θ2x2 + θ3x1² + θ4x1²x2, so your features become f1 = x1, f2 = x2, f3 = x1², f4 = x1²x2.
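That mapping from the example above, written as code:

```python
import numpy as np

def polynomial_features(x1, x2):
    """(x1, x2) -> (f1, f2, f3, f4) = (x1, x2, x1^2, x1^2 * x2), as in the example above."""
    return np.array([x1, x2, x1 ** 2, x1 ** 2 * x2])

print(polynomial_features(2.0, 3.0))   # [ 2.  3.  4. 12.]
```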

Let’s rewrite the hypothesis, cost function, and cost function with regularization.

[Formula: non-linear SVM hypothesis, cost function, and regularized cost function with kernel features f]

To achieve good model performance and prevent overfitting, besides picking a proper value for the regularization parameter C, we can also adjust σ² in the Gaussian Kernel to balance bias and variance. Take a certain sample x and a certain landmark l as an example: when σ² is very large, the output of the kernel function f is close to 1; as σ² gets smaller, f moves towards 0. In other words, for a fixed distance between x and l, a big σ² regards them as ‘closer’, giving higher bias and lower variance (underfitting), while a small σ² regards them as ‘further apart’, giving lower bias and higher variance (overfitting).
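A quick numerical check of that intuition, holding the distance between x and l fixed at 2 and varying σ²:

```python
import numpy as np

x, l = np.array([0.0, 0.0]), np.array([2.0, 0.0])   # fixed distance ||x - l|| = 2

for sigma2 in (10.0, 1.0, 0.1):
    f = np.exp(-np.sum((x - l) ** 2) / (2 * sigma2))
    # Large sigma^2: f near 1, the sample looks "close" -> smoother fit, higher bias.
    # Small sigma^2: f near 0, the sample looks "far"   -> wigglier fit, higher variance.
    print(sigma2, round(float(f), 4))
# prints: 10.0 0.8187, 1.0 0.1353, 0.1 0.0
```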

Like Logistic Regression’s, SVM’s cost function is convex as well. The most popular optimization algorithm for SVM is Sequential Minimal Optimization (SMO), implemented in the ‘libsvm’ library, which is what Scikit-learn’s SVM package uses in Python. SMO solves the large quadratic programming (QP) problem by breaking it into a series of small QP problems that can be solved analytically, which avoids much of the time-consuming numerical optimization. The detailed calculations are pretty complicated and involve many numerical computing tricks that make the computation efficient enough to handle very large training datasets.
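In practice you rarely call SMO directly; a minimal Scikit-learn sketch (synthetic data, with C and gamma values picked only for illustration) trains a Gaussian-kernel SVM like this:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, non-linearly separable data: class 1 inside a disk, class 0 outside.
rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (np.sum(X ** 2, axis=1) < 1.0).astype(int)

# kernel='rbf' is the Gaussian kernel; gamma plays the role of 1/(2*sigma^2),
# and C trades off the fit term against the regularization term, as discussed above.
clf = SVC(kernel='rbf', C=1.0, gamma=0.5)
clf.fit(X, y)

print(clf.n_support_)             # number of support vectors per class
print(clf.predict([[0.0, 0.0]]))  # a point at the center should come out as class 1
```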

In summary, if you have a large number of features, Linear SVM or Logistic Regression is probably the better choice. If you have a small number of features (under 1,000) and a training set that is not too large, SVM with a Gaussian Kernel may work well for your data.
