Solving SVM problems

1 Question: How do you find \(\vec{w}\), \(b\), and the α's?

You should take a look at the class notes, for starters: http://ai6034.mit.edu/fall12/images/SVM_and_Boosting.pdf

You can also take a look at the next section to see an example of (part of) a worked problem.

When solving SVM problems, there are some useful equations to keep in mind:

  1. \(\vec{w}\cdot \vec{x} + b = 0\) defines the boundary, and in particular \(\vec{w}\cdot \vec{x} + b \geq 0\) defines the positive side of the boundary.
  2. \(\vec{w}\cdot \vec{x} + b = \pm 1\) defines the positive and negative gutters.
  3. The distance between the gutters (the width of the margin) is \(\text{margin-width} = \frac{2}{||\vec{w}||} = \frac{2}{\sqrt{\vec{w}\cdot \vec{w}}}\)
  4. \(\sum_{\text{training data}} y_i \alpha_i = 0\)
  5. \(\sum_{\text{training data}} y_i \alpha_i \vec{x}_i = \vec{w}\)

The last two equations involve the “supportiveness” parameters \(\alpha_i\). Look at equation (5) and notice what would happen if some \(\alpha_j\) were zero: the corresponding term in the sum would be zero, so that training point wouldn't contribute to \(\vec{w}\) and wouldn't affect where the boundary was drawn. In fact, in this case you could delete \(\vec{x}_j\) altogether and the boundary would still be drawn in the same place.
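Here's a small numeric sketch of that observation (Python with NumPy assumed; the points and \(\alpha\) values below are made up, chosen so that equations (4) and (5) hold):

    import numpy as np

    # Made-up training data: two support vectors sitting on the gutters
    # of the boundary y = 0, plus one positive point far past the gutter.
    X = np.array([[0.0,  1.0],   # positive support vector, on the positive gutter
                  [0.0, -1.0],   # negative support vector, on the negative gutter
                  [2.0,  4.0]])  # positive point far from the boundary
    y = np.array([+1.0, -1.0, +1.0])
    alpha = np.array([0.5, 0.5, 0.0])  # "supportiveness"; zero for the far point

    # Equation (4): the signed supportiveness values balance out.
    assert np.isclose(np.sum(y * alpha), 0.0)

    # Equation (5): w is a weighted sum of the training points.
    w = (y * alpha) @ X
    print(w)  # [0. 1.]

    # The zero-alpha point contributes nothing: deleting it leaves w
    # unchanged, so the boundary would be drawn in the same place without it.
    w_without = (y[:2] * alpha[:2]) @ X[:2]
    assert np.allclose(w, w_without)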

Support vector guideline 1: To see if a point \(\vec{x}_j\) is a support vector, imagine deleting it and see if you would draw a different SVM boundary. If you would draw a different SVM boundary, the point is a support vector (\(\alpha_j > 0\)). If you would draw the same boundary, even if the point were deleted, the point isn't a support vector (\(\alpha_j = 0\)).

Support vector guideline 2:

  • Only points on the gutter can be support vectors; if a point isn't on the gutter, it's not a support vector.
  • If a point is on the gutter, it might or might not be a support vector—you can determine whether it is using Support vector guideline 1.

Support vector guideline 3: You may find it useful to think of the \(\alpha_i\) as measuring “supportiveness”. This means, for example, that:

  • \(\alpha_i\) is zero for non-support vectors, i.e. training points that do not determine the boundary, and which would not affect the placement of the boundary if deleted.
  • If you compare two separate SVM problems, one whose support vectors are far from the boundary and one whose support vectors are very close to it, the support vectors in the second problem will have comparatively higher \(\alpha_i\) values.

1.1 Can α be zero for support vectors?

No. A point's \(\alpha\) is zero if and only if it's not a support vector. You can see this from equation (5): only the points with nonzero \(\alpha\)'s contribute to \(\vec{w}\), so a point with zero \(\alpha\) has no effect on where the boundary is drawn.

1.2 Are all points on the gutter support vectors? How do you tell whether a point is a support vector?

Not all points on the gutter are support vectors. To tell if a training point is a support vector, imagine what would happen if you deleted it — would the decision boundary be different? If it would, the point is a support vector. If it wouldn't, the point isn't a support vector.
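If you'd like to check these guidelines numerically, one option (scikit-learn, an outside library that isn't part of these notes) is to train a linear SVC on made-up data and inspect its support_ and dual_coef_ attributes, which hold the support-vector indices and the products \(y_i \alpha_i\). A sketch, using a very large C to approximate the hard-margin SVM discussed here:

    import numpy as np
    from sklearn.svm import SVC

    # Made-up data. The widest margin runs between (0, 1) and (0, -1),
    # giving the boundary y = 0 with gutters at y = +1 and y = -1.
    X = np.array([[0.0,  1.0],   # on the positive gutter: a support vector
                  [1.0,  1.0],   # on the gutter too, but not needed to pin the boundary
                  [0.0, -1.0],   # on the negative gutter: a support vector
                  [2.0,  4.0]])  # far from the boundary: alpha = 0
    y = np.array([+1, +1, -1, +1])

    # A very large C approximates the hard-margin SVM from class.
    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    print(clf.support_)               # indices of points with alpha > 0
    for i, ya in zip(clf.support_, clf.dual_coef_[0]):
        print(X[i], "y_i * alpha_i =", ya)
    print(clf.coef_, clf.intercept_)  # w and b, roughly [0, 1] and 0
    print(2 / np.linalg.norm(clf.coef_))  # margin width, roughly 2

If all goes well, the solver reports only (0, 1) and (0, -1) as support vectors: the point (1, 1) lies on the gutter, yet deleting it wouldn't move the boundary, so it gets \(\alpha = 0\). That is exactly guideline 2 at work.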

2 2011 Q4 Part A

The equation for the line you drew in part A is \(y = 0\), and points will be classified as positive if \(y\geq 0\).

Now in general, the equation for the decision boundary is \(\vec{w}\cdot \vec{x} + b = 0\), and points \(\vec{x}\) are classified as positive if \(\vec{w}\cdot \vec{x} + b \geq 0\). To solve for \(\vec{w}\) and \(b\), you manipulate the inequality for the boundary into this form:

$$y\geq 0$$ $$0x + 1y + 0 \geq 0$$ $$\begin{bmatrix}0&1\end{bmatrix}\cdot \begin{bmatrix}x\\y\end{bmatrix} + 0 \geq 0$$

And so we have:

$$\underbrace{\begin{bmatrix}0&1\end{bmatrix}}_{\vec{w}}\cdot \underbrace{\begin{bmatrix}x\\y\end{bmatrix}}_{\vec{x}} + \underbrace{0}_{b} \geq 0$$

…which is almost right. The trouble is that we can multiply this inequality by any positive constant \(c>0\) and it will still be true, so the right answer might not be the \(\vec{w}\) and \(b\) we found above, but instead some multiple of them. To fix this, we multiply the inequality by \(c>0\) and solve for \(c\):

$$c\cdot (0x + 1y + 0) \geq c\cdot0$$ $$\begin{bmatrix}0&c\end{bmatrix}\cdot \begin{bmatrix}x\\y\end{bmatrix} + 0 \geq 0$$

One formula that you can use to solve for \(c\) is the margin-width formula; by examining the lines you drew, you see that the margin width is 4. Now, in general, the equation for margin width is

$$\text{margin-width} = \frac{2}{||\vec{w}||} = \frac{2}{\sqrt{\vec{w}\cdot \vec{w}}}$$

So, for the second time, we use the technique of setting the quantity you can see (margin width = 4) equal to the general formula (margin width = 2/||w||) and solving for the parameters in the general formula:

$$\frac{2}{||\vec{w}||} = 4$$ $$\frac{2}{\sqrt{\vec{w}\cdot \vec{w}}} = 4$$ $$\frac{4}{\vec{w}\cdot \vec{w}} = 16$$ (squaring both sides) $$\vec{w}\cdot \vec{w} = \frac{1}{4}$$ $$\begin{bmatrix}0&c\end{bmatrix}\cdot \begin{bmatrix}0\\c\end{bmatrix} = \frac{1}{4}$$

(This uses the expression for \(\vec{w}\) we got after we multiplied by \(c\) above.)

$$\begin{eqnarray*} c^2 &=& \frac{1}{4}\\ c &=& \pm \frac{1}{2}\\ c &=& \frac{1}{2}\end{eqnarray*}$$

We take the positive root because we require \(c>0\), so that multiplying by \(c\) doesn't flip the inequality.

Returning to our inequality, we have

$$\begin{eqnarray*}\begin{bmatrix}0&c\end{bmatrix}\cdot \begin{bmatrix}x\\y\end{bmatrix} + 0 &\geq& 0\\ \begin{bmatrix}0&\frac{1}{2}\end{bmatrix}\cdot \begin{bmatrix}x\\y\end{bmatrix} + 0 &\geq& 0\end{eqnarray*}$$ $$\underbrace{\begin{bmatrix}0&\frac{1}{2}\end{bmatrix}}_{\vec{w}}\cdot \underbrace{\begin{bmatrix}x\\y\end{bmatrix}}_{\vec{x}} + \underbrace{0}_{b} \geq 0$$

And this is truly the right answer, now that we have solved for \(c\).
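As a quick sanity check (NumPy again; the two gutter points below are hypothetical coordinates chosen to lie on the lines \(y = \pm 2\), since the original drawing isn't reproduced here):

    import numpy as np

    # The answer from the worked problem above.
    w = np.array([0.0, 0.5])
    b = 0.0

    # The margin width should match what we measured off the drawing: 4.
    assert np.isclose(2 / np.linalg.norm(w), 4.0)

    # The gutters sit 2 units above and below the boundary y = 0,
    # so points there should evaluate to exactly +1 and -1.
    assert np.isclose(w @ np.array([7.0,  2.0]) + b, +1.0)
    assert np.isclose(w @ np.array([-3.0, -2.0]) + b, -1.0)

    # Any point with y >= 0 is classified positive.
    print(np.sign(w @ np.array([5.0, 0.3]) + b))  # 1.0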

3 What's the use of gutters?

The gutters are the regions where you have no hard data supporting your classification, but the decision boundary is, as its name implies, the cutoff for making a decision between the two classifications. As such, you would likely have lower confidence in any decision within the gutters; still, a classifier needs to make a decision one way or the other for all possible test data. (There's no reason not to: making no decision means you're automatically wrong on those points, whereas even a lousy classifier will get some of them right.)

----

Here are two more ways to look at it. First of all, and simplest, SVMs are supposed to find the boundary with the widest margin. The width of the margin is just the perpendicular distance between the gutters, so the gutters are important for that reason.

Second of all, if you attempt to state the SVM problem precisely, i.e. put it into a form that can be solved by a computer, it looks something like the following.

Find a boundary — which is defined by a normal vector \(\vec{w}\) and an offset \(b\) — such that:

  1. The boundary “ \(\vec{w}\cdot \vec{x} + b = 0\) ” separates the positive and negative points
  2. The width of the margin is as large as possible. Remember, the margin width is twice the perpendicular distance between the boundary and the training point that is closest to the boundary; in other words, $$\text{margin-width} = \min_{\text{training data}} \frac{2\, y_i (\vec{w}\cdot \vec{x}_i + b)}{||\vec{w}||}$$

Unfortunately, the problem as stated has no unique solution: if you find a satisfactory pair \(\langle \vec{w}, b \rangle\), then the pair \(\langle 3\vec{w}, 3b\rangle\) (for example) defines the same boundary and achieves the same margin. The problem is that any positive multiple of a satisfactory pair yields another satisfactory pair. In essence, we're just changing the units of measurement without materially affecting the boundary.

We can remove this additional degree of freedom by adding another constraint to the problem which establishes a sense of scale. For example, we could require \(\vec{w}\) to be a unit normal vector, i.e. we could require that \(||\vec{w}|| = 1\). This fixes the problem and gives SVMs a unique solution.
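Both facts are easy to check numerically (a sketch with made-up points; NumPy assumed):

    import numpy as np

    w, b = np.array([0.0, 0.5]), 0.0
    X = np.array([[1.0, 3.0], [2.0, -1.0], [-4.0, 0.2]])

    # Scaling <w, b> by any positive constant leaves every classification alone...
    assert np.array_equal(np.sign(X @ w + b), np.sign(X @ (3 * w) + 3 * b))

    # ...so we can pick a unique representative by rescaling until ||w|| = 1.
    c = 1 / np.linalg.norm(w)
    w_unit, b_unit = c * w, c * b
    assert np.isclose(np.linalg.norm(w_unit), 1.0)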

In 6.034, and elsewhere, we use an alternate constraint; we impose the gutter conditions:

\(\vec{w}\cdot \vec{x}_+ + b \geq + 1\) for positive training points \(\vec{x}_+\), with equality for the points on the positive gutter.

\(\vec{w}\cdot \vec{x}_- + b \leq -1\) for negative training points \(\vec{x}_-\), with equality for the points on the negative gutter.

These are often combined into a single inequality: \(y_i (\vec{w}\cdot \vec{x}_i + b) \geq 1\) for all training points \(\langle \vec{x}_i, y_i\rangle\).

This constraint is enough to eliminate the extra degree of freedom and give SVMs a unique solution. Moreover, it buys you a nice formula for the margin width: the closest training points now satisfy \(y_i (\vec{w}\cdot \vec{x}_i + b) = 1\), so the margin-width expression above simplifies to

\(\text{margin-width} = \frac{2}{||\vec{w}||}.\)
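To see this concretely, here's a short check (NumPy; it reuses \(\vec{w} = \begin{bmatrix}0&\frac{1}{2}\end{bmatrix}\) and the hypothetical gutter points from the worked example above): once the smallest value of \(y_i(\vec{w}\cdot \vec{x}_i + b)\) is pinned to 1, the margin formula from the previous section reduces to \(2/||\vec{w}||\).

    import numpy as np

    w, b = np.array([0.0, 0.5]), 0.0
    X = np.array([[7.0, 2.0], [-3.0, -2.0], [1.0, 6.0]])
    y = np.array([+1.0, -1.0, +1.0])

    # Gutter conditions: every point satisfies y_i (w . x_i + b) >= 1,
    # with equality at the points on the gutters.
    functional = y * (X @ w + b)
    assert np.isclose(functional.min(), 1.0)

    # Plugging that minimum into the margin formula gives 2 / ||w||.
    margin = 2 * functional.min() / np.linalg.norm(w)
    assert np.isclose(margin, 2 / np.linalg.norm(w))  # = 4 here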

That's why gutters are important.

Date: 2013-11-06T01:14-0500

Author: Dylan Holmes
