## Archive for **June 2018**

## An intuitive introduction to support vector machines using R – Part 1

About a year ago, I wrote a piece on support vector machines as a part of my *gentle introduction to data science R* series. So it is perhaps appropriate to begin this piece with a few words about my motivations for writing yet another article on the topic.

Late last year, a curriculum lead at DataCamp got in touch to ask whether I’d be interested in developing a course on SVMs for them.

My answer was, obviously, an enthusiastic “Yes!”

Instead of rehashing what I had done in my previous article, I thought it would be interesting to try an approach that focuses on building an intuition for how the algorithm works using examples of increasing complexity, supported by visualisation rather than math. This post is the first part of a two-part series based on this approach.

The article builds up some basic intuitions about support vector machines (abbreviated henceforth as SVM) and then focuses on linearly separable problems. Part 2 (to be released at a future date) will deal with radially separable and more complex data sets. The focus throughout is on developing an understanding *what* the algorithm does rather than the technical details of how it does it.

Prerequisites for this series are a basic knowledge of R and some familiarity with the *ggplot* package. However, even if you don’t have the latter, you should be able to follow much of what I cover so I encourage you to press on regardless.

<advertisement> if you have a DataCamp account, you may want to check out my course on support vector machines using R. Chapters 1 and 2 of the course closely follow the path I take in this article. </advertisement>

### A one dimensional example

A soft drink manufacturer has two brands of their flagship product: *Choke* (sugar content of 11g/100ml) and *Choke-R* (sugar content 8g/100 ml). The actual sugar content can vary quite a bit in practice so it can sometimes be hard to figure out the brand given the sugar content. Given sugar content data for 25 samples taken randomly from both populations (see file sugar_content.xls), our task is to come up with a decision rule for determining the brand.

Since this is one-variable problem, the simplest way to discern if the samples fall into distinct groups is through visualisation. Here’s one way to do this using *ggplot*:

…and here’s the resulting plot:

Note that we’ve simulated a one-dimensional plot by setting all the y values to 0.

From the plot, it is evident that the samples fall into distinct groups: low sugar content, bounded above by the 8.8 g/100ml sample and high sugar content, bounded below by the 10 g/100ml sample.

Clearly, any point that lies between the two points is an acceptable *decision boundary.* We could, for example, pick 9.1g/100ml and 9.7g/100ml. Here’s the R code with those points added in. Note that we’ve made the points a bit bigger and coloured them red to distinguish them from the sample points.

label=d_bounds$sep, size=2.5,

vjust=2, hjust=0.5, colour=”red”)

And here’s the plot:

Now, a bit about the decision rule. Say we pick the first point as the decision boundary, the decision rule would be:

Say we pick 9.1 as the decision boundary, our classifier (in R) would be:

The other one is left for you as an exercise.

Now, it is pretty clear that although either these points define an acceptable decision boundary, neither of them are the best. Let’s try to formalise our intuitive notion as to why this is so.

The *margin* is the distance between the points in both classes that are closest to the decision boundary. In case at hand, the margin is 1.2 g/100ml, which is the difference between the two extreme points at 8.8 g/100ml (Choke-R) and 10 g/100ml (Choke). It should be clear that the best separator is the one that lies halfway between the two extreme points. This is called the *maximum margin separator*. The maximum margin separator in the case at hand is simply the average of the two extreme points:

geom_point(data=mm_sep,aes(x=mm_sep$sep, y=c(0)), colour=”blue”, size=4)

And here’s the plot:

We are dealing with a one dimensional problem here so the decision boundary is a point. In a moment we will generalise this to a two dimensional case in which the boundary is a straight line.

Let’s close this section with some general points.

Remember this is a sample not the entire population, so it is quite possible (indeed likely) that there will be as yet unseen samples of Choke-R and Choke that have a sugar content greater than 8.8 and less than 10 respectively. So, the best classifier is one that lies at the greatest possible distance from both classes. The maximum margin separator is that classifier.

This toy example serves to illustrate the main aim of SVMs, which is to find an optimal separation boundary in the sense described here. However, doing this for real life problems is not so simple because life is not one dimensional. In the remainder of this article and its yet-to-be-written sequel, we will work through examples of increasing complexity so as to develop a good understanding of how SVMs work in addition to practical experience with using the popular SVM implementation in R.

<Advertisement> Again, for those of you who have DataCamp premium accounts, here is a course that covers pretty much the entire territory of this two part series. </Advertisement>

### Linearly separable case

The next level of complexity is a two dimensional case (2 predictors) in which the classes are separated by a straight line. We’ll create such a dataset next.

Let’s begin by generating 200 points with attributes x1 and x2, randomly distributed between 0 and 1. Here’s the R code:

Let’s visualise the generated data using a scatter plot:

And here’s the plot

Now let’s classify the points that lie above the line *x1=x2* as belonging to the class +1 and those that lie below it as belonging to class -1 (the class values are arbitrary choices, I could have chosen them to be anything at all). Here’s the R code:

Let’s modify the plot in Figure 4, colouring the points classified as +1n blue and those classified -1 red. For good measure, let’s also add in the decision boundary. Here’s the R code:

Note that the parameters in *geom_abline()* are derived from the fact that the line* x1=x2* has slope 1 and y intercept 0.

Here’s the resulting plot:

Next let’s introduce a margin in the dataset. To do this, we need to exclude points that lie within a specified distance of the boundary. A simple way to approximate this is to exclude points that have *x1* and *x2* values that differ by less a pre-specified value, *delta*. Here’s the code to do this with *delta* set to 0.05 units.

The check on the number of datapoints tells us that a number of points have been excluded.

Running the *previous* *ggplot* code block yields the following plot which clearly shows the reduced dataset with the depopulated region near the decision boundary:

Let’s add the margin boundaries to the plot. We know that these are parallel to the decision boundary and lie delta units on either side of it. In other words, the margin boundaries have slope=1 and y intercepts *delta* and –*delta*. Here’s the *ggplot* code:

And here’s the plot with the margins:

OK, so we have constructed a dataset that is *linearly separable*, which is just a short code for saying that the classes can be separated by a straight line. Further, the dataset has a margin, i.e. there is a “gap” so to speak, between the classes. Let’s save the dataset so that we can use it in the next section where we’ll take a first look at the *svm()* function in the *e1071* package.

That done, we can now move on to…

### Linear SVMs

Let’s begin by reading in the datafile we created in the previous section:

We then split the data into training and test sets using an 80/20 random split. There are many ways to do this. Here’s one:

The next step is to build the an SVM classifier model. We will do this using the *svm() f*unction which is available in the *e1071* package. The *svm()* function has a range of parameters. I explain some of the key ones below, in particular, the following parameters: *type*, *cost*, *kernel* and *scale*. It is recommended to have a browse of the documentation for more details.

The *type* parameter specifies the algorithm to be invoked by the function. The algorithm is capable of doing both classification and regression. We’ll focus on classification in this article. Note that there are two types of classification algorithms, nu and C classification. They essentially differ in the way that they penalise margin and boundary violations, but can be shown to lead to equivalent results. We will stick with C classification as it is more commonly used. The “C” refers to the cost which we discuss next.

The *cost* parameter specifies the penalty to be applied for boundary violations. This parameter can vary from 0 to infinity (in practice a large number compared to 0, say 10^6 or 10^8). We will explore the effect of varying *cost* later in this piece. To begin with, however, we will leave it at its default value of 1.

The *kernel* parameter specifies the kind of function to be used to construct the decision boundary. The options are linear, polynomial and radial. In this article we’ll focus on linear kernels as we know the decision boundary is a straight line.

The *scale* parameter is a Boolean that tells the algorithm whether or not the datapoints should be scaled to have zero mean and unit variance (i.e. shifted by the mean and scaled by the standard deviation). Scaling is generally good practice to avoid undue influence of attributes that have unduly large numeric values. However, in this case we will avoid scaling as we know the attributes are bounded and (more important) we would like to plot the boundary obtained from the algorithm manually.

Building the model is a simple one-line call, setting appropriate values for the parameters:

We expect a linear model to perform well here since the dataset it is linear by construction. Let’s confirm this by calculating training and test accuracy. Here’s the code:

The perfect accuracies confirm our expectation. However, accuracies by themselves are misleading because the story is somewhat more nuanced. To understand why, let’s plot the predicted decision boundary and margins using *ggplot*. To do this, we have to first extract information regarding these from the svm model object. One can obtain summary information for the model by typing in the model name like so:

kernel = “linear”, scale = FALSE)

Which outputs the following: the function *call*, SVM *type*, *kernel* and *cost* (which is set to its default). In case you are wondering about *gamma, * although it’s set to 0.5 here, it plays no role in linear SVMs. We’ll say more about it in the sequel to this article in which we’ll cover more complex kernels. More interesting are the *support vectors*. In a nutshell, these are training dataset points that *specify the location of the decision boundary*. We can develop a better understanding of their role by visualising them. To do this, we need to know their coordinates and indices (position within the dataset). This information is stored in the SVM model object. Specifically, the *index* element of *svm_model* contains the indices of the training dataset points that are support vectors and the *SV* element lists the coordinates of these points. The following R code lists these explicitly (Note that I’ve not shown the outputs in the code snippet below):

Let’s use the indices to visualise these points in the training dataset. Here’s the ggplot code to do that:

And here is the plot:

We now see that the support vectors are clustered around the boundary and, in a sense, serve to define it. We will see this more clearly by plotting the predicted decision boundary. To do this, we need its slope and intercept. These aren’t available directly available in the *svm_model*, but they can be extracted from the *coefs*, *SV* and *rho* elements of the object.

The first step is to use *coefs* and the support vectors to build the what’s called the *weight vector*. The *weight vector* is given by the product of the *coefs* matrix with the matrix containing the SVs. Note that the fact that only the support vectors play a role in defining the boundary is consistent with our expectation that the boundary should be fully specified by them. Indeed, this is often touted as a feature of SVMs in that it is one of the few classifiers that depends on only a small subset of the training data, i.e. the datapoints closest to the boundary rather than the entire dataset.

Once we have the weight vector, we can calculate the slope and intercept of the predicted decision boundary as follows:

Note that the slope and intercept are quite different from the correct values of 1 and 0 (reminder: the actual decision boundary is the line *x1=x2* by construction). We’ll see how to improve on this shortly, but before we do that, let’s plot the decision boundary using the slope and intercept we have just calculated. Here’s the code:

And here’s the augmented plot:

The plot clearly shows how the support vectors “support” the boundary – indeed, if one draws line segments from each of the points to the boundary in such a way that the intersect the boundary at right angles, the lines can be thought of as “holding the boundary in place”. Hence the term *support vector*.

This is a good time to mention that the *e1071* library provides a built-in plot method for *svm* function. This is invoked as follows:

The svm *plot *function takes a formula specifying the plane on which the boundary is to be plotted. This is not necessary here as we have only two predictors (x1 and x2) which automatically define a plane.

Here is the plot generated by the above code:

Note that the axes are switched (x1 is on the y axis). Aside from that, the plot is reassuringly similar to our *ggplot* version in Figure 9. Also note that that the support vectors are marked by “x”. Unfortunately the built in function does not display the margin boundaries, but this is something we can easily add to our home-brewed plot. Here’s how. We know that the margin boundaries are parallel to the decision boundary, so all we need to find out is their intercept. It turns out that the intercepts are offset by an amount *1/w[2]* units on either side of the decision boundary. With that information in hand we can now write the the code to add in the margins to the plot shown in Figure 9. Here it is:

geom_abline(slope=slope_1,intercept = intercept_1+1/w[2], linetype=”dashed”)

And here is the plot with the margins added in:

Note that the predicted margins are much wider than the actual ones (compare with Figure 7). As a consequence, many of the support vectors lie within the predicted margin – that is, they violate it. The upshot of the wide margin is that the decision boundary is not tightly specified. This is why we get a significant difference between the slope and intercept of predicted decision boundary and the actual one. We can sharpen the boundary by narrowing the margin. How do we do this? We make margin violations more expensive by increasing the *cost*. Let’s see this margin-narrowing effect in action by building a model with *cost = *100 on the same training dataset as before. Here is the code:

I’ll leave you to calculate the training and test accuracies (as one might expect, these will be perfect).

Let’s inspect the *cost=100* model:

kernel = “linear”,cost=100, scale = FALSE)

The number of support vectors is reduced from 55 to 6! We can plot these and the boundary / margin lines using *ggplot* as before. The code is identical to the previous case (see code block preceding Figure 8). If you run it, you will get the plot shown in Figure 12.

Since the boundary is more tightly specified, we would expect the slope and intercept of the predicted boundary to be considerably closer to their actual values of 1 and 0 respectively (as compared to the default cost case). Let’s confirm that this is so by calculating the slope and intercept as we did in the code snippets preceding Figure 9. Here’s the code:

Which nicely confirms our expectation.

The decision boundary and margins for the high cost case can also be plotted with the code shown earlier. Her it is for completeness:

geom_abline(slope=slope_100,intercept = intercept_100+1/w[2], linetype=”dashed”)

And here’s the plot:

SVMs that allow margin violations are called *soft margin classifiers* and those that do not are called *hard*. In this case, the hard margin classifier does a better job because it specifies the boundary more accurately than its soft counterpart. However, this does not mean that hard margin classifier are to be preferred over soft ones in all situations. Indeed, in real life, where we usually do not know the shape of the decision boundary upfront, soft margin classifiers can allow for a greater degree of uncertainty in the decision boundary thus improving generalizability of the classifier.

OK, so now we have a good feel for what the SVM algorithm does in the linearly separable case. We will round out this article by looking at a real world dataset that fortuitously turns out to be almost linearly separable: the famous (notorious?) iris dataset. It is instructive to look at this dataset because it serves to illustrate another feature of the *e1071* SVM algorithm – its capability to handle classification problems that have more than 2 classes.

### A multiclass problem

The iris dataset is well-known in the machine learning community as it features in many introductory courses and tutorials. It consists of 150 observations of 3 *species* of the iris flower – *setosa*, *versicolor* and *virginica*. Each observation consists of numerical values for 4 independent variables (predictors): petal length, petal width, sepal length and sepal width. The dataset is available in a standard installation of R as a built in dataset. Let’s read it in and examine its structure:

Now, as it turns out, petal length and petal width are the key determinants of species. So let’s create a scatterplot of the datapoints as a function of these two variables (i.e. project each data point on the petal length-petal width plane). We will also distinguish between species using different colour. Here’s the ggplot code to do this:

And here’s the plot:

On this plane we see a clear linear boundary between *setosa* and the other two species, *versicolor* and *virginica*. The boundary between the latter two is almost linear. Since there are four predictors, one would have to plot the other combinations to get a better feel for the data. I’ll leave this as an exercise for you and move on with the assumption that the data is nearly linearly separable. If the assumption is grossly incorrect, a linear SVM will not work well.

Up until now, we have discussed binary classification problem, i.e. those in which the predicted variable can take on only two values. In this case, however, the predicted variable, *Species*, can take on 3 values (setosa, versicolor and virginica). This brings up the question as to how the algorithm deals multiclass classification problems – i.e those involving datasets with more than two classes. The SVM algorithm does this using a *one-against-one *classification strategy. Here’s how it works:

- Divide the dataset (assumed to have N classes) into N(N-1)/2 datasets that have two classes each.
- Solve the binary classification problem for each of these subsets
- Use a simple voting mechanism to assign a class to each data point.

Basically, each data point is assigned the most frequent classification it receives from all the binary classification problems it figures in.

With that said, let’s get on with building the classifier. As before, we begin by splitting the data into training and test sets using an 80/20 random split. Here is the code to do this:

Then we build the model (default cost) and examine it:

The main thing to note is that the function call is identical to the binary classification case. We get some basic information about the model by typing in the model name as before:

kernel = “linear”)

And the train and test accuracies are computed in the usual way:

This looks good, but is potentially misleading because it is for a particular train/test split. Remember, in this case, unlike the earlier example, we do not know the shape of the actual decision boundary. So, to get a robust measure of accuracy, we should calculate the average test accuracy over a number of train/test partitions. Here’s some code to do that:

Which is not too bad at all, indicating that the dataset is indeed nearly linearly separable. If you try different values of *cost* you will see that it does not make much difference to the average accuracy.

This is a good note to close this piece on. Those who have access to DataCamp premium courses will find that the content above is covered in chapters 1 and 2 of the course on support vector machines in R. The next article in this two-part series will cover chapters 3 and 4.

## Summarising

My main objective in this article was to help develop an intuition for how SVMs work in simple cases. We illustrated the basic principles and terminology with a simple 1 dimensional example and then worked our way to linearly separable binary classification problems with multiple predictors. We saw how the latter can be solved using a popular svm implementation available in R. We also saw that the algorithm can handle multiclass problems. All through, we used visualisations to see what the algorithm does and how the key parameters affect the decision boundary and margins.

In the next part (yet to be written) we will see how SVMs can be generalised to deal with complex, nonlinear decision boundaries. In essence, the use a mathematical trick to “linearise” these boundaries. We’ll delve into details of this trick in an intuitive, visual way as we have done here.

Many thanks for reading!