## Archive for the ‘**Predictive Analytics**’ Category

## An intuitive introduction to support vector machines using R – Part 1

About a year ago, I wrote a piece on support vector machines as a part of my *gentle introduction to data science R* series. So it is perhaps appropriate to begin this piece with a few words about my motivations for writing yet another article on the topic.

Late last year, a curriculum lead at DataCamp got in touch to ask whether I’d be interested in developing a course on SVMs for them.

My answer was, obviously, an enthusiastic “Yes!”

Instead of rehashing what I had done in my previous article, I thought it would be interesting to try an approach that focuses on building an intuition for how the algorithm works using examples of increasing complexity, supported by visualisation rather than math. This post is the first part of a two-part series based on this approach.

The article builds up some basic intuitions about support vector machines (abbreviated henceforth as SVM) and then focuses on linearly separable problems. Part 2 (to be released at a future date) will deal with radially separable and more complex data sets. The focus throughout is on developing an understanding *what* the algorithm does rather than the technical details of how it does it.

Prerequisites for this series are a basic knowledge of R and some familiarity with the *ggplot* package. However, even if you don’t have the latter, you should be able to follow much of what I cover so I encourage you to press on regardless.

<advertisement> if you have a DataCamp account, you may want to check out my course on support vector machines using R. Chapters 1 and 2 of the course closely follow the path I take in this article. </advertisement>

### A one dimensional example

A soft drink manufacturer has two brands of their flagship product: *Choke* (sugar content of 11g/100ml) and *Choke-R* (sugar content 8g/100 ml). The actual sugar content can vary quite a bit in practice so it can sometimes be hard to figure out the brand given the sugar content. Given sugar content data for 25 samples taken randomly from both populations (see file sugar_content.xls), our task is to come up with a decision rule for determining the brand.

Since this is one-variable problem, the simplest way to discern if the samples fall into distinct groups is through visualisation. Here’s one way to do this using *ggplot*:

…and here’s the resulting plot:

Note that we’ve simulated a one-dimensional plot by setting all the y values to 0.

From the plot, it is evident that the samples fall into distinct groups: low sugar content, bounded above by the 8.8 g/100ml sample and high sugar content, bounded below by the 10 g/100ml sample.

Clearly, any point that lies between the two points is an acceptable *decision boundary.* We could, for example, pick 9.1g/100ml and 9.7g/100ml. Here’s the R code with those points added in. Note that we’ve made the points a bit bigger and coloured them red to distinguish them from the sample points.

label=d_bounds$sep, size=2.5,

vjust=2, hjust=0.5, colour=”red”)

And here’s the plot:

Now, a bit about the decision rule. Say we pick the first point as the decision boundary, the decision rule would be:

Say we pick 9.1 as the decision boundary, our classifier (in R) would be:

The other one is left for you as an exercise.

Now, it is pretty clear that although either these points define an acceptable decision boundary, neither of them are the best. Let’s try to formalise our intuitive notion as to why this is so.

The *margin* is the distance between the points in both classes that are closest to the decision boundary. In case at hand, the margin is 1.2 g/100ml, which is the difference between the two extreme points at 8.8 g/100ml (Choke-R) and 10 g/100ml (Choke). It should be clear that the best separator is the one that lies halfway between the two extreme points. This is called the *maximum margin separator*. The maximum margin separator in the case at hand is simply the average of the two extreme points:

geom_point(data=mm_sep,aes(x=mm_sep$sep, y=c(0)), colour=”blue”, size=4)

And here’s the plot:

We are dealing with a one dimensional problem here so the decision boundary is a point. In a moment we will generalise this to a two dimensional case in which the boundary is a straight line.

Let’s close this section with some general points.

Remember this is a sample not the entire population, so it is quite possible (indeed likely) that there will be as yet unseen samples of Choke-R and Choke that have a sugar content greater than 8.8 and less than 10 respectively. So, the best classifier is one that lies at the greatest possible distance from both classes. The maximum margin separator is that classifier.

This toy example serves to illustrate the main aim of SVMs, which is to find an optimal separation boundary in the sense described here. However, doing this for real life problems is not so simple because life is not one dimensional. In the remainder of this article and its yet-to-be-written sequel, we will work through examples of increasing complexity so as to develop a good understanding of how SVMs work in addition to practical experience with using the popular SVM implementation in R.

<Advertisement> Again, for those of you who have DataCamp premium accounts, here is a course that covers pretty much the entire territory of this two part series. </Advertisement>

### Linearly separable case

The next level of complexity is a two dimensional case (2 predictors) in which the classes are separated by a straight line. We’ll create such a dataset next.

Let’s begin by generating 200 points with attributes x1 and x2, randomly distributed between 0 and 1. Here’s the R code:

Let’s visualise the generated data using a scatter plot:

And here’s the plot

Now let’s classify the points that lie above the line *x1=x2* as belonging to the class +1 and those that lie below it as belonging to class -1 (the class values are arbitrary choices, I could have chosen them to be anything at all). Here’s the R code:

Let’s modify the plot in Figure 4, colouring the points classified as +1n blue and those classified -1 red. For good measure, let’s also add in the decision boundary. Here’s the R code:

Note that the parameters in *geom_abline()* are derived from the fact that the line* x1=x2* has slope 1 and y intercept 0.

Here’s the resulting plot:

Next let’s introduce a margin in the dataset. To do this, we need to exclude points that lie within a specified distance of the boundary. A simple way to approximate this is to exclude points that have *x1* and *x2* values that differ by less a pre-specified value, *delta*. Here’s the code to do this with *delta* set to 0.05 units.

The check on the number of datapoints tells us that a number of points have been excluded.

Running the *previous* *ggplot* code block yields the following plot which clearly shows the reduced dataset with the depopulated region near the decision boundary:

Let’s add the margin boundaries to the plot. We know that these are parallel to the decision boundary and lie delta units on either side of it. In other words, the margin boundaries have slope=1 and y intercepts *delta* and –*delta*. Here’s the *ggplot* code:

And here’s the plot with the margins:

OK, so we have constructed a dataset that is *linearly separable*, which is just a short code for saying that the classes can be separated by a straight line. Further, the dataset has a margin, i.e. there is a “gap” so to speak, between the classes. Let’s save the dataset so that we can use it in the next section where we’ll take a first look at the *svm()* function in the *e1071* package.

That done, we can now move on to…

### Linear SVMs

Let’s begin by reading in the datafile we created in the previous section:

We then split the data into training and test sets using an 80/20 random split. There are many ways to do this. Here’s one:

The next step is to build the an SVM classifier model. We will do this using the *svm() f*unction which is available in the *e1071* package. The *svm()* function has a range of parameters. I explain some of the key ones below, in particular, the following parameters: *type*, *cost*, *kernel* and *scale*. It is recommended to have a browse of the documentation for more details.

The *type* parameter specifies the algorithm to be invoked by the function. The algorithm is capable of doing both classification and regression. We’ll focus on classification in this article. Note that there are two types of classification algorithms, nu and C classification. They essentially differ in the way that they penalise margin and boundary violations, but can be shown to lead to equivalent results. We will stick with C classification as it is more commonly used. The “C” refers to the cost which we discuss next.

The *cost* parameter specifies the penalty to be applied for boundary violations. This parameter can vary from 0 to infinity (in practice a large number compared to 0, say 10^6 or 10^8). We will explore the effect of varying *cost* later in this piece. To begin with, however, we will leave it at its default value of 1.

The *kernel* parameter specifies the kind of function to be used to construct the decision boundary. The options are linear, polynomial and radial. In this article we’ll focus on linear kernels as we know the decision boundary is a straight line.

The *scale* parameter is a Boolean that tells the algorithm whether or not the datapoints should be scaled to have zero mean and unit variance (i.e. shifted by the mean and scaled by the standard deviation). Scaling is generally good practice to avoid undue influence of attributes that have unduly large numeric values. However, in this case we will avoid scaling as we know the attributes are bounded and (more important) we would like to plot the boundary obtained from the algorithm manually.

Building the model is a simple one-line call, setting appropriate values for the parameters:

We expect a linear model to perform well here since the dataset it is linear by construction. Let’s confirm this by calculating training and test accuracy. Here’s the code:

The perfect accuracies confirm our expectation. However, accuracies by themselves are misleading because the story is somewhat more nuanced. To understand why, let’s plot the predicted decision boundary and margins using *ggplot*. To do this, we have to first extract information regarding these from the svm model object. One can obtain summary information for the model by typing in the model name like so:

kernel = “linear”, scale = FALSE)

Which outputs the following: the function *call*, SVM *type*, *kernel* and *cost* (which is set to its default). In case you are wondering about *gamma, * although it’s set to 0.5 here, it plays no role in linear SVMs. We’ll say more about it in the sequel to this article in which we’ll cover more complex kernels. More interesting are the *support vectors*. In a nutshell, these are training dataset points that *specify the location of the decision boundary*. We can develop a better understanding of their role by visualising them. To do this, we need to know their coordinates and indices (position within the dataset). This information is stored in the SVM model object. Specifically, the *index* element of *svm_model* contains the indices of the training dataset points that are support vectors and the *SV* element lists the coordinates of these points. The following R code lists these explicitly (Note that I’ve not shown the outputs in the code snippet below):

Let’s use the indices to visualise these points in the training dataset. Here’s the ggplot code to do that:

And here is the plot:

We now see that the support vectors are clustered around the boundary and, in a sense, serve to define it. We will see this more clearly by plotting the predicted decision boundary. To do this, we need its slope and intercept. These aren’t available directly available in the *svm_model*, but they can be extracted from the *coefs*, *SV* and *rho* elements of the object.

The first step is to use *coefs* and the support vectors to build the what’s called the *weight vector*. The *weight vector* is given by the product of the *coefs* matrix with the matrix containing the SVs. Note that the fact that only the support vectors play a role in defining the boundary is consistent with our expectation that the boundary should be fully specified by them. Indeed, this is often touted as a feature of SVMs in that it is one of the few classifiers that depends on only a small subset of the training data, i.e. the datapoints closest to the boundary rather than the entire dataset.

Once we have the weight vector, we can calculate the slope and intercept of the predicted decision boundary as follows:

Note that the slope and intercept are quite different from the correct values of 1 and 0 (reminder: the actual decision boundary is the line *x1=x2* by construction). We’ll see how to improve on this shortly, but before we do that, let’s plot the decision boundary using the slope and intercept we have just calculated. Here’s the code:

And here’s the augmented plot:

The plot clearly shows how the support vectors “support” the boundary – indeed, if one draws line segments from each of the points to the boundary in such a way that the intersect the boundary at right angles, the lines can be thought of as “holding the boundary in place”. Hence the term *support vector*.

This is a good time to mention that the *e1071* library provides a built-in plot method for *svm* function. This is invoked as follows:

The svm *plot *function takes a formula specifying the plane on which the boundary is to be plotted. This is not necessary here as we have only two predictors (x1 and x2) which automatically define a plane.

Here is the plot generated by the above code:

Note that the axes are switched (x1 is on the y axis). Aside from that, the plot is reassuringly similar to our *ggplot* version in Figure 9. Also note that that the support vectors are marked by “x”. Unfortunately the built in function does not display the margin boundaries, but this is something we can easily add to our home-brewed plot. Here’s how. We know that the margin boundaries are parallel to the decision boundary, so all we need to find out is their intercept. It turns out that the intercepts are offset by an amount *1/w[2]* units on either side of the decision boundary. With that information in hand we can now write the the code to add in the margins to the plot shown in Figure 9. Here it is:

geom_abline(slope=slope_1,intercept = intercept_1+1/w[2], linetype=”dashed”)

And here is the plot with the margins added in:

Note that the predicted margins are much wider than the actual ones (compare with Figure 7). As a consequence, many of the support vectors lie within the predicted margin – that is, they violate it. The upshot of the wide margin is that the decision boundary is not tightly specified. This is why we get a significant difference between the slope and intercept of predicted decision boundary and the actual one. We can sharpen the boundary by narrowing the margin. How do we do this? We make margin violations more expensive by increasing the *cost*. Let’s see this margin-narrowing effect in action by building a model with *cost = *100 on the same training dataset as before. Here is the code:

I’ll leave you to calculate the training and test accuracies (as one might expect, these will be perfect).

Let’s inspect the *cost=100* model:

kernel = “linear”,cost=100, scale = FALSE)

The number of support vectors is reduced from 55 to 6! We can plot these and the boundary / margin lines using *ggplot* as before. The code is identical to the previous case (see code block preceding Figure 8). If you run it, you will get the plot shown in Figure 12.

Since the boundary is more tightly specified, we would expect the slope and intercept of the predicted boundary to be considerably closer to their actual values of 1 and 0 respectively (as compared to the default cost case). Let’s confirm that this is so by calculating the slope and intercept as we did in the code snippets preceding Figure 9. Here’s the code:

Which nicely confirms our expectation.

The decision boundary and margins for the high cost case can also be plotted with the code shown earlier. Her it is for completeness:

geom_abline(slope=slope_100,intercept = intercept_100+1/w[2], linetype=”dashed”)

And here’s the plot:

SVMs that allow margin violations are called *soft margin classifiers* and those that do not are called *hard*. In this case, the hard margin classifier does a better job because it specifies the boundary more accurately than its soft counterpart. However, this does not mean that hard margin classifier are to be preferred over soft ones in all situations. Indeed, in real life, where we usually do not know the shape of the decision boundary upfront, soft margin classifiers can allow for a greater degree of uncertainty in the decision boundary thus improving generalizability of the classifier.

OK, so now we have a good feel for what the SVM algorithm does in the linearly separable case. We will round out this article by looking at a real world dataset that fortuitously turns out to be almost linearly separable: the famous (notorious?) iris dataset. It is instructive to look at this dataset because it serves to illustrate another feature of the *e1071* SVM algorithm – its capability to handle classification problems that have more than 2 classes.

### A multiclass problem

The iris dataset is well-known in the machine learning community as it features in many introductory courses and tutorials. It consists of 150 observations of 3 *species* of the iris flower – *setosa*, *versicolor* and *virginica*. Each observation consists of numerical values for 4 independent variables (predictors): petal length, petal width, sepal length and sepal width. The dataset is available in a standard installation of R as a built in dataset. Let’s read it in and examine its structure:

Now, as it turns out, petal length and petal width are the key determinants of species. So let’s create a scatterplot of the datapoints as a function of these two variables (i.e. project each data point on the petal length-petal width plane). We will also distinguish between species using different colour. Here’s the ggplot code to do this:

And here’s the plot:

On this plane we see a clear linear boundary between *setosa* and the other two species, *versicolor* and *virginica*. The boundary between the latter two is almost linear. Since there are four predictors, one would have to plot the other combinations to get a better feel for the data. I’ll leave this as an exercise for you and move on with the assumption that the data is nearly linearly separable. If the assumption is grossly incorrect, a linear SVM will not work well.

Up until now, we have discussed binary classification problem, i.e. those in which the predicted variable can take on only two values. In this case, however, the predicted variable, *Species*, can take on 3 values (setosa, versicolor and virginica). This brings up the question as to how the algorithm deals multiclass classification problems – i.e those involving datasets with more than two classes. The SVM algorithm does this using a *one-against-one *classification strategy. Here’s how it works:

- Divide the dataset (assumed to have N classes) into N(N-1)/2 datasets that have two classes each.
- Solve the binary classification problem for each of these subsets
- Use a simple voting mechanism to assign a class to each data point.

Basically, each data point is assigned the most frequent classification it receives from all the binary classification problems it figures in.

With that said, let’s get on with building the classifier. As before, we begin by splitting the data into training and test sets using an 80/20 random split. Here is the code to do this:

Then we build the model (default cost) and examine it:

The main thing to note is that the function call is identical to the binary classification case. We get some basic information about the model by typing in the model name as before:

kernel = “linear”)

And the train and test accuracies are computed in the usual way:

This looks good, but is potentially misleading because it is for a particular train/test split. Remember, in this case, unlike the earlier example, we do not know the shape of the actual decision boundary. So, to get a robust measure of accuracy, we should calculate the average test accuracy over a number of train/test partitions. Here’s some code to do that:

Which is not too bad at all, indicating that the dataset is indeed nearly linearly separable. If you try different values of *cost* you will see that it does not make much difference to the average accuracy.

This is a good note to close this piece on. Those who have access to DataCamp premium courses will find that the content above is covered in chapters 1 and 2 of the course on support vector machines in R. The next article in this two-part series will cover chapters 3 and 4.

## Summarising

My main objective in this article was to help develop an intuition for how SVMs work in simple cases. We illustrated the basic principles and terminology with a simple 1 dimensional example and then worked our way to linearly separable binary classification problems with multiple predictors. We saw how the latter can be solved using a popular svm implementation available in R. We also saw that the algorithm can handle multiclass problems. All through, we used visualisations to see what the algorithm does and how the key parameters affect the decision boundary and margins.

In the next part (yet to be written) we will see how SVMs can be generalised to deal with complex, nonlinear decision boundaries. In essence, the use a mathematical trick to “linearise” these boundaries. We’ll delve into details of this trick in an intuitive, visual way as we have done here.

Many thanks for reading!

## A gentle introduction to logistic regression and lasso regularisation using R

In this day and age of artificial intelligence and deep learning, it is easy to forget that simple algorithms can work well for a surprisingly large range of practical business problems. And the simplest place to start is with the granddaddy of data science algorithms: linear regression and its close cousin, logistic regression. Indeed, in his acclaimed MOOC and accompanying textbook, Yaser Abu-Mostafa spends a good portion of his time talking about linear methods, and with good reason too: linear methods are not only a good way to learn the key principles of machine learning, they can also be remarkably helpful in zeroing in on the most important predictors.

My main aim in this post is to provide a beginner level introduction to logistic regression using R and also introduce LASSO (Least Absolute Shrinkage and Selection Operator), a powerful feature selection technique that is very useful for regression problems. Lasso is essentially a regularization method. If you’re unfamiliar with the term, think of it as a way to reduce overfitting using less complicated functions (and if that means nothing to you, check out my prelude to machine learning). One way to do this is to toss out less important variables, after checking that they aren’t important. As we’ll discuss later, this can be done manually by examining p-values of coefficients and discarding those variables whose coefficients are not significant. However, this can become tedious for classification problems with many independent variables. In such situations, lasso offers a neat way to model the dependent variable while *automagically* *selecting significant variables by* *shrinking the coefficients of unimportant predictors to zero. A*ll this without having to mess around with p-values or obscure information criteria. How good is that?

### Why not linear regression?

In linear regression one attempts to model a dependent variable (i.e. the one being predicted) using the best straight line fit to a set of predictor variables. The best fit is usually taken to be one that minimises the root mean square error, which is the sum of square of the differences between the actual and predicted values of the dependent variable. One can think of logistic regression as the equivalent of linear regression for a classification problem. In what follows we’ll look at binary classification – i.e. a situation where the dependent variable takes on one of two possible values (Yes/No, True/False, 0/1 etc.).

First up, you might be wondering why one can’t use linear regression for such problems. The main reason is that classification problems are about determining *class membership* rather than predicting *variable values*, and linear regression is more naturally suited to the latter than the former. One could, in principle, use linear regression for situations where there is a natural ordering of categories like *High*, *Medium* and *Low* for example. However, one then has to map sub-ranges of the predicted values to categories. Moreover, since predicted values are potentially unbounded (in data as yet unseen) there remains a degree of arbitrariness associated with such a mapping.

Logistic regression sidesteps the aforementioned issues by modelling class probabilities instead. Any input to the model yields a number lying between 0 and 1, representing the probability of class membership. One is still left with the problem of determining the threshold probability, i.e. the probability at which the category flips from one to the other. By default this is set to p=0.5, but in reality it should be settled based on how the model will be used. For example, for a marketing model that identifies potentially responsive customers, the threshold for a positive event might be set low (much less than 0.5) because the client does not really care about mailouts going to a non-responsive customer (the negative event). Indeed they may be more than OK with it as there’s always a chance – however small – that a non-responsive customer will actually respond. As an opposing example, the cost of a false positive would be high in a machine learning application that grants access to sensitive information. In this case, one might want to set the threshold probability to a value closer to 1, say 0.9 or even higher. The point is, the setting an appropriate threshold probability is a business issue, not a technical one.

### Logistic regression in brief

So how does logistic regression work?

For the discussion let’s assume that the outcome (predicted variable) and predictors are denoted by Y and X respectively and the two classes of interest are denoted by + and – respectively. We wish to model the conditional probability that the outcome Y is +, given that the input variables (predictors) are X. The conditional probability is denoted by p(Y=+|X) which we’ll abbreviate as p(X) since we know we are referring to the positive outcome Y=+.

As mentioned earlier, we are after the *probability of class membership* so we must ensure that the hypothesis function (a fancy word for the model) always lies between 0 and 1. The function assumed in logistic regression is:

You can verify that does indeed lie between 0 and 1 as varies from to . Typically, however, the values of that make sense are bounded as shown in the example (stolen from Wikipedia) shown in Figure 1. The figure also illustrates the typical S-shaped curve characteristic of logistic regression.

As an aside, you might be wondering where the name *logistic* comes from. An equivalent way of expressing the above equation is:

The quantity on the left is the logarithm of the odds. So, the model is a linear regression of the *log-odds*, sometimes called* logit*, and hence the name *logistic*.

The problem is to find the values of and that results in a that most accurately classifies all the observed data points – that is, those that belong to the positive class have a probability as close as possible to 1 and those that belong to the negative class have a probability as close as possible to 0. One way to frame this problem is to say that we wish to maximise the product of these probabilities, often referred to as the likelihood:

Where represents the products over i and j, which run over the +ve and –ve classed points respectively. This approach, called maximum likelihood estimation, is quite common in many machine learning settings, especially those involving probabilities.

It should be noted that in practice one works with the *log* likelihood because it is easier to work with mathematically. Moreover, one *minimises* the *negative * log likelihood which, of course, is the same as maximising the log likelihood. The quantity one minimises is thus:

However, these are technical details that I mention only for completeness. As you will see next, they have little bearing on the practical use of logistic regression.

### Logistic regression in R – an example

In this example, we’ll use the logistic regression option implemented within the *glm* function that comes with the base R installation. This function fits a class of models collectively known as *generalized linear models*. We’ll apply the function to the *Pima Indian Diabetes* dataset that comes with the mlbench package. The code is quite straightforward – particularly if you’ve read earlier articles in my “gentle introduction” series – so I’ll just list the code below noting that the logistic regression option is invoked by setting *family=”binomial”* in the *glm* function call.

Here we go:

Although this seems pretty good, we aren’t quite done because there is an issue that is lurking under the hood. To see this, let’s examine the information output from the model summary, in particular the coefficient estimates (i.e. estimates for ) and their significance. Here’s a summary of the information contained in the table:

- Column 2 in the table lists coefficient estimates.
- Column 3 list s the standard error of the estimates (the larger the standard error, the less confident we are about the estimate)
- Column 4 the z statistic (which is the coefficient estimate (column 2) divided by the standard error of the estimate (column 3)) and
- The last column (Pr(>|z|) lists the p-value, which is the probability of getting the listed estimate assuming the predictor has no effect. In essence, the
*smaller*the p-value, the*more significant*the estimate is likely to be.

From the table we can conclude that only 4 predictors are significant – *pregnant*, *glucose*, *mass *and *pedigree *(and possibly a fifth – *pressure*). The other variables have little predictive power and worse, may contribute to overfitting. They should, therefore, be eliminated and we’ll do that in a minute. However, there’s an important point to note before we do so…

In this case we have only 9 variables, so are able to identify the significant ones by a manual inspection of p-values. As you can well imagine, such a process will quickly become tedious as the number of predictors increases. Wouldn’t it be be nice if there were an algorithm that could somehow automatically shrink the coefficients of these variables or (better!) set them to zero altogether? It turns out that this is precisely what lasso and its close cousin, ridge regression, do.

### Ridge and Lasso

Recall that the values of the logistic regression coefficients and are found by minimising the negative log likelihood described in equation (3). Ridge and lasso regularization work by adding a *penalty term* to the log likelihood function. In the case of ridge regression, the penalty term is and in the case of lasso, it is (Remember, is a vector, with as many components as there are predictors). The quantity to be minimised in the two cases is thus:

– for ridge regression,

and

– for lasso regression.

Where is a free parameter which is usually selected in such a way that the resulting model minimises the out of sample error. Typically, the optimal value of is found using grid search with cross-validation, a process akin to the one described in my discussion on cost-complexity parameter estimation in decision trees. Most canned algorithms provide methods to do this; the one we’ll use in the next section is no exception.

In the case of ridge regression, the effect of the penalty term is to shrink the coefficients that contribute most to the error. Put another way, it reduces the magnitude of the coefficients that contribute to *increasing* . In contrast, in the case of lasso regression, the effect of the penalty term is to set the these coefficients *exactly to zero! *This is cool because what it mean that lasso regression works like a feature selector that picks out the most important coefficients, i.e. those that are most predictive (and have the lowest p-values).

Let’s illustrate this through an example. We’ll use the glmnet package which implements a combined version of ridge and lasso (called elastic net). Instead of minimising (4) or (5) above, glmnet minimises:

where controls the “mix” of ridge and lasso regularisation, with being “pure” ridge and being “pure” lasso.

### Lasso regularisation using glmnet

Let’s reanalyse the Pima Indian Diabetes dataset using glmnet with (pure lasso). Before diving into code, it is worth noting that glmnet:

- does not have a formula interface, so one has to input the predictors as a matrix and the class labels as a vector.
- does not accept categorical predictors, so one has to convert these to numeric values before passing them to glmnet.

The glmnet function model.matrix creates the matrix and also converts categorical predictors to appropriate dummy variables.

Another important point to note is that we’ll use the function cv.glmnet, which automatically performs a grid search to find the optimal value of .

OK, enough said, here we go:

The plot is shown in Figure 2 below:

The plot shows that the *log *of the optimal value of lambda (i.e. the one that minimises the root mean square error) is approximately -5. The exact value can be viewed by examining the variable *lambda_min* in the code below. In general though, the objective of regularisation is to balance accuracy *and *simplicity. In the present context, this means a model with the* smallest number of coefficients that also gives a good accuracy*. To this end, the cv.glmnet function finds the value of lambda that gives the simplest model but also lies within one standard error of the optimal value of lambda. This value of lambda (*lambda.1se*) is what we’ll use in the rest of the computation. Interested readers should have a look at this article for more on* lambda.1se* vs* lambda.min*.

The output shows that *only* those variables that we had determined to be significant on the basis of p-values have non-zero coefficients. The coefficients of all other variables have been set to zero by the algorithm! Lasso has reduced the complexity of the fitting function massively…and you are no doubt wondering what effect this has on accuracy. Let’s see by running the model against our test data:

Which is a bit less than what we got with the more complex model. So, we get a similar out-of-sample accuracy as we did before, and we do so using a way simpler function (4 non-zero coefficients) than the original one (9 nonzero coefficients). What this means is that the simpler function does at least as good a job fitting the signal in the data as the more complicated one. The bias-variance tradeoff tells us that the simpler function should be preferred because it is less likely to overfit the training data.

Paraphrasing William of Ockham: *all other things being equal, a simple hypothesis should be preferred over a complex one*.

### Wrapping up

In this post I have tried to provide a detailed introduction to logistic regression, one of the simplest (and oldest) classification techniques in the machine learning practitioners arsenal. Despite it’s simplicity (or I should say, because of it!) logistic regression works well for many business applications which often have a simple decision boundary. Moreover, because of its simplicity it is less prone to overfitting than flexible methods such as decision trees. Further, as we have shown, variables that contribute to overfitting can be eliminated using lasso (or ridge) regularisation, without compromising out-of-sample accuracy. Given these advantages and its inherent simplicity, it isn’t surprising that logistic regression remains a workhorse for data scientists.

## A prelude to machine learning

### What is machine learning?

The term *machine learning* gets a lot of airtime in the popular and trade press these days. As I started writing this article, I did a quick search for recent news headlines that contained this term. Here are the top three results with datelines within three days of the search:

The truth about hype usually tends to be quite prosaic and so it is in this case. Machine learning, as Professor Yaser Abu-Mostafa puts it, is simply about “*learning from data*.” And although the professor is referring to computers, this is so for humans too – we learn through patterns discerned from sensory data. As he states in the first few lines of his wonderful (but mathematically demanding!) book entitled, Learning From Data:

If you show a picture to a three-year-old and ask if there’s a tree in it, you will likely get a correct answer. If you ask a thirty year old what the definition of a tree is, you will likely get an inconclusive answer. We didn’t learn what a tree is by studying a [model] of what trees [are]. We learned by looking at trees. In other words, we learned from data.

In other words, the three year old forms a *model* of what constitutes a tree through a process of discerning a common pattern between all objects that grown-ups around her label “trees.” (the data). She can then “predict” that something is (or is not) a tree by applying this model to new instances presented to her.

This is exactly what happens in machine learning: the computer (or more correctly, the algorithm) builds a predictive model of a variable (like “treeness”) based on patterns it discerns in data. The model can then be applied to predict the value of the variable (e.g. is it a tree or not) in new instances.

With that said for an introduction, it is worth contrasting this machine-driven process of model building with the traditional approach of building mathematical models to predict phenomena as in, say, physics and engineering.

### What are models good for?

Physicists and engineers model phenomena using physical laws and mathematics. The aim of such modelling is both to understand and predict natural phenomena. For example, a physical law such as Newton’s Law of Gravitation is itself a model – it helps us understand how gravity works and make predictions about (say) where Mars is going to be six months from now. Indeed, all theories and laws of physics are but models that have wide applicability.

(*Aside*: Models are typically expressed via differential equations. Most differential equations are hard to solve analytically (or exactly), so scientists use computers to solve them numerically. It is important to note that in this case computers are used as *calculation tools*, they play no role in model-building.)

As mentioned earlier, the role of models in the sciences is twofold – *understanding* and *prediction*. In contrast, in machine learning the focus is usually on prediction rather than understanding. The predictive successes of machine learning have led certain commentators to claim that scientific theory building is obsolete and science can advance by crunching data alone. Such claims are overblown, not to mention, hubristic, for although a data scientist may be able to predict with accuracy, he or she may not be able to tell you why a particular prediction is obtained. This lack of understanding can mislead and can even have harmful consequences, a point that’s worth unpacking in some detail…

### Assumptions, assumptions

A model of a real world process or phenomenon is necessarily a simplification. This is essentially because it is impossible to isolate a process or phenomenon from the rest of the world. As a consequence it is impossible to know for certain that the model one has built has incorporated all the interactions that influence the process / phenomenon of interest. It is quite possible that potentially important variables have been overlooked.

The selection of variables that go into a model is based on *assumptions*. In the case of model building in physics, these assumptions are made upfront and are thus clear to anybody who takes the trouble to read the underlying theory. In machine learning, however, the assumptions are harder to see because they are implicit in the data and the algorithm. This can be a problem when data is biased or an algorithm opaque.

Problem of bias and opacity become more acute as datasets increase in size and algorithms become more complex, especially when applied to social issues that have serious human consequences. I won’t go into this here, but for examples the interested reader may want to have a look at Cathy O’Neil’s book, Weapons of Math Destruction, or my article on the dark side of data science.

As an aside, I should point out that although assumptions are usually obvious in traditional modelling, they are often overlooked out of sheer laziness or, more charitably, lack of awareness. This can have disastrous consequences. The global financial crisis of 2008 can – to some extent – be blamed on the failure of trading professionals to understand assumptions behind the model that was used to calculate the value of collateralised debt obligations.

### It all starts with a straight line….

Now that we’ve taken a tour of some of the key differences between model building in the old and new worlds, we are all set to start talking about machine learning proper.

I should begin by admitting that I overstated the point about opacity: there are some machine learning algorithms that are transparent as can possibly be. Indeed, chances are you know the algorithm I’m going to discuss next, either from an introductory statistics course in university or from plotting relationships between two variables in your favourite spreadsheet. Yea, you may have guessed that I’m referring to linear regression.

In its simplest avatar, linear regression attempts to fit a straight line to a set of data points in two dimensions. The two dimensions correspond to a dependent variable (traditionally denoted by ) and an independent variable (traditionally denoted by ). An example of such a fitted line is shown in Figure 1. Once such a line is obtained, one can “predict” the value of the dependent variable for any value of the independent variable. In terms of our earlier discussion, *the line is the model.*

Figure 1 also serves to illustrate that linear models are going to be inappropriate in most real world situations (the straight line does not fit the data well). But it is not so hard to devise methods to fit more complicated functions.

The important point here is that since machine learning is about finding functions that accurately predict dependent variables for as yet unknown values of the independent variables, most algorithms make explicit or implicit choices about the form of these functions.

### Complexity versus simplicity

At first sight it seems a no-brainer that complicated functions will work better than simple ones. After all, if we choose a nonlinear function with lots of parameters, we should be able to fit a complex data set better than a linear function can (See Figure 2 – the complicated function fits the datapoints better than the straight line). But there’s catch: although the ability to fit a dataset increases with the flexibility of the fitting function, increasing complexity beyond a point will invariably reduce predictive power. Put another way, a complex enough function may fit the known data points perfectly but, as a consequence, will inevitably perform poorly on unknown data. This is an important point so let’s look at it in greater detail.

Recall that the aim of machine learning is to predict values of the dependent variable for *as yet unknown* values of the independent variable(s). Given a finite (and usually, very limited) dataset, how do we build a model that we can have some confidence in? The usual strategy is to partition the dataset into two subsets, one containing 60 to 80% of the data (called the* training set*) and the other containing the remainder (called the *test set*). The model is then built – i.e. an appropriate function fitted – using the training data and verified against the test data. The verification process consists of comparing the predicted values of the dependent variable with the known values for the test set.

Now, it should be intuitively clear that the more complicated the function, the better it will fit the *training* data.

** Question**: Why?

** Answer**: Because complicated functions have more free parameters – for example, linear functions of a single (dependent) variable have two parameters (slope and intercept), quadratics have three, cubics four and so on. The mathematician, John von Neumann is believed to have said, “

*With four parameters I can fit an elephant, and with five I can make him wiggle his trunk*.” See this post for a nice demonstration of the literal truth of his words.

Put another way, complex functions are wrigglier than simple ones, and – by suitable adjustment of parameters – their “wriggliness” can be adjusted to fit the training data better than functions that are less wriggly. Figure 2 illustrates this point well.

This may sound like you can have your cake and eat it too: choose a complicated enough function and you can fit both the training and test data well. Not so! Keep in mind that the resulting model (fitted function) is built using the training set alone, so a good fit to the test data is not guaranteed. In fact, it is intuitively clear that a function that fits the training data perfectly (as in Figure 2) is likely to do a terrible job on the test data.

** Question**: Why?

** Answer**: Remember, as far as the model is concerned,

*the test data is unknown*. Hence, the greater the wriggliness in the trained model, the less likely it is to fit the test data well. Remember, once the model is fitted to the training data, you have

*no*freedom to tweak parameters any further.

This tension between simplicity and complexity of models is one of the key principles of machine learning and is called the bias-variance tradeoff. Bias here refers to lack of flexibility and variance, the reducible error. In general simpler functions have greater bias and lower variance and complex functions, the opposite. Much of the subtlety of machine learning lies in developing an understanding of how to arrive at the right level of complexity for the problem at hand – that is, how to tweak parameters so that the resulting function fits the training data just well enough so as to generalise well to unknown data.

**Note**: those who are curious to learn more about the bias-variance tradeoff may want to have a look at this piece. For details on how to achieve an optimal tradeoff, search for articles on *regularization in machine learning*.

### Unlocking unstructured data

The discussion thus far has focused primarily on quantitative or enumerable data (numbers and categories) that’s stored in a *structured* format – i.e. as columns and rows in a spreadsheet or database table). This is fine as it goes, but the fact is that much of the data in organisations is *unstructured*, the most common examples being text documents and audio-visual media. This data is virtually impossible to analyse computationally using relational database technologies (such as SQL) that are commonly used by organisations.

The situation has changed dramatically in the last decade or so. Text analysis techniques that once required expensive software and high-end computers have now been implemented in open source languages such as Python and R, and can be run on personal computers. For problems that require computing power and memory beyond that, cloud technologies make it possible to do so cheaply. In my opinion, the ability to analyse textual data is the most important advance in data technologies in the last decade or so. It unlocks a world of possibilities for the curious data analyst. Just think, all those comment fields in your survey data can now be analysed in a way that was never possible in the relational world!

There is a general impression that text analysis is hard. Although some of the advanced techniques can take a little time to wrap one’s head around, the basics are simple enough. Yea, I really mean that – for proof, check out my tutorial on the topic.

### Wrapping up

I could go on for a while. Indeed, I was planning to delve into a few algorithms of increasing complexity (from regression to trees and forests to neural nets) and then close with a brief peek at some of the more recent headline-grabbing developments like deep learning. However, I realised that such an exploration would be too long and (perhaps more importantly) defeat the main intent of this piece which is to give starting students an idea of what machine learning is about, and how it differs from preexisting techniques of data analysis. I hope I have succeeded, at least partially, in achieving that aim.

For those who are interested in learning more about machine learning algorithms, I can suggest having a look at my “*Gentle Introduction to Data Science using R*” series of articles. Start with the one on text analysis (link in last line of previous section) and then move on to clustering, topic modelling, naive Bayes, decision trees, random forests and support vector machines. I’m slowly adding to the list as I find the time, so please do check back again from time to time.

**Note**: This post is written as an introduction to the Data, Algorithms and Meaning subject that is part of the core curriculum of the Master of Data Science and Innovation program at UTS. I’m co-teaching the subject in Autumn 2018 with Alex Scriven and Rory Angus.