Eight to Late

Sensemaking and Analytics for Organizations

A gentle introduction to data visualisation using R

with 5 comments

Data science students often focus on machine learning algorithms, overlooking some of the more routine but important skills of the profession.  I’ve lost count of the number of times I have advised students working on projects for industry clients to curb their keenness to code and work on understanding the data first.  This is important because, as people (ought to) know, data doesn’t speak for itself, it has to be given a voice; and as data-scarred professionals know from hard-earned experience, one of the best ways to do this is through visualisation.

Data visualisation is sometimes (often?) approached as a bag of tricks to be learnt individually, with no little or no reference to any underlying principles. Reading Hadley Wickham’s paper on the grammar of graphics was an epiphany; it showed me how different types of graphics can be constructed in a consistent way using common elements. Among other things, the grammar makes visualisation a logical affair rather than a set of tricks. This post is a brief – and hopefully logical – introduction to visualisation using ggplot2, Wickham’s implementation of a grammar of graphics.

In keeping with the practical bent of this series we’ll  focus on worked examples, illustrating elements of the grammar as we go along. We’ll first briefly describe the elements of the grammar and then show how these are used to build different types of visualisations.

A grammar of graphics

Most visualisations are constructed from common elements that are pieced together in prescribed ways.  The elements can be grouped into the following categories:

  • Data – this is obvious, without data there is no story to tell and definitely no plot!
  • Mappings – these are correspondences between data and display elements such as spatial location, shape or colour. Mappings are referred to as aesthetics in Wickham’s grammar.
  • Scales – these are transformations (conversions) of data values to numbers that can be displayed on-screen. There should be one scale per mapping. ggplot typically does the scaling transparently, without users having to worry about it. One situation in which you might need to mess with default scales is when you want to zoom in on a particular range of values. We’ll see an example or two of this later in this article.
  • Geometric objects – these specify the geometry of the visualisation. For example, in ggplot2 a scatter plot is specified via a point geometry whereas a fitting curve is represented by a smooth geometry. ggplot2 has a range of geometries available of which we will illustrate just a few.
  • Coordinate system – this specifies the system used to position data points on the graphic. Examples of coordinate systems are Cartesian and polar. We’ll deal with Cartesian systems in this tutorial. See this post for a nice illustration of how one can use polar plots creatively.
  • Facets – a facet specifies how data can be split over multiple plots to improve clarity. We’ll look at this briefly towards the end of this article.

The basic idea of a layered grammar of graphics is that each of these elements can be combined – literally added layer by layer – to achieve a desired visual result. Exactly how this is done will become clear as we work through some examples. So without further ado, let’s get to it.

Hatching (gg)plots

In what follows we’ll use the NSW Government Schools dataset,  made available via the state government’s open data initiative. The data is in csv format. If you cannot access the original dataset from the aforementioned link, you can download an Excel file with the data here (remember to save it as a csv before running the code!).

The first task – assuming that you have a working R/RStudio environment –  is to load the data into R. To keep things simple we’ll delete a number of columns (as shown in the code) and keep only  rows that are complete, i.e. those that have no missing values. Here’s the code:

#set working directory if needed (modify path as needed)
#setwd(“C:/Users/Kailash/Documents/ggplot”)
#load required library
library(ggplot2)
#load dataset (ensure datafile is in directory!)
nsw_schools <- read.csv(“NSW-Public-Schools-Master-Dataset-07032017.csv”)
#build expression for columns to delete
colnames_for_deletion <- paste0(“AgeID|”,”street|”,”indigenous_pct|”,
“lbote_pct|”,”opportunity_class|”,”school_specialty_type|”,
“school_subtype|”,”support_classes|”,”preschool_ind|”,
“distance_education|”,”intensive_english_centre|”,”phone|”,
“school_email|”,”fax|”,”late_opening_school|”,
“date_1st_teacher|”,”lga|”,”electorate|”,”fed_electorate|”,
“operational_directorate|”,”principal_network|”,
“facs_district|”,”local_health_district|”,”date_extracted”)
#get indexes of cols for deletion
cols_for_deletion <- grep(colnames_for_deletion,colnames(nsw_schools))
#delete them
nsw_schools <- nsw_schools[,-cols_for_deletion]
#structure and number of rows
str(nsw_schools)
nrow(nsw_schools)
#remove rows with NAs
nsw_schools <- nsw_schools[complete.cases(nsw_schools),]
#rowcount
nrow(nsw_schools)
#convert student number to numeric datatype.
#Need to convert factor to character first…
#…alternately, load data with StringsAsFactors set to FALSE
nsw_schools$student_number <- as.numeric(as.character(nsw_schools$student_number))
#a couple of character strings have been coerced to NA. Remove these
nsw_schools <- nsw_schools[complete.cases(nsw_schools),]

A note regarding the last line of code above, a couple of schools have “np” entered for the student_number variable. These are coerced to NA in the numeric conversion. The last line removes these two schools from the dataset.

Apart from student numbers and location data, we have retained level of schooling (primary, secondary etc.) and ICSEA ranking. The location information includes attributes such as suburb, postcode, region, remoteness as well as latitude and longitude. We’ll use only remoteness in this post.

The first thing that caught my eye in the data was was the ICSEA ranking.  Before going any further, I should mention that the  Australian Curriculum Assessment and Reporting Authority   (the  organisation responsible for developing the ICSEA system) emphasises that the score  is not a school ranking, but a measure of socio-educational advantage  of the student population in a school. Among other things, this is related to family background and geographic location.  The average ICSEA score is set at an average of 1000, which can be used as a reference level.

I thought a natural first step would be to see how ICSEA varies as a function of the other variables in the dataset such as student numbers and location (remoteness, for example). To begin with, let’s plot ICSEA rank as a function of student number. As it is our first plot, let’s take it step by step to understand how the layered grammar works. Here we go:

#specify data layer
p <- ggplot(data=nsw_schools)
#display plot
p

This displays a blank plot because we have not specified a mapping and geometry to go with the data. To get a plot we need to specify both. Let’s start with a scatterplot, which is specified via a point geometry.  Within the geometry function, variables are mapped to visual properties of the  using  aesthetic mappings. Here’s the code:

#specify a point geometry (geom_point)
p <- p + geom_point(mapping = aes(x=student_number,y=ICSEA_Value))
#…lo and behold our first plot
p

The resulting plot is shown in Figure 1.

Figure 1: Scatterplot of ICSEA score versus student numbers

At first sight there are two points that stand out: 1) there are fewer number of large schools, which we’ll look into in more detail later and 2) larger schools seem to have a higher ICSEA score on average.   To dig a little deeper into the latter, let’s add a linear trend line. We do that by adding another layer (geometry) to the scatterplot like so:

#add a trendline using geom_smooth
p <- p + geom_smooth(mapping= aes(x=student_number,y=ICSEA_Value),method=”lm”)
#scatter plot with trendline
p

The result is shown in Figure 2.

Figure 2: scatterplot of ICSEA vs student number with linear trendline

The lm method does a linear regression on the data.  The shaded area around the line is the 95% confidence level of the regression line (i.e that it is 95% certain that the true regression line lies in the shaded region). Note that geom_smooth   provides a range of smoothing functions including generalised linear and local regression (loess) models.

You may have noted that we’ve specified the aesthetic mappings in both geom_point and geom_smooth. To avoid this duplication, we can simply specify the mapping, once in the top level ggplot call (the first layer) like so:

#rewrite the above, specifying the mapping in the ggplot call instead of geom
p <- ggplot(data=nsw_schools,mapping= aes(x=student_number,y=ICSEA_Value)) +
geom_point()+
geom_smooth(method=”lm”)
#display plot, same as Fig 2
p

From Figure 2, one can see a clear positive correlation between student numbers and ICSEA scores, let’s zoom in around the average value (1000) to see this more clearly…

#set display to 900 < y < 1100
p <- p + coord_cartesian(ylim =c(900,1100))
#display plot
p

The coord_cartesian function is used to zoom the plot to without changing any other settings. The result is shown in Figure 3.

Figure 3: Zoomed view of Figure 2 for 900 < ICSEA <1100

To  make things clearer, let’s add a reference line at the average:

#add horizontal reference line at the avg ICSEA score
p <- p + geom_hline(yintercept=1000)
#display plot
p

The result, shown in Figure 4, indicates quite clearly that larger schools tend to have higher ICSEA scores. That said, there is a twist in the tale which we’ll come to a bit later.

Figure 4: Zoomed view with reference line at average value of ICSEA

As a side note, you would use geom_vline to zoom in on a specific range of x values and geom_abline to add a reference line with a specified slope and intercept. See this article on ggplot reference lines for more.

OK, now that we have seen how ICSEA scores vary with student numbers let’s switch tack and incorporate another variable in the mix.  An obvious one is remoteness. Let’s do a scatterplot as in Figure 1, but now colouring each point according to its remoteness value. This is done using the colour aesthetic as shown below:

#Map aecg_remoteness to colour aesthetic
p <- ggplot(data=nsw_schools, aes(x=student_number,y=ICSEA_Value,  colour=ASGS_remoteness)) +
geom_point()
#display plot
p

The resulting plot is shown in Figure 5.

Figure 5: ICSEA score as a function of student number and remoteness category

Aha, a couple of things become apparent. First up, large schools tend to be in metro areas, which makes good sense. Secondly, it appears that metro area schools have a distinct socio-educational advantage over regional and remote area schools. Let’s add trendlines by remoteness category as well to confirm that this is indeed so:

#add reference line at avg + trendlines for each remoteness category
p <- p + geom_hline(yintercept=1000) + geom_smooth(method=”lm”)
#display plot
p

The plot, which is shown in Figure 6, indicates clearly that  ICSEA scores decrease on the average as we move away from metro areas.

Figure 6: ICSEA scores vs student numbers and remoteness, with trendlines for each remoteness category

Moreover, larger schools metropolitan areas tend to have higher than average scores (above 1000),  regional areas tend to have lower than average scores overall, with remote areas being markedly more disadvantaged than both metro and regional areas.  This is no surprise, but the visualisations show just how stark the  differences are.

It is also interesting that, in contrast to metro and (to some extent) regional areas, there negative correlation between student numbers and scores for remote schools. One can also use  local regression to get a better picture of how ICSEA varies with student numbers and remoteness. To do this, we simply use the loess method instead of lm:

#redo plot using loess smoothing instead of lm
p <- ggplot(data=nsw_schools, aes(x=student_number,y=ICSEA_Value, colour=ASGS_remoteness)) +
geom_point() + geom_hline(yintercept=1000) + geom_smooth(method=”loess”)
#display plot
p

The result, shown in Figure 7,  has  a number  of interesting features that would have been worth pursuing further were we analysing this dataset in a real life project.  For example, why do small schools tend to have lower than benchmark scores?

Figure 7: ICSEA scores vs student numbers and remoteness with loess regression curves.

From even a casual look at figures 6 and 7, it is clear that the confidence intervals for remote areas are huge. This suggests that the number of datapoints for these regions are a) small and b) very scattered.  Let’s quantify the number by getting counts using the table function (I know, we could plot this too…and we will do so a little later). We’ll also transpose the results using data.frame to make them more readable:

#get school counts per remoteness category
data.frame(table(nsw_schools$ASGS_remoteness))
Var1 Freq
1 0
2 Inner Regional Australia 561
3 Major Cities of Australia 1077
4 Outer Regional Australia 337
5 Remote Australia 33
6 Very Remote Australia 14

The number of datapoints for remote regions is much less than those for metro and regional areas. Let’s repeat the loess plot with only the two remote regions. Here’s the code:

#create vector containing desired categories
remote_regions <- c(‘Remote Australia’,’Very Remote Australia’)
#redo loess plot with only remote regions included
p <- ggplot(data=nsw_schools[nsw_schools$ASGS_remoteness %in% remote_regions,], aes(x=student_number,y=ICSEA_Value, colour=ASGS_remoteness)) +
geom_point() + geom_hline(yintercept=1000) + geom_smooth(method=”loess”)
#display plot
p

The plot, shown in Figure 8, shows that there is indeed a huge variation in the (small number) of datapoints, and the confidence intervals reflect that. An interesting feature is that some small remote schools have above average scores. If we were doing a project on this data, this would be a feature worth pursuing further as it would likely be of interest to education policymakers.

Figure 8: Loess plots as in Figure 7 for remote region schools

Note that there is a difference in the x axis scale between Figures 7 and 8 – the former goes from 0 to 2000 whereas the  latter goes up to 400 only. So for a fair comparison, between remote and other areas, you may want to re-plot Figure 7, zooming in on student numbers between 0 and 400 (or even less). This will also enable you to see the complicated dependence of scores on student numbers more clearly across all regions.

We’ll leave the scores vs student numbers story there and move on to another  geometry – the well-loved bar chart. The first one is a visualisation of the remoteness category count that we did earlier. The relevant geometry function is geom_bar, and the code is as easy as:

#frequency plot
p <- ggplot(data=nsw_schools, aes(x=ASGS_remoteness)) + geom_bar()
#display plot
p

The plot is shown in Figure 9.

Figure 9: School count by remoteness categories

The category labels on the x axis are too long and look messy. This can be fixed by tilting them to a 45 degree angle so that they don’t run into each other as they most likely did when you ran the code on your computer. This is done by modifying the axis.text element of the plot theme. Additionally, it would be nice to get counts on top of each category bar. The way to do that is using another geometry function, geom_text. Here’s the code incorporating the two modifications:

#frequency plot
p <- p + geom_text(stat=’count’,aes(label= ..count..),vjust=-1)+
theme(axis.text.x=element_text(angle=45, hjust=1))
#display plot
p

The result is shown in Figure 10.

Figure 10: Bar plot of remoteness with counts and angled x labels

Some things to note: : stat=count tells ggplot to compute counts by category and the aesthetic label = ..count.. tells ggplot to access the internal variable that stores those counts. The the vertical justification setting, vjust=-1, tells ggplot to display the counts on top of the bars. Play around with different values of vjust to see how it works. The code to adjust label angles is self explanatory.

It would be nice to reorder the bars by frequency. This is easily done via fct_infreq function in the forcats package like so:

#use factor tools
library(forcats)
#descending
p <- ggplot(data=nsw_schools) +
geom_bar(mapping = aes(x=fct_infreq(ASGS_remoteness)))+
theme(axis.text.x=element_text(angle=45, hjust=1))
#display plot
p

The result is shown in Figure 11.

Figure 11: Barplot of Figure 10 sorted by descending count

To reverse the order, invoke fct_rev, which reverses the sort order:

#reverse sort order to ascending
p <- ggplot(data=nsw_schools) +
geom_bar(mapping = aes(x=fct_rev(fct_infreq(ASGS_remoteness))))+
theme(axis.text.x=element_text(angle=45, hjust=1))
#display plot
p

The resulting plot is shown in Figure 12.

Figure 12: Bar plot of Figure 10 sorted by ascending count

If this is all too grey for us, we can always add some colour. This is done using the fill aesthetic as follows:

#add colour using the fill aesthetic
p <- ggplot(data=nsw_schools) +
geom_bar(mapping = aes(x=ASGS_remoteness, fill=ASGS_remoteness))+
theme(axis.text.x=element_text(angle=45, hjust=1))
#display plot
p

The resulting plot is shown in Figure 13.

Figure 13: Coloured bar plot of school count by remoteness

Note that, in the above, that we have mapped fill and x to the same variable, remoteness which makes the legend superfluous. I will leave it to you to figure out how to suppress the legend – Google is your friend.

We could also map fill to another variable, which effectively adds another dimension to the plot. Here’s how:

#map fill to another variable
p <- ggplot(data=nsw_schools) +
geom_bar(mapping = aes(x=ASGS_remoteness, fill=level_of_schooling))+
theme(axis.text.x=element_text(angle=45, hjust=1))
#display plot
p

The plot is shown in Figure 14. The new variable, level of schooling, is displayed via proportionate coloured segments stacked up in each bar. The default stacking is one on top of the other.

Figure 14: Bar plot of school counts as a function of remoteness and school level

Alternately, one can stack them up side by side by setting the position argument to dodge as follows:

#stack side by side
p <- ggplot(data=nsw_schools) +
geom_bar(mapping = aes(x=ASGS_remoteness,fill=level_of_schooling),position =”dodge”)+
theme(axis.text.x=element_text(angle=45, hjust=1))
#display plot
p

The plot is shown in Figure 15.

Figure 15: Same data as in Figure 14 stacked side-by-side

Finally, setting the  position argument to fill  normalises the bar heights and gives us the proportions of level of schooling for each remoteness category. That sentence will  make more sense when you see Figure 16 below. Here’s the code, followed by the figure:

#proportion plot
p <- ggplot(data=nsw_schools) +
geom_bar(mapping = aes(x=ASGS_remoteness,fill=level_of_schooling),position = “fill”)+
theme(axis.text.x=element_text(angle=45, hjust=1))
#display plot
p

Obviously,  we lose frequency information since the bar heights are normalised.

Figure 16: Proportions of school levels for remoteness categories

An  interesting feature here is that  the proportion of central and community schools increases with remoteness. Unlike primary and secondary schools, central / community schools provide education from Kindergarten through Year 12. As remote areas have smaller numbers of students, it makes sense to consolidate educational resources in institutions that provide schooling at all levels .

Finally, to close the loop so to speak,  let’s revisit our very first plot in Figure 1 and try to simplify it in another way. We’ll use faceting to  split it out into separate plots, one per remoteness category. First, we’ll organise the subplots horizontally using facet_grid:

#faceting – subplots laid out horizontally (faceted variable on right of formula)
p <- ggplot(data=nsw_schools) + geom_point(mapping = aes(x=student_number,y=ICSEA_Value))+
facet_grid(~ASGS_remoteness)
#display plot
p

The plot is shown in Figure 17 in which the different remoteness categories are presented in separate plots (facets) against a common y axis. It shows, the sharp differences between student numbers between remote and other regions.

Figure 17: Horizontally laid out facet plots of ICSEA scores for different remoteness categories

To get a vertically laid out plot, switch the faceted variable to other side of the formula (left as an exercise for you).

If one has too many categories to fit into a single row, one can wrap the facets using facet_wrap like so:

#faceting – wrapping facets in 2 columns
p <- ggplot(data=nsw_schools) +
geom_point(mapping = aes(x=student_number,y=ICSEA_Value))+
facet_wrap(~ASGS_remoteness, ncol= 2)
#display plot
p

The resulting plot is shown in Figure 18.

Figure 18: Same data as in Figure 17, with facets wrapped in a 2 column format

One can specify the number of rows instead of columns. I won’t illustrate that as the change in syntax is quite obvious.

…and I think that’s a good place to stop.

Wrapping up

Data visualisation has a reputation of being a dark art, masterable only by the visually gifted. This may have been partially true some years ago, but in this day and age it definitely isn’t. Versatile  packages such as ggplot, that use a consistent syntax have made  the art much more accessible to visually ungifted folks like myself. In this post I have attempted to provide a brief and (hopefully) logical introduction to ggplot.  In closing I note that  although some of the illustrative examples  violate the  principles of good data visualisation, I hope this article will serve its primary purpose which is pedagogic rather than artistic.

Further reading:

Where to go for more? Two of the best known references are Hadley Wickham’s books:

I highly recommend his R for Data Science , available online here. Apart from providing a good overview of ggplot, it is an excellent introduction to R for data scientists.  If you haven’t read it, do yourself a favour and buy it now.

People tell me his ggplot book is an excellent book for those wanting to learn the ins and outs of ggplot . I have not read it myself, but if his other book is anything to go by, it should be pretty damn good.

Written by K

October 10, 2017 at 8:17 pm

5 Responses

Subscribe to comments with RSS.

  1. Good summary and intro. Thanks for posting!

    Like

    Hernán

    October 10, 2017 at 10:23 pm

  2. Try again with

    nsw_schools$student_number <- as.numeric(as.character(nsw_schools$student_number))

    Alternatively read in the data with stringsAsFactors = FALSE

    This correction changes most of the presentation.

    Like

    Bill Venables

    October 11, 2017 at 7:26 am

    • Hi Bill,

      Thanks so much for taking the time to point out the error, I truly appreciate it. I have now fixed it and have changed the presentation as well.

      Regards,

      Kailash.

      Like

      K

      October 11, 2017 at 12:41 pm

  3. Great. Thanks.

    Like

    firstdiscipline

    October 11, 2017 at 9:30 am

  4. […] Data science students often focus on machine learning algorithms, overlooking some of the more routine but important skills of the profession.  I’ve lost count of the number of times I have advised students working on projects for industry clients to curb their keenness to code and work on understanding the… Read more: A gentle introduction to data visualisation using R […]

    Like


Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.