7  Data visualisation

Data visualisation is a powerful tool with many important uses. First, visualisations allow us to explore the data, identify potential outliers and errors, or check that the variables behave in the way we would expect them to if they had been recorded correctly. Visualisations can also be used as an analysis tool, allowing us to identify trends in the data or differences between groups. Finally, visualisations can help to convey messages to an audience in a clear, concise way that is often more powerful than presenting them using numbers or text. In some cases, data visualisations can show results so clearly that further analysis is arguably unnecessary.

7.1 Choosing the most appropriate visualisation

The most appropriate choice of visualisation depends first and foremost on the goal, the context, and the audience of the visualisation. This choice will also be influenced (or restricted) by the type of variable(s) we wish to display and the number of variables. Common plots used to display combinations of different types of data are given in the following table:

| Number of variables | Type of variables | Visualisation | geom object (or R function) |
|---|---|---|---|
| One variable | Categorical | Frequency table | table |
| | | Bar chart | geom_bar |
| | Numerical | Histogram | geom_histogram |
| | Spatial | Map | geom_sf |
| | Temporal | Line plot | geom_line |
| Two variables | Two categorical | Frequency table | table |
| | | Stacked/side-by-side bar chart | geom_bar |
| | One numeric, one categorical | Dot plot | geom_point |
| | | Box plot | geom_boxplot |
| | Two numerical | Scatterplot | geom_point |
| > 2 variables | > 2 categorical | Table | table |
| | 2 numeric, one categorical or > 2 numeric | Scatterplot with different colours/symbols/sizes | geom_point |

For a more comprehensive list (including some non-standard graphs), visit the From data to viz website.
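The frequency tables listed above use the base R table function rather than a geom. As a minimal sketch, the toy vectors below (hypothetical values, not taken from the English Housing Survey) show a one-way and a two-way frequency table:

```r
# Toy data: hypothetical tenure and region values, for illustration only
tenure <- c("Owner", "Renter", "Owner", "Renter", "Owner")
region <- c("North", "North", "South", "South", "South")

# One categorical variable: number of observations per category
tenure_counts <- table(tenure)
tenure_counts

# Two categorical variables: cross-tabulation of tenure by region
tenure_by_region <- table(tenure, region)
tenure_by_region
```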

R is very flexible when it comes to visualising data and contains a wide variety of options to customise graphs. This section will focus on the tidyverse package ggplot2 and introduce some of the more commonly used graphical functions and parameters.

7.2 The ggplot2 package

The ggplot2 package implements the ‘grammar of graphics’, a system that aims to describe all statistical graphics in terms of their components or layers. All graphics can be broken down into the same components: the data, a coordinate system (or plot area) and some visual markings of the data. More complex plots may have additional layers but all must contain these three.

For example, if we want to investigate the distribution of tenure types among responses to the English Housing Survey (EHS), we could use a bar chart. The visual markings for a bar chart are one bar per group (in this case, tenure type), where the length of each bar represents the number of observations within that group.

For any visualisation created using ggplot2, we first use the ggplot function to create a coordinate system (a blank plot space) that we can add layers and objects to. Within this function, we specify the data that we wish to display on the coordinate system:

ggplot(data = ehs_tidy)

To add information to this graph, we add a geom layer: a visual representation of the data. There are many different geom objects built into the ggplot2 package (begin typing ?geom into the console to see a list). The geom_bar function is used to create bar charts.

Each geom needs an aesthetic mapping, supplied through the mapping argument together with the aes function, which defines how the variables in the dataset are visualised. In this case, we use the aes function to specify the grouping variable on the x axis, but it can also be used to set the colour, size or symbol based on variable values.

Warning

Although ggplot2 is part of the tidyverse package, it uses a + symbol to add layers to visualisations rather than the pipe %>% we have been using in other packages.

ggplot(data = ehs_tidy) +
  geom_bar(mapping = aes(x = tenure_type))

Graphs appear in the Plots tab in the bottom-right of the RStudio interface and can be opened in a new window using the zoom icon. Graphs in this window can also be copied and pasted into other documents using the export icon and selecting Copy to clipboard.

New graphs will replace existing ones in this window but all graphs created in the current session of R can be explored using the arrow icons.

Graphs can be stored as objects using the <- symbol. These objects can then be saved as picture or PDF files using the ggsave function:

tenure_bar <- ggplot(data = ehs_tidy) +
  # Use geom_bar (not geom_line) so the stored object is the bar chart
  geom_bar(aes(x = tenure_type))

ggsave(tenure_bar, filename = "tenure_bar.png")
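When a graph is destined for a report, it often helps to fix its dimensions and resolution when saving. The sketch below assumes the built-in mpg dataset (shipped with ggplot2) as a stand-in for ehs_tidy, and the object name class_bar, the temporary file path, and the chosen size are illustrative only:

```r
library(ggplot2)

# mpg ships with ggplot2 and stands in for ehs_tidy here
class_bar <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class))

# Write to a temporary folder so the example leaves nothing behind
out_file <- file.path(tempdir(), "class_bar.png")

# width, height and units control the image size; dpi controls resolution
ggsave(filename = out_file, plot = class_bar,
       width = 16, height = 10, units = "cm", dpi = 300)
```

The file type is taken from the extension of filename, so changing ".png" to ".pdf" would save a PDF instead.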

Exercise 7

  1. Choose an appropriate visualisation to check the distribution of the gross income variable from respondents of the English Housing Survey. Comment on your findings.
  2. Based on the output from question 1, generate a summary table giving the minimum and maximum gross income, and an appropriate measure of the centre and spread of this variable.

7.3 Customising visualisations

Visual markings of a ggplot object can be customised as part of the geom function. Arguments that can be adjusted within these geoms include:

  • colour: change the colour (if point or line) or outline (if bar or histogram) of the geom
  • size: change the size of the markings (if point used)
  • shape: change the shape of markings (for points)
  • fill: change the colour of bars in bar charts or histograms
  • linewidth: change the line width
  • linetype: choose the type of line (e.g. dotted)
  • alpha: change the transparency of a geom

These options can be set manually or used to add variables to a visualisation. For example, the distribution of tenure types could be compared between regions by changing the fill of these bars, converting the bar chart into a stacked bar chart. When these options are determined by a variable in the data, they should be added inside the aes wrapper. Options can also be adjusted manually when the arguments are added outside of the aes wrapper.

To convert the previous bar chart into a stacked bar chart, we define fill by the region variable. To make these distinctions easier to see, we can also add a black outline to the bars by manually setting colour:

ggplot(data = ehs_tidy) + 
  # Define the x axis and fill inside aes
  geom_bar(aes(x = tenure_type, fill = region),
           # Manually define colour outside aes
           colour = "black")
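The difference between placing an argument inside or outside aes can be seen by comparing two versions of the same chart. This is a sketch using the built-in mpg dataset as a stand-in for ehs_tidy; the object names set_fill and map_fill are illustrative:

```r
library(ggplot2)

# Setting: fill is outside aes, so every bar gets the same fixed colour
set_fill <- ggplot(data = mpg) +
  geom_bar(aes(x = class), fill = "steelblue")

# Mapping: fill is inside aes, so bars are split and coloured by the
# values of the drv variable, and a legend is added automatically
map_fill <- ggplot(data = mpg) +
  geom_bar(aes(x = class, fill = drv))
```

A common slip is to write a fixed colour inside aes, for example aes(fill = "steelblue"). ggplot2 then treats the text as a one-category variable and picks its own colour for it, rather than using the colour named.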

Style tip

R contains a list of 657 pre-programmed colours that can be used to create palettes (run colours() in the console for a full list). Hexadecimal codes can also be used instead, in the form #rrggbb (where rr (red), gg (green), and bb (blue) are two-digit hexadecimal values between 00 and FF giving the level of intensity of each colour).
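Base R can translate between colour names, red/green/blue intensities, and hexadecimal codes, which can help when building a palette. A short sketch (the object name pure_red is illustrative):

```r
# Number of named colours built into R
length(colours())

# Red, green and blue intensities (0 to 255) of a named colour
col2rgb("firebrick")

# Build a hexadecimal code from red, green and blue intensities
pure_red <- rgb(255, 0, 0, maxColorValue = 255)
pure_red
```

Each pair of hexadecimal digits runs from 00 (none of that colour) to FF (full intensity, 255 in decimal).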

Each geom has different arguments that can be customised to adapt visualisations. For example, geom_bar has the position argument which controls how additional groups are displayed. By default, this argument is set to "stack", which creates a stacked bar chart as we saw in the last example. An alternative would be to set this to position = "dodge", which creates a side-by-side bar chart. Here, the tenure type bars are separated into smaller bars per region, but are displayed next to one another, rather than on top of each other:

ggplot(data = ehs_tidy) + 
  # Define the x axis and fill inside aes
  geom_bar(aes(x = tenure_type, fill = region),
           # Manually define colour outside aes
           colour = "black",
           # Show bars side-by-side instead of stacked
           position = "dodge")

For a more comprehensive list of the options available for the geom you are interested in, check the helpfile (e.g. ?geom_bar).

Warning

Although it may be tempting to add many variables to the same visualisation, be sure that you are not overcomplicating the graph and losing important messages. It is better to have multiple clear (but simpler) visualisations than fewer confusing ones.

A note on colour in visualisations

The use of colour in visualisations can help highlight important messages, add variables to a graph, or make the visualisation stand out. However, unnecessary colours can distract from important data and make a message less clear.

Where colours are useful in a visualisation, ensure the choice of palette is inclusive and accessible. This means ensuring that colours are distinct to everyone, including those with a colour-vision deficiency, and avoiding potentially harmful stereotypes.

To check a colour palette is inclusive, consider using a colour blindness simulator such as this one.

Exercise 8

Choose an appropriate visualisation to investigate the change in employment rate between 2008 and 2024. Generate this visualisation and comment on your findings.

7.4 Scale functions

Scale functions allow us to customise aesthetics defined in geom objects, such as colours and axis labels. They take the form scale_<aesthetic to customise>_<scale of variable>, for example scale_x_continuous or scale_fill_manual.

7.4.1 Customising axes

Scale functions can be used to customise axis titles, limits, breaks, and labels. The choice of scale function is determined by the type of variable displayed on the axis.

For example, if we wanted to customise the x axis of the line graph generated in the previous exercise, showing the employment rate over time, we would use the scale_x_continuous function. Arguments to customise the x or y axes include:

  • name = to change the axis title
  • limits = c(...) sets the axis limits
  • breaks = c(...) defines tick marks
  • labels = c(...) attaches labels to break values
  • transform = transforms the scale that the axis is shown on (in versions of ggplot2 before 3.5.0 this argument is called trans, which newer versions have deprecated in favour of transform).

In this example, we can add labels to the x axis that show which year the time variable represents, making it easier to interpret:

ggplot(data = obr_data) +
  geom_line(aes(x = time, y = employment_rate)) +
  # Add axis title
  scale_x_continuous(name = "Year", 
                     # Add breaks for each year
                     breaks = seq(1, 65, by = 4),
                     # Add labels to breaks
                     labels = 2008:2024)
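The breaks and labels arguments must be the same length, otherwise ggplot2 returns an error. A quick check before plotting can be sketched as follows, assuming (as the step of 4 suggests) that time is a quarterly index running from 1 to 65 and starting in 2008:

```r
# One break at the first quarter of each year (assumed quarterly index)
year_breaks <- seq(1, 65, by = 4)

# One label per break
year_labels <- 2008:2024

# Stop early if the two vectors do not line up
stopifnot(length(year_breaks) == length(year_labels))

# View the pairing of break positions and labels
data.frame(year_breaks, year_labels)
```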

7.4.2 Customising colour scales

There is a wide range of options for customising the colour aesthetics of geoms. These include pre-defined colour palettes, such as scale_colour_viridis_c for continuous variables, or scale_colour_viridis_d for discrete or categorical variables. Viridis colour palettes are designed to be colourblind friendly and print well in grey scale. There are also many R packages containing colour palettes for different scenarios. This website gives a list and preview of all palettes currently available.

Colour palettes can also be created manually for categorical variables using the scale_colour_manual function. Here, the argument values allows us to specify a colour per category.

Where a colour palette will be used across multiple plots, defining this list of colours as a vector and then entering this into scale_fill_manual will reduce repetitive coding. For example, where region is used to group across multiple plots, it will be useful to create a region colour palette:

region_palette <- c("aquamarine2", "blue", "chartreuse2", "coral", "orchid",
                    "firebrick", "gold3", "violetred", "grey50")

ggplot(data = ehs_tidy) + 
  # Define the x axis and fill inside aes
  geom_bar(aes(x = tenure_type, fill = region),
           # Manually define colour outside aes
           colour = "black",
           # Show bars side-by-side instead of stacked
           position = "dodge") +
  # Change legend title and add colour values
  scale_fill_manual(name = "Region", values = region_palette)
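An unnamed palette is matched to categories by position, so colours can shift if a category is missing from a plot. Naming the elements of the vector ties each colour to a specific category. The sketch below uses the built-in mpg dataset and its drv variable as a stand-in, since the region category names are not listed here; drv_palette is an illustrative name:

```r
library(ggplot2)

# Names must match the category values of the variable exactly
drv_palette <- c("4" = "coral", "f" = "gold3", "r" = "orchid")

drv_bar <- ggplot(data = mpg) +
  geom_bar(aes(x = class, fill = drv), colour = "black") +
  # Colours are matched to categories by name rather than by position
  scale_fill_manual(name = "Drive type", values = drv_palette)
```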

Palettes can also be created using gradients: scale_colour_gradient specifies a two-colour gradient from low to high, scale_colour_gradient2 creates a diverging gradient using low, mid, and high colours, and scale_colour_gradientn creates an n-colour gradient.
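The three gradient functions can be compared on the same plot. This is a sketch using the built-in mpg dataset, with colour mapped to the numeric variable cty; the colours and the choice of the median as midpoint are arbitrary:

```r
library(ggplot2)

# Scatterplot with colour mapped to a numeric variable
base_plot <- ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy, colour = cty))

# Two-colour gradient from low to high values
two_colour <- base_plot +
  scale_colour_gradient(low = "gold", high = "firebrick")

# Diverging gradient centred on a chosen midpoint
diverging <- base_plot +
  scale_colour_gradient2(low = "blue", mid = "white", high = "firebrick",
                         midpoint = median(mpg$cty))

# Gradient passing through any number of colours
n_colour <- base_plot +
  scale_colour_gradientn(colours = c("blue", "gold", "firebrick"))
```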

7.5 Other labelling functions

Although axis and legend labels can be updated within scale functions, the labs function exists as an alternative. This function also allows us to add titles and subtitles to visualisations:

labs(x = "x-axis name", y = "y-axis name",
    colour = "Grouping variable name", title = "Main title",
    subtitle = "Subtitle", caption = "Footnote")

The annotate function allows us to add text and other objects to a ggplot object. For example:

annotate("text", x = 50, y = 200, label = "Text label here")

adds "Text label here" to a plot at the coordinates (50, 200), and

annotate("rect", xmin = 0, xmax = 10, ymin = 20, ymax = 50, alpha = 0.2)

adds a semi-transparent rectangle to the graph, spanning 0 to 10 on the x axis and 20 to 50 on the y axis.
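Both functions are added to a plot with + like any other layer. The following sketch uses the built-in mpg dataset; the label text and the rectangle coordinates were chosen by eye for this dataset and are illustrative only:

```r
library(ggplot2)

labelled_plot <- ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy, colour = drv)) +
  # Axis, legend, title and caption text
  labs(x = "Engine displacement (litres)", y = "Highway miles per gallon",
       colour = "Drive type", title = "Fuel efficiency by engine size",
       caption = "Source: mpg dataset from ggplot2") +
  # Shade a region of interest
  annotate("rect", xmin = 5.5, xmax = 7.1, ymin = 22, ymax = 27,
           alpha = 0.2) +
  # Label the shaded region
  annotate("text", x = 6.3, y = 28, label = "Two-seater cars")
```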

Style tip

Make use of annotations in visualisations to enhance important messages and draw readers’ attention.

7.6 Theme functions

The theme function modifies non-data components of the visualisation. For example, the legend position, label fonts, the graph background, and gridlines. There are many options that exist within the theme function (use ?theme to list them all).

Note

Many of the elements that can be customised within the theme function require an element wrapper. This wrapper is determined by the type of object we are customising (e.g. element_text when customising text, element_rect when customising a background, element_blank to remove something). Check ?theme for more information.

One of the most common theme options is legend.position which can be used to move the legend to the top or bottom of the graph space (legend.position = "top" or legend.position = "bottom") or remove the legend completely (legend.position = "none").
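As a brief sketch of these options together, the code below moves the legend and uses two element wrappers inside theme. It uses the built-in mpg dataset as a stand-in, and the choice of elements to change is arbitrary:

```r
library(ggplot2)

themed_bar <- ggplot(data = mpg) +
  geom_bar(aes(x = class, fill = drv)) +
  theme(
    # Move the legend below the plot
    legend.position = "bottom",
    # Text elements are customised with element_text
    legend.title = element_text(size = 12),
    # element_blank removes an element entirely
    panel.grid.minor = element_blank()
  )
```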

ggplot2 also contains a number of pre-defined themes which change non-data elements of the plot to a programmed default. For example, theme_void removes all gridlines and axes, while theme_light changes the graph background to white and the gridlines and axes to light grey:

ggplot(data = ehs_tidy) + 
  # Define the x axis and fill inside aes
  geom_bar(aes(x = tenure_type, fill = region),
           # Manually define colour outside aes
           colour = "black",
           # Show bars side-by-side instead of stacked
           position = "dodge") +
  # Change legend title and add colour values
  scale_fill_manual(name = "Region", values = region_palette) + 
  # Remove gridlines and axes (not recommended!!)
  theme_void()

One benefit of using themes is that all visualisations will be consistent in terms of colour scheme, font size and gridlines. Although there are pre-built themes, we are able to create our own and save them as functions. These can then be used in place of R’s themes.
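One possible way to keep a whole script consistent, rather than adding a theme to every plot, is to change the session default with theme_set. A sketch, using the pre-built theme_light and the built-in mpg dataset as stand-ins:

```r
library(ggplot2)

# theme_set() changes the default theme for the session and
# returns the theme that was previously in place
old_theme <- theme_set(theme_light())

# No theme layer is added here: the session default is applied
# when the plot is drawn
class_bar <- ggplot(data = mpg) +
  geom_bar(aes(x = class))
```

Running theme_set(old_theme) afterwards restores the previous default.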

7.6.1 Creating functions

To create our own function in R, we first give it a name and attach function() followed by curly brackets {}, with the function defined inside those brackets.
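Before applying this to themes, the general pattern can be sketched with a small function. The name pounds_to_thousands and its arguments are hypothetical and chosen only for illustration:

```r
# A function with one required argument (x) and one optional
# argument (digits) that has a default value
pounds_to_thousands <- function(x, digits = 1) {
  # The value of the last expression is returned
  round(x / 1000, digits = digits)
}

# Call the function like any built-in R function
pounds_to_thousands(25500)
```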

For example, to create our own theme function, called theme_dluch, which sets the title font size to 18, the axis and legend titles to size 15, the axis and legend text to size 12, adds major gridlines for the y axis only, and changes the background colours, we use the following:

theme_dluch <- function() {
  # Change plot title size
  theme(plot.title = element_text(size = 18),
        # Change axis title size
        axis.title = element_text(size = 15),
        # Change axis text size
        axis.text = element_text(size = 12),
        # Change legend text
        legend.text = element_text(size = 12),
        legend.title = element_text(size = 15),
        # Change background
        plot.background = element_rect(fill = "thistle"),
        # Change legend background to match
        legend.background = element_rect(fill = "thistle"),
        # Change graph area background
        panel.background = element_rect(fill = "white", 
                                        colour = "black"),
        # Add major gridlines for the y axis (but not x)
        panel.grid.major.y = element_line(colour = "grey55"))
}

The function theme_dluch will now appear in the Environment window and can be added to ggplot objects:

ggplot(data = ehs_tidy) + 
  # Define the x axis and fill inside aes
  geom_bar(aes(x = tenure_type, fill = region),
           # Manually define colour outside aes
           colour = "black",
           # Show bars side-by-side instead of stacked
           position = "dodge") +
  # Change legend title and add colour values
  scale_fill_manual(name = "Region", values = region_palette) +
  # The change of font has made the labels unreadable, wrap them 
  # onto a new line when too long
  scale_x_discrete(labels = label_wrap_gen(12)) +
  labs(x = "Tenure type", y = "Count") +
  # Apply the custom theme defined above
  theme_dluch()

Creating a personalised theme ensures that visualisations are consistent, whilst keeping code concise and reducing repetition.

A note on visualisation styling and accessibility

In all data visualisations, we want to make sure that the data are the most important element. We should aim to reduce all unnecessary clutter and design choices that do not enhance or add context to the data. This includes unnecessary patterns, colours, and gridlines.

Text must be legible to all readers, including those with visual impairments or learning difficulties. All text should be at least 12pt when the visualisation will be printed (e.g. in reports), or 36pt when included in a presentation. Do not choose a font family that is inaccessible.

7.7 Facet functions

Faceting allows us to divide a plot into subplots based on some grouping variable within the data. This allows us to show multiple variables in the same visualisation without risking overloading the plot and losing the intended message.

For example, we could compare the relationship between gross income and tenure type (shown using a boxplot) between regions by faceting the graph by region using the facet_wrap function:

Warning

Remember that the value 100,000 actually represents anyone earning £100,000 or more. To avoid skewing the data, we will remove these values and investigate trends below this threshold.

ehs_tidy %>% 
  # Remove gross income >= £100,000
  filter(gross_income != 100000) %>% 
  # Do not need to specify data, it is already passed through the pipes
  ggplot() +
  geom_boxplot(aes(x = tenure_type, y = gross_income)) +
  scale_x_discrete(labels = label_wrap_gen(12)) +
  labs(x = "Tenure type", y = "Gross income (£)") +
  facet_wrap( ~ region) 
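The layout of the subplots can be adjusted within facet_wrap, for example by fixing the number of columns. A sketch using the built-in mpg dataset as a stand-in for ehs_tidy (the choice of four columns is arbitrary):

```r
library(ggplot2)

faceted_box <- ggplot(data = mpg) +
  geom_boxplot(aes(x = drv, y = hwy)) +
  labs(x = "Drive type", y = "Highway miles per gallon") +
  # One subplot per vehicle class, arranged in four columns
  facet_wrap(~ class, ncol = 4)
```

By default all subplots share the same axes, which makes comparisons between groups easier. Setting scales = "free_y" inside facet_wrap gives each subplot its own y axis, at the cost of that direct comparison.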

Exercise 9

Use an appropriate visualisation to investigate the relationship between house prices and wages between 2008 and now. Ensure that the variables you choose are comparable and treat this as though the final product will be exported into a report (make sure it is clear and looks good!). Interpret your final graph.