install.packages("readxl")
6 Loading and tidying Excel data
In the first part of the course, we saw how SPSS data (.sav
) can be loaded into R using the haven
package. Another common format of data that cannot be loaded into base R is excel (.xlsx
). The package required to read Excel data is the readxl
package.
If this is the first time using the readxl
package, remember to install this to your machine using the install.packages
function:
Once this package has been installed, add it to the list of packages to install at the top of your script file:
::p_load(tidyverse, haven, readxl) pacman
6.1 Loading an Excel sheet into R
As with any file format, we must ensure data are in the correct form before loading them into R, ensuring each column represents a variable, each row represents an observation, and there are no tables or graphics.
Excel files can be a little tricker to manipulate than SPSS and CSV files as they often contain multiple sheets. This is the case for the data that we will be using for this part of the course.
The Excel file that we will be loading contains the Office for Budget Responsibility (OBR) economic and fiscal outlook. This contains many sheets of data, but for this course we will just be focusing on three:
- 1.6 Labour market
- 1.14 National living wage
- 1.17 Housing market
Let’s begin with housing market data, stored in the 18th sheet, labelled “1.17”. This sheet can be selected in the read_xlsx
function using the sheet
argument.
The housing market sheet shows information over different time scales: first by quarters, then years, and then across pairs of years. For this example, we will extract information measured quarterly (rows 4 - 88). The argument range
allows us to define the range of cells (by columns and rows) to extract.
Finally, we can see that the column headers are not in an appropriate format for R: they contain spaces, brackets, and are very long! There are two approaches we will consider to overcome this.
The first is to remove the column names completely (by not including them in the range
argument and setting col_names = FALSE
within the read_xlsx
function) and add them manually, using the setNames
function.
Setting names manually can take a long time and a lot of typing if there are many variables. An alternative to this manual approach is to include them in the range
of the read_xlsx
function, and use an R function to ‘clean’ them, making them follow the style guide.
The janitor
package has been designed to format inputed data to ensure it follows the Tidyverse style guide. The clean_names
function can be applied to a data frame or tibble to adapt variable names in this way.
The following code loads the housing market sheet and manually sets the variable names:
# Return file names from the data folder
list.files(path = "data")
[1] "Detailed_forecast_tables_Economy_March_2024.xlsx"
[2] "generalfs21_EUL.sav"
[3] "interviewfs21_EUL.sav"
<-
housing_market read_xlsx("data/Detailed_forecast_tables_Economy_March_2024.xlsx",
# Specify the sheet and range of cells to keep
sheet = "1.17",range = "B4:J88",
# Remove column names (too messy)
col_names = FALSE) %>%
# Add variable names manually
setNames(c("period", "hpi_comp15", "hpi_prev_year",
"residential_property_transactions",
"private_enterprise_housing_starts",
"private_enterprise_housing_comp",
"housing_stock", "net_additions_housing_stock",
"turnover_rate"))
The following code loads the same data but uses the janitor
package.
Do not run this code without installing and loading the janitor
package first. We will not run this during the course, but it is included for future reference.
<-
housing_market_alt read_xlsx("data/Detailed_forecast_tables_Economy_March_2024.xlsx",
# Specify the sheet and range of cells to keep
sheet = "1.17",range = "B3:J88") %>%
# Removes spaces, special characters, all lower case, etc.
clean_names()
6.2 Splitting variables
In the current dataset, the time variable is given as a character and so is not recognised as ordered or temporal by R. To overcome this, we can split the variable to create separate year and quarter variables.
The str_sub
function from tidyverse’s stringr
package extracts elements based on their position in a string of characters. This can be used to return the first 4 digits to a new year
variable, and the final digit to a new quarter
variable:
<- housing_market %>%
housing_market # Don't forget to convert the string to numberic
mutate(year = as.numeric(str_sub(period, start = 1, end = 4)),
quarter = as.numeric(str_sub(period,
# Use - to work from the end of the string
start = -1L, end = -1L))) %>%
# Remove the original period variable
select(-period)
Exercise 5
- Load in the OBR’s quarterly labour market data (sheet 1.6), keeping the following variables:
- Period
- Employment rate (%)
- Average earning growth (%)
- Average earning index
- Productivity per hour index
- Real wage product
- Real consumption wage
Split the period data into separate year and quarter variables, ensure that all variable names follow Tidyverse’s style guide. Name this object labour_market
.
- Load the OBR’s national living wage data (sheet 1.14), keep as an object named
living_wage
.
The function paste0
is a base R function that combines arguments separated by commas into a single string without spaces. For example,
paste0("abc", 1, 10, "_2010")
[1] "abc110_2010"
6.3 Transforming data
The labour market and housing market data are currently considered in what is known as long format, with many rows and fewer variables. The alternative to this format, wide format, can be seen in the living wage data, which has many variables and very few (only one!) row. Sometimes we may wish to convert between these variables, either to join them to other datasets (as is the case here), or to carry out an analysis or visualisation that requires a certain format. These conversions are carried out using the pivot_longer
or pivot_wider
functions.
There are many ways to pivot data within R (see the helpfile ?pivot_longer
for a full list of arguments), and the setup of this function tends to differ for every situation. For worked examples and a more detailed explanation of the function’s capabilities, enter vignette("pivot")
into the console.
For our data, we will need to convert the wide-format living wage data to a long-format so we are able to join it to the other data. This will create a new dataset with 2 variables: year
and living_wage
, with a row per year. The year
variable will be taken from the wide data names, and the living_wage
variable will come from the wide data values:
<- living_wage %>%
living_wage_long # First, select the columns that we wish to pivot (all of them)
pivot_longer(cols = everything(),
# Move the old variable names to a new year variable
names_to = "year",
# Remove the prefix from the old variable names
names_prefix = "year_",
# Convert the new year variable to numeric
names_transform = as.numeric,
# Take the old values and create a new living_wage variable
values_to = "living_wage")
Exercise 6
- Combine all three OBR datasets (housing market, labour market and living wage) together to create one complete dataset,
obr_data
.