Appendix A — Exercise 3 solutions

A.1 Question 1

You have been provided with another .sav file which contains the interview responses from the EHS. Create and save a tidy version of this dataset, ensuring variables are classified as the correct type and names follow the style conventions (if you cannot remember these, check here for a reminder.

The variables we need in the tidy dataset are:

The unique identifier serialanon
The gross household income HYEARGRx
The length of residence lenresb
The weekly rent rentwkx and mortgage mortwkx payments
Whether the property is freehold or leasehold freeLeas

Solution

The first step is to load in the data. However, before this, we need the name of the file. We can look into our documents to get this, but the list.files function will do it for us:

list.files(path = "data")

[1] "Detailed_forecast_tables_Economy_March_2024.xlsx"
[2] "generalfs21_EUL.sav"                             
[3] "interviewfs21_EUL.sav"

From the console, we can copy and paste the file name into the read_spss function:

ehs_interview_tidy <- read_spss("data/interviewfs21_EUL.sav")

Next, we can select just the variables we need to reduce the data size:

ehs_interview_tidy <- read_spss("data/interviewfs21_EUL.sav") %>% 
  select(serialanon, HYEARGRx, lenresb, rentwkx, mortwkx, freeLeas)

We can now explore this data to see which variables need converting, and which are truly numeric:

str(ehs_interview_tidy)

tibble [9,752 × 6] (S3: tbl_df/tbl/data.frame)
 $ serialanon: dbl+lbl [1:9752] 2.02e+10, 2.02e+10, 2.02e+10, 2.02e+10, 2.02e+10, 2.0...
   ..@ label        : chr "Key variable: unique archived identifier"
   ..@ format.spss  : chr "F11.0"
   ..@ display_width: int 13
   ..@ labels       : Named num [1:2] -9 -8
   .. ..- attr(*, "names")= chr [1:2] "Does not apply" "No Answer"
 $ HYEARGRx  : dbl+lbl [1:9752]  38378,  26525,  25273,  51280,  14365,  38955,  2137...
   ..@ label        : chr "Household gross annual income (inc. income from all adult household members)"
   ..@ format.spss  : chr "F8.2"
   ..@ display_width: int 10
   ..@ labels       : Named num 1e+05
   .. ..- attr(*, "names")= chr "£100,000 or more"
 $ lenresb   : dbl+lbl [1:9752] 7, 8, 6, 4, 8, 8, 6, 2, 6, 8, 6, 4, 8, 5, 4, 5, 4, 3,...
   ..@ label        : chr "Length of residence"
   ..@ format.spss  : chr "F8.0"
   ..@ display_width: int 10
   ..@ labels       : Named num [1:10] -9 -8 1 2 3 4 5 6 7 8
   .. ..- attr(*, "names")= chr [1:10] "does not apply" "no answer" "less than 1 year" "one year" ...
 $ rentwkx   : dbl+lbl [1:9752]    NA,    NA,    NA,    NA,    NA,    NA,    NA,    N...
   ..@ label        : chr "Total weekly rent payable (rent plus housing benefit)"
   ..@ format.spss  : chr "F8.2"
   ..@ display_width: int 10
   ..@ labels       : Named num -9
   .. ..- attr(*, "names")= chr "question not applicable - owner occupier and not shared ownership"
 $ mortwkx   : dbl+lbl [1:9752]    NA,    NA,    NA, 184.6,  88.8,    NA,    NA, 313....
   ..@ label        : chr "Weekly mortgage payments"
   ..@ format.spss  : chr "F8.2"
   ..@ display_width: int 10
   ..@ labels       : Named num [1:3] -9 -8 0
   .. ..- attr(*, "names")= chr [1:3] "not applicable - tenant" "unknown" "no payments - own outright"
 $ freeLeas  : dbl+lbl [1:9752]  1,  1,  1,  1,  1,  1,  1,  1, NA,  1,  1, NA,  1, N...
   ..@ label        : chr "Freehold or leasehold"
   ..@ format.spss  : chr "F8.0"
   ..@ display_width: int 10
   ..@ labels       : Named num [1:4] -9 -8 1 2
   .. ..- attr(*, "names")= chr [1:4] "does not apply" "no answer" "freehold" "leasehold"
 - attr(*, "label")= chr "Aggregated File"

As with the general data, all variables are classified as dbl + lbl by R. Of these, the length of residence and freehold/leashold variables appear to by categorical. There are also labels attached to the gross annual income (for those over £100,000) which we need to be aware of when analysing this data.

Therefore, our next step will involve converting the categorical variables into factors:

ehs_interview_tidy <- read_spss("data/interviewfs21_EUL.sav") %>% 
  select(serialanon, HYEARGRx, lenresb, rentwkx, mortwkx, freeLeas) %>% 
  mutate(length_residence = as_factor(lenresb),
         freehold_leasehold = as_factor(freeLeas))

head(ehs_interview_tidy)

# A tibble: 6 × 8
  serialanon  HYEARGRx  lenresb        rentwkx mortwkx freeLeas length_residence
  <dbl+lbl>   <dbl+lbl> <dbl+lbl>      <dbl+l> <dbl+l> <dbl+lb> <fct>           
1 20220000001 38378.    7 [20-29 year… NA       NA     1 [free… 20-29 years     
2 20220000005 26525     8 [30+ years]  NA       NA     1 [free… 30+ years       
3 20220000006 25272.    6 [10-19 year… NA       NA     1 [free… 10-19 years     
4 20220000012 51280.    4 [3-4 years]  NA      185.    1 [free… 3-4 years       
5 20220000013 14365     8 [30+ years]  NA       88.8   1 [free… 30+ years       
6 20220000017 38955     8 [30+ years]  NA       NA     1 [free… 30+ years       
# ℹ 1 more variable: freehold_leasehold <fct>

Finally, we need to rename the existing variables to ensure they are informative and follow the style rules, and remove any unnecessary variables:

ehs_interview_tidy <- read_spss("data/interviewfs21_EUL.sav") %>% 
  select(serialanon, HYEARGRx, lenresb, rentwkx, mortwkx, freeLeas) %>% 
  mutate(length_residence = as_factor(lenresb),
         freehold_leasehold = as_factor(freeLeas)) %>% 
  rename(id = serialanon,
         gross_income = HYEARGRx,
         weekly_rent = rentwkx,
         weekly_mortgage = mortwkx) %>% 
  select(id, gross_income, length_residence, weekly_rent, weekly_mortgage, 
         freehold_leasehold)

A.2 Question 2

Save the tidy interview dataset as a csv file with an appropriate file name.

Solution

write_csv(ehs_interview_tidy, file = "saved_data/ehs_interview_tidy.csv")

A.3 Question 3

Using the new, tidy dataset, answer the following questions:

How many respondents paid weekly rent of between £150 and £300?
How many respondents did not give a response to either the weekly rent or weekly mortgage question?
What is the highest household gross income of these responders?

Solution

For the first two part, we use the filter function to return a subgroup matching the condition, and combine this with the count function that counts the number of rows in a tibble:

ehs_interview_tidy %>% 
  filter(between(weekly_rent, 150, 300)) %>% 
  count()

# A tibble: 1 × 1
      n
  <int>
1   993

ehs_interview_tidy %>% 
  filter(is.na(weekly_rent), is.na(weekly_mortgage)) %>% 
  count()

# A tibble: 1 × 1
      n
  <int>
1  2956

There were 993 respondents that paid weekly rent of between £150 and £300.

There were 2956 respondents that did not give a response to either the weekly rent or mortgage question.

The final part could usually be carried out with the base R max function:

max(ehs_interview_tidy$gross_income)

<labelled<double>[1]>: Household gross annual income (inc. income from all adult household members)
[1] 1e+05

Labels:
 value            label
 1e+05 £100,000 or more

However, the labels attached to the SPSS file showed that a value of 100000 actually represents a group of responders earning at least £100,000. Therefore, we cannot answer this question from the available data. If we were to analyse this variable, we would need to categorise the rest of the data, losing a lot of information. Failure to do this would produce invalid results.