list.files(path = "data")
[1] "Detailed_forecast_tables_Economy_March_2024.xlsx"
[2] "generalfs21_EUL.sav"
[3] "interviewfs21_EUL.sav"
You have been provided with another .sav file which contains the interview responses from the EHS. Create and save a tidy version of this dataset, ensuring variables are classified as the correct type and names follow the style conventions (if you cannot remember these, check here for a reminder.
The variables we need in the tidy dataset are:
serialanon
HYEARGRx
lenresb
rentwkx
and mortgage mortwkx
paymentsfreeLeas
The first step is to load in the data. However, before this, we need the name of the file. We can look into our documents to get this, but the list.files
function will do it for us:
list.files(path = "data")
[1] "Detailed_forecast_tables_Economy_March_2024.xlsx"
[2] "generalfs21_EUL.sav"
[3] "interviewfs21_EUL.sav"
From the console, we can copy and paste the file name into the read_spss
function:
<- read_spss("data/interviewfs21_EUL.sav") ehs_interview_tidy
Next, we can select
just the variables we need to reduce the data size:
<- read_spss("data/interviewfs21_EUL.sav") %>%
ehs_interview_tidy select(serialanon, HYEARGRx, lenresb, rentwkx, mortwkx, freeLeas)
We can now explore this data to see which variables need converting, and which are truly numeric:
str(ehs_interview_tidy)
tibble [9,752 × 6] (S3: tbl_df/tbl/data.frame)
$ serialanon: dbl+lbl [1:9752] 2.02e+10, 2.02e+10, 2.02e+10, 2.02e+10, 2.02e+10, 2.0...
..@ label : chr "Key variable: unique archived identifier"
..@ format.spss : chr "F11.0"
..@ display_width: int 13
..@ labels : Named num [1:2] -9 -8
.. ..- attr(*, "names")= chr [1:2] "Does not apply" "No Answer"
$ HYEARGRx : dbl+lbl [1:9752] 38378, 26525, 25273, 51280, 14365, 38955, 2137...
..@ label : chr "Household gross annual income (inc. income from all adult household members)"
..@ format.spss : chr "F8.2"
..@ display_width: int 10
..@ labels : Named num 1e+05
.. ..- attr(*, "names")= chr "£100,000 or more"
$ lenresb : dbl+lbl [1:9752] 7, 8, 6, 4, 8, 8, 6, 2, 6, 8, 6, 4, 8, 5, 4, 5, 4, 3,...
..@ label : chr "Length of residence"
..@ format.spss : chr "F8.0"
..@ display_width: int 10
..@ labels : Named num [1:10] -9 -8 1 2 3 4 5 6 7 8
.. ..- attr(*, "names")= chr [1:10] "does not apply" "no answer" "less than 1 year" "one year" ...
$ rentwkx : dbl+lbl [1:9752] NA, NA, NA, NA, NA, NA, NA, N...
..@ label : chr "Total weekly rent payable (rent plus housing benefit)"
..@ format.spss : chr "F8.2"
..@ display_width: int 10
..@ labels : Named num -9
.. ..- attr(*, "names")= chr "question not applicable - owner occupier and not shared ownership"
$ mortwkx : dbl+lbl [1:9752] NA, NA, NA, 184.6, 88.8, NA, NA, 313....
..@ label : chr "Weekly mortgage payments"
..@ format.spss : chr "F8.2"
..@ display_width: int 10
..@ labels : Named num [1:3] -9 -8 0
.. ..- attr(*, "names")= chr [1:3] "not applicable - tenant" "unknown" "no payments - own outright"
$ freeLeas : dbl+lbl [1:9752] 1, 1, 1, 1, 1, 1, 1, 1, NA, 1, 1, NA, 1, N...
..@ label : chr "Freehold or leasehold"
..@ format.spss : chr "F8.0"
..@ display_width: int 10
..@ labels : Named num [1:4] -9 -8 1 2
.. ..- attr(*, "names")= chr [1:4] "does not apply" "no answer" "freehold" "leasehold"
- attr(*, "label")= chr "Aggregated File"
As with the general data, all variables are classified as dbl + lbl
by R. Of these, the length of residence and freehold/leashold variables appear to by categorical. There are also labels attached to the gross annual income (for those over £100,000) which we need to be aware of when analysing this data.
Therefore, our next step will involve converting the categorical variables into factors:
<- read_spss("data/interviewfs21_EUL.sav") %>%
ehs_interview_tidy select(serialanon, HYEARGRx, lenresb, rentwkx, mortwkx, freeLeas) %>%
mutate(length_residence = as_factor(lenresb),
freehold_leasehold = as_factor(freeLeas))
head(ehs_interview_tidy)
# A tibble: 6 × 8
serialanon HYEARGRx lenresb rentwkx mortwkx freeLeas length_residence
<dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+l> <dbl+l> <dbl+lb> <fct>
1 20220000001 38378. 7 [20-29 year… NA NA 1 [free… 20-29 years
2 20220000005 26525 8 [30+ years] NA NA 1 [free… 30+ years
3 20220000006 25272. 6 [10-19 year… NA NA 1 [free… 10-19 years
4 20220000012 51280. 4 [3-4 years] NA 185. 1 [free… 3-4 years
5 20220000013 14365 8 [30+ years] NA 88.8 1 [free… 30+ years
6 20220000017 38955 8 [30+ years] NA NA 1 [free… 30+ years
# ℹ 1 more variable: freehold_leasehold <fct>
Finally, we need to rename the existing variables to ensure they are informative and follow the style rules, and remove any unnecessary variables:
<- read_spss("data/interviewfs21_EUL.sav") %>%
ehs_interview_tidy select(serialanon, HYEARGRx, lenresb, rentwkx, mortwkx, freeLeas) %>%
mutate(length_residence = as_factor(lenresb),
freehold_leasehold = as_factor(freeLeas)) %>%
rename(id = serialanon,
gross_income = HYEARGRx,
weekly_rent = rentwkx,
weekly_mortgage = mortwkx) %>%
select(id, gross_income, length_residence, weekly_rent, weekly_mortgage,
freehold_leasehold)
Save the tidy interview dataset as a csv file with an appropriate file name.
write_csv(ehs_interview_tidy, file = "saved_data/ehs_interview_tidy.csv")
Using the new, tidy dataset, answer the following questions:
For the first two part, we use the filter
function to return a subgroup matching the condition, and combine this with the count
function that counts the number of rows in a tibble:
%>%
ehs_interview_tidy filter(between(weekly_rent, 150, 300)) %>%
count()
# A tibble: 1 × 1
n
<int>
1 993
%>%
ehs_interview_tidy filter(is.na(weekly_rent), is.na(weekly_mortgage)) %>%
count()
# A tibble: 1 × 1
n
<int>
1 2956
There were 993 respondents that paid weekly rent of between £150 and £300.
There were 2956 respondents that did not give a response to either the weekly rent or mortgage question.
The final part could usually be carried out with the base R max
function:
max(ehs_interview_tidy$gross_income)
<labelled<double>[1]>: Household gross annual income (inc. income from all adult household members)
[1] 1e+05
Labels:
value label
1e+05 £100,000 or more
However, the labels attached to the SPSS file showed that a value of 100000 actually represents a group of responders earning at least £100,000. Therefore, we cannot answer this question from the available data. If we were to analyse this variable, we would need to categorise the rest of the data, losing a lot of information. Failure to do this would produce invalid results.