%>%
ehs_tidy filter(!is.na(weekly_rent), !is.na(weekly_mortgage)) %>%
count()
# A tibble: 1 × 1
n
<int>
1 105
How many respondents had both weekly rent and mortgage payments given? What are the potential reasons for this?
We can combine the filter
and count
functions to answer the first part of this question:
%>%
ehs_tidy filter(!is.na(weekly_rent), !is.na(weekly_mortgage)) %>%
count()
# A tibble: 1 × 1
n
<int>
1 105
There were 105 respondents with both weekly rent and mortgage payments.
To understand why this is, we could first view this data (or a summary of the data) to see what other characteristics these respondents share:
%>%
ehs_tidy filter(!is.na(weekly_rent), !is.na(weekly_mortgage)) %>%
summary()
id weighting tenure_type
Min. :2.022e+10 Min. : 305.5 housing association: 0
1st Qu.:2.022e+10 1st Qu.: 628.9 local authority : 0
Median :2.022e+10 Median : 1058.4 owner occupied :105
Mean :2.022e+10 Mean : 1859.3 private rented : 0
3rd Qu.:2.022e+10 3rd Qu.: 2535.4
Max. :2.022e+10 Max. :10568.5
region gross_income length_residence weekly_rent
South East :22 Min. : 9880 two years :27 Min. : 8.30
East :19 1st Qu.: 26595 one year :19 1st Qu.: 49.80
London :19 Median : 39655 3-4 years :15 Median : 69.69
North West :13 Mean : 41908 5-9 years :15 Mean : 86.23
South West :13 3rd Qu.: 52323 less than 1 year: 9 3rd Qu.:120.00
East Midlands: 8 Max. :100000 10-19 years : 8 Max. :219.23
(Other) :11 (Other) :12
weekly_mortgage freehold_leasehold
Min. : 0.0231 freehold :29
1st Qu.: 0.0231 leasehold:65
Median : 60.0000 NA's :11
Mean : 71.6271
3rd Qu.:103.8462
Max. :343.8138
All respondents in this group owned and lived in their own home. Most were leasehold properties, suggesting some of the weekly rent refers to lease payments. Other potential reasons could include shared ownership (which is not given as an option for tenure type), or respondents that lived with renters in the same property.
Combine the weekly rent and mortgage variables into a single weekly payment variable.
Where only one value has been recorded, we want to use this in the new variable. Where both have been recorded, we will need to add the values together to get a weekly total.
There are a few different ways to do this. The first is to include an if_else
statement in the mutate
function, changing how the variable is calculated whether either value is missing or not:
<- ehs_tidy %>%
ehs_tidy_ex4 # if either weekly_rent or weekly_mortgage are missing
mutate(weekly_total = if_else(is.na(weekly_rent) | is.na(weekly_mortgage),
# return the non-missing value
coalesce(weekly_rent, weekly_mortgage),
# if neither are missing, add the variables together
+ weekly_mortgage)) weekly_rent
An alternative would be to use rowwise
which groups data by rows and carries out operations across rows rather than columns. This can be combined with the sum
function which contains the optional na.rm
argument to remove missing values:
<- ehs_tidy %>%
ehs_tidy_ex4 # group data by rows
rowwise() %>%
# add totals by using the sum function which has the argument na.rm to remove missing data
mutate(weekly_total = sum(weekly_rent, weekly_mortgage, na.rm = TRUE))
In this case, we cannot use the rowwise
option as it replaces missing values with zeros. Some of the households missing both rent and mortgage values may be a result of an incomplete survey rather than no payments. To avoid this, we will use if_else
to create the newest variable.
Create a summary table containing the mean, median, standard deviation, and the upper and lower quartiles of the weekly payment (rent and mortgage combined) for each region. What, if anything, can you infer about the distribution of this variable based on the table?
%>%
ehs_tidy_ex4 # Give summaries by region
group_by(region) %>%
# List summaries required
summarise(mean_payment = mean(weekly_total, na.rm = TRUE),
median_payment = median(weekly_total, na.rm = TRUE),
sd_payment = sd(weekly_total, na.rm = TRUE),
lq_payment = quantile(weekly_total, .25, na.rm = TRUE),
uq_payment = quantile(weekly_total, .75, na.rm = TRUE)) %>%
# Be sure to remove the grouping!
ungroup()
# A tibble: 9 × 6
region mean_payment median_payment sd_payment lq_payment uq_payment
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 East 172. 138. 135. 101. 208.
2 East Midlands 144. 121. 100. 92.3 162.
3 London 268. 214. 226. 127. 346.
4 North East 107. 92.3 52.0 80 115.
5 North West 121. 104. 69.2 84 138.
6 South East 202. 166. 153. 115 245.
7 South West 148. 127 93.6 93.2 183.
8 West Midlands 131. 106. 100. 86.3 150
9 Yorkshire and th… 118. 99.2 103. 79.8 133.
There are big differences between most mean and medians across regions, indicating that the data are not normally distributed. If we use the approximate 95% range formula (mean \(\pm\) (2 \(\times\) sd)), we would get negative values for all regions. Negative payments do not make sense in this context, confirming that the data are not normally distributed.
In this case, the median and IQR should be give, not the mean and standard deviation.