Appendix B — Exercise 4 solutions

B.1 Question 1

How many respondents had both weekly rent and mortgage payments given? What are the potential reasons for this?

Solution

We can combine the filter and count functions to answer the first part of this question:

ehs_tidy %>% 
  filter(!is.na(weekly_rent), !is.na(weekly_mortgage)) %>% 
  count()
# A tibble: 1 × 1
      n
  <int>
1   105

There were 105 respondents with both weekly rent and mortgage payments.

To understand why this is, we could first view this data (or a summary of the data) to see what other characteristics these respondents share:

ehs_tidy %>% 
  filter(!is.na(weekly_rent), !is.na(weekly_mortgage)) %>% 
  summary()
       id              weighting                    tenure_type 
 Min.   :2.022e+10   Min.   :  305.5   housing association:  0  
 1st Qu.:2.022e+10   1st Qu.:  628.9   local authority    :  0  
 Median :2.022e+10   Median : 1058.4   owner occupied     :105  
 Mean   :2.022e+10   Mean   : 1859.3   private rented     :  0  
 3rd Qu.:2.022e+10   3rd Qu.: 2535.4                            
 Max.   :2.022e+10   Max.   :10568.5                            
                                                                
           region    gross_income            length_residence  weekly_rent    
 South East   :22   Min.   :  9880   two years       :27      Min.   :  8.30  
 East         :19   1st Qu.: 26595   one year        :19      1st Qu.: 49.80  
 London       :19   Median : 39655   3-4 years       :15      Median : 69.69  
 North West   :13   Mean   : 41908   5-9 years       :15      Mean   : 86.23  
 South West   :13   3rd Qu.: 52323   less than 1 year: 9      3rd Qu.:120.00  
 East Midlands: 8   Max.   :100000   10-19 years     : 8      Max.   :219.23  
 (Other)      :11                    (Other)         :12                      
 weekly_mortgage    freehold_leasehold
 Min.   :  0.0231   freehold :29      
 1st Qu.:  0.0231   leasehold:65      
 Median : 60.0000   NA's     :11      
 Mean   : 71.6271                     
 3rd Qu.:103.8462                     
 Max.   :343.8138                     
                                      

All respondents in this group owned and lived in their own home. Most were leasehold properties, suggesting some of the weekly rent refers to lease payments. Other potential reasons could include shared ownership (which is not given as an option for tenure type), or respondents that lived with renters in the same property.

B.2 Question 2

Combine the weekly rent and mortgage variables into a single weekly payment variable.

Solution

Where only one value has been recorded, we want to use this in the new variable. Where both have been recorded, we will need to add the values together to get a weekly total.

There are a few different ways to do this. The first is to include an if_else statement in the mutate function, changing how the variable is calculated whether either value is missing or not:

ehs_tidy_ex4 <- ehs_tidy %>% 
  # if either weekly_rent or weekly_mortgage are missing
  mutate(weekly_total = if_else(is.na(weekly_rent) | is.na(weekly_mortgage), 
                                # return the non-missing value
                                coalesce(weekly_rent, weekly_mortgage), 
                                # if neither are missing, add the variables together
                                weekly_rent + weekly_mortgage))

An alternative would be to use rowwise which groups data by rows and carries out operations across rows rather than columns. This can be combined with the sum function which contains the optional na.rm argument to remove missing values:

ehs_tidy_ex4 <- ehs_tidy %>% 
  # group data by rows
  rowwise() %>% 
  # add totals by using the sum function which has the argument na.rm to remove missing data
  mutate(weekly_total = sum(weekly_rent, weekly_mortgage, na.rm = TRUE))
Warning

In this case, we cannot use the rowwise option as it replaces missing values with zeros. Some of the households missing both rent and mortgage values may be a result of an incomplete survey rather than no payments. To avoid this, we will use if_else to create the newest variable.

B.3 Question 3

Create a summary table containing the mean, median, standard deviation, and the upper and lower quartiles of the weekly payment (rent and mortgage combined) for each region. What, if anything, can you infer about the distribution of this variable based on the table?

Solution

ehs_tidy_ex4 %>% 
  # Give summaries by region
  group_by(region) %>% 
  # List summaries required
  summarise(mean_payment = mean(weekly_total, na.rm = TRUE),
            median_payment = median(weekly_total, na.rm = TRUE),
            sd_payment = sd(weekly_total, na.rm = TRUE),
            lq_payment = quantile(weekly_total, .25, na.rm = TRUE),
            uq_payment = quantile(weekly_total, .75, na.rm = TRUE)) %>%
  # Be sure to remove the grouping!
  ungroup()
# A tibble: 9 × 6
  region            mean_payment median_payment sd_payment lq_payment uq_payment
  <fct>                    <dbl>          <dbl>      <dbl>      <dbl>      <dbl>
1 East                      172.          138.       135.       101.        208.
2 East Midlands             144.          121.       100.        92.3       162.
3 London                    268.          214.       226.       127.        346.
4 North East                107.           92.3       52.0       80         115.
5 North West                121.          104.        69.2       84         138.
6 South East                202.          166.       153.       115         245.
7 South West                148.          127         93.6       93.2       183.
8 West Midlands             131.          106.       100.        86.3       150 
9 Yorkshire and th…         118.           99.2      103.        79.8       133.

There are big differences between most mean and medians across regions, indicating that the data are not normally distributed. If we use the approximate 95% range formula (mean \(\pm\) (2 \(\times\) sd)), we would get negative values for all regions. Negative payments do not make sense in this context, confirming that the data are not normally distributed.

In this case, the median and IQR should be give, not the mean and standard deviation.