15: Analysis of quantitative data

Author

Thiago Cerqueira Silva

This computer practical uses the whitehal.dta data set, a subset of data from the Whitehall cohort study. It contains data on risk factors for ischaemic heart disease collected in male civil servants between 1967-69. The variables we are interested in were all collected at entry to the study and so the data can be thought of as cross-sectional. Systolic blood pressure (sbp) is the quantitative outcome, and the aim of the practical is to quantify the relationship between sbp and occupational grade (grade4).

These are the variables of interest: * Outcome: systolic blood pressure (continuous: sbp) * Exposure: job grade (categorical: grade4) * Confounder: age (continuous: agein)

Note that in the data grade4 is coded 1=admin; 2=professional/executive; 3=clerical; 4=other

Show the code
library(haven) 
# create table
library(gtsummary)
# data wrangling and other functions
library(tidyverse)
# if you want to have a output similar to stata
library(tidylog)
#cox
library(survival)

# Limit significant digits to 3, remove scientific notation (most of the time, scipen=999 removes all time)
options(digits = 3, scipen = 9)

Q1

Find the mean (systolic) blood pressure by categories of grade4

Show the code
white <- read_dta("datasets/WHITEHAL.DTA") |> 
  mutate(across(where(is.labelled),as_factor))
Show the code
white |> 
  mutate(grade4 = as.factor(grade4)) |> 
  group_by(grade4) |> 
  summarise(mean_sbp = mean(sbp),
            sd_sbp = sd(sbp),
            median_sbp = median(sbp),
            iqr_sbp = IQR(sbp))
grade4 mean_sbp sd_sbp median_sbp iqr_sbp
1 133 19.1 132 22.0
2 135 19.5 133 27.0
3 139 23.9 136 31.0
4 139 24.8 137 34.5

Q2

Estimate the regression model for sbp with grade categories as explanatory variables. What is the expected sbp with 95% confidence intervals:

  1. for a man in grade 1?
  2. for a man in grade 4?

Compare these estimates to your results for question 1.

Show the code
white <- white |> 
  mutate(grade4 = as.factor(grade4))

lm_model <- lm(sbp ~ grade4, data=white)

marginaleffects::avg_predictions(lm_model, variables = "grade4") |> as_tibble()
grade4 estimate std.error statistic p.value s.value conf.low conf.high
1 133 2.271 58.7 0 Inf 129 138
2 135 0.629 214.2 0 Inf 133 136
3 139 1.193 116.7 0 Inf 137 142
4 139 1.583 87.5 0 Inf 135 142
Show the code
# you can also get direct results fitting a model without intercept
#lm_no_intercept <- lm(sbp ~ -1 + grade4, data=white)
#broom::tidy(lm_no_intercept, conf.int=T)

The code marginaleffects::avg_predictions generates the prediction of the model by each level of the variable defined in “variables”. Also, if you fit a model without intercept, it will provide the same answers (in this case without other variables)

Q3

Using the output of the model you fitted in question 2, do you have evidence to suggest that systolic blood pressure is associated with occupational grade?

Show the code
empty_model <-lm(sbp ~ 1, data = white)
anova(empty_model, lm_model)
Res.Df RSS Df Sum of Sq F Pr(>F)
1676 740195 NA NA NA NA
1673 733272 3 6923 5.26 0.001

The anova (F test) indicates that compared to the null model, the new model with 3 more parameters (Df=3) has significantly improved the fit (F=5.27 and p=0.0013)

Q4

Study the association between age at entry (agein) and sbp. Some of the tools you may wish to use are scatter plots, summaries of sbp by categories of agein, and regression models treating agein as linear, quadratic or categorical.

Show the code
white |> 
  ggplot(aes(agein, sbp))+
  geom_point()+
  geom_smooth() # add a LOESS - Locally estimated scatterplot smoothing - curve

Show the code
# recode age

white <- white |> 
  mutate(
    agegroup = case_when(
      agein>40 & agein<=45 ~ "41-45",
      agein<=50 ~ "46-50",
      agein <= 55 ~ "51-55",
      agein <= 60 ~ "56-60",
      agein <= 70 ~ "61-70"
    )
  )

white |> 
  group_by(agegroup) |> 
    summarise(mean_sbp = mean(sbp),
            sd_sbp = sd(sbp),
            median_sbp = median(sbp),
            iqr_sbp = IQR(sbp))
agegroup mean_sbp sd_sbp median_sbp iqr_sbp
41-45 130 17.2 129 22.0
46-50 132 19.4 130 26.0
51-55 135 19.0 132 25.8
56-60 139 22.2 137 29.0
61-70 146 24.3 144 31.2
Show the code
# Models
#Linear
lm_lin <- lm(sbp ~ agein, data = white)
#marginaleffects::plot_predictions(lm_lin, condition = "agein")
# Quadratic
lm_qd <- lm(sbp ~ agein + I(agein*agein), data = white)

marginaleffects::plot_predictions(lm_qd, condition = "agein")+
  geom_point(data = white,aes(agein,sbp), alpha=0.1, color="blue")+
  labs(title="Prediction - Age squared")

Show the code
# compare models
anova(lm_lin, lm_qd)
Res.Df RSS Df Sum of Sq F Pr(>F)
1675 697619 NA NA NA NA
1674 694908 1 2711 6.53 0.011

The F-test indicate that including the quadratic term improves the fit.

Q5

Before looking at the data, do you think that age may confound the relationship between grade4 and sbp? Now fit a linear regression model to find out if age is a confounder or not. Is there evidence to suggest that occupational grade is independently associated with systolic blood pressure (i.e. after adjustment for age)? Write a brief conclusion of your findings about the association between occupational grade and systolic blood pressure.

Show the code
lm_model_f <- lm(sbp ~ grade4 + agein +I(agein*agein), data=white)
broom::tidy(lm_model_f, conf.int=T)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 167.975 28.80 5.832 0.000 111.485 224.466
grade42 0.374 2.30 0.163 0.871 -4.135 4.883
grade43 1.482 2.54 0.583 0.560 -3.502 6.466
grade44 -0.211 2.76 -0.076 0.939 -5.617 5.195
agein -2.000 1.10 -1.813 0.070 -4.164 0.164
I(agein * agein) 0.026 0.01 2.479 0.013 0.005 0.046
Show the code
anova(lm_model_f,
      lm(sbp ~ agein + I(agein*agein), white))
Res.Df RSS Df Sum of Sq F Pr(>F)
1671 694491 NA NA NA NA
1674 694908 -3 -417 0.335 0.8
Show the code
white |> 
  ggplot(aes(grade4,agein, color=grade4))+
  geom_boxplot()

Our conclusion is that systolic blood pressure appears to be associated with occupational grade but that this is only as a result of lower grade workers (coded 3, 4) being, on average, older. The real association is between age and systolic blood pressure.

Q6

Obtain standardised residuals of your model and use these to check that the model assumptions are reasonable.

Show the code
white <- white |> 
  mutate(std_res = rstandard(lm_model_f))

white |> 
  ggplot(aes(std_res))+
  geom_histogram(color="white")

Show the code
white |> 
  ggplot()+
 geom_qq(aes(sample=std_res)) +
  geom_abline(color = "red") +
  coord_fixed()

Show the code
# you can get a overall of diagnostics plots using performance::check_model
#performance::check_model(lm_model_f) # try yourself

A histogram of the residuals and the Normal quantile plot both show some positive skew, but not too bad considering the size of the sample. It isn’t necessary to plot both a histogram and a Normal quantile plot but both are shown here for demonstration.

The residuals versus fitted plot shows no obvious relationship between the residuals and the fitted values, and the variance seems to be constant over the fitted values (perhaps some increasing variability but difficult to say). A log-transformation of the outcome variable can remove the skew in the distribution of the residuals - see question 7.

Q7

You probably found that the model residuals that you obtained in question 6 looked a bit skewed. A log-transformation of the outcome variable often removes skew.

  1. Generate a new variable of log-transformed sbp

  2. Estimate the regression model for lsbp with grade categories as explanatory variables, equivalent to the model you fitted in question 2.

Show the code
white <- white |> 
  mutate(log_sbp = log(sbp))
lm_model_log <- lm(log_sbp ~ grade4, data=white)
broom::tidy(lm_model_log, conf.int=T)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 4.882 0.016 299.814 0.000 4.850 4.914
grade42 0.010 0.017 0.601 0.548 -0.023 0.043
grade43 0.040 0.018 2.156 0.031 0.004 0.076
grade44 0.034 0.020 1.703 0.089 -0.005 0.073
  1. Obtain standardised residuals of this model and use these to check that the model assumptions are reasonable. Are the assumptions more appropriate now that you have log-transformed sbp?
Show the code
white <- white |> 
  mutate(std_res_log = rstandard(lm_model_log))

white |> 
  ggplot(aes(std_res_log))+
  geom_histogram(color="white")

Show the code
white |> 
  ggplot()+
 geom_qq(aes(sample=std_res_log)) +
  geom_abline(color = "red") +
  coord_fixed()

The residuals are more symmetrical now that sbp has been log-transformed. The plot of the residuals versus fitted shows no obvious pattern, and the variance of the residuals does not seem to be very different across grades. The model assumptions are now more appropriate. Note that there are now only 4 different fitted values, because the only explanatory variables in the model are categories of grade. In question 6 the fitted values depended on grade and age and were therefore continuous.

  1. The parameter estimates from a linear regression model with a log-transformed outcome can be back-transformed by taking the exponential of the estimate. Compute the back-transformed estimate for occupational grade 4 compared to occupational grade 1
Show the code
broom::tidy(lm_model_log, conf.int=T, exponentiate=T)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 131.94 0.016 299.814 0.000 127.791 136.22
grade42 1.01 0.017 0.601 0.548 0.977 1.04
grade43 1.04 0.018 2.156 0.031 1.004 1.08
grade44 1.03 0.020 1.703 0.089 0.995 1.07

The back-transformed estimate for grade 4 compared to grade 1 is 1.034. This means that sbp is expected to be 3.4% higher (or 1.034 times higher) in workers in grade 4 compared to workers in grade 1

\[ log(sbp) = a + \beta_2*grade_2+\beta_3*grade_3+\beta_4*grade_4 \newline \text{for }grade_{4}\text{ the other betas will be 0, so} log(sbp) = a + \beta_4*grade_4 \newline log(sbp) - a = \beta_4*grade_4 \newline a = log(sbp|grade_1) \newline log(sbp) - log(sbp|grade_1) = \beta_4*grade_4 \newline grade_4=1 \newline log(sbp) - log(sbp|grade_1) = \beta_4 \newline log((sbp|grade_4)/(sbp|grade_1)) = \beta_4 \newline (sbp|grade_4)/(sbp|grade_1) = exp(\beta_4) \]