15: Analysis of quantitative data

Author

Thiago Cerqueira Silva

This computer practical uses the whitehal.dta data set, a subset of data from the Whitehall cohort study. It contains data on risk factors for ischaemic heart disease collected in male civil servants between 1967-69. The variables we are interested in were all collected at entry to the study and so the data can be thought of as cross-sectional. Systolic blood pressure (sbp) is the quantitative outcome, and the aim of the practical is to quantify the relationship between sbp and occupational grade (grade4).

These are the variables of interest: * Outcome: systolic blood pressure (continuous: sbp) * Exposure: job grade (categorical: grade4) * Confounder: age (continuous: agein)

Note that in the data grade4 is coded 1=admin; 2=professional/executive; 3=clerical; 4=other

Show the code

library(haven) 
# create table
library(gtsummary)
# data wrangling and other functions
library(tidyverse)
# if you want to have a output similar to stata
library(tidylog)
#cox
library(survival)

# Limit significant digits to 3, remove scientific notation (most of the time, scipen=999 removes all time)
options(digits = 3, scipen = 9)

Q1

Find the mean (systolic) blood pressure by categories of grade4

Show the code

white <- read_dta("datasets/WHITEHAL.DTA") |> 
  mutate(across(where(is.labelled),as_factor))

Show the code

white |> 
  mutate(grade4 = as.factor(grade4)) |> 
  group_by(grade4) |> 
  summarise(mean_sbp = mean(sbp),
            sd_sbp = sd(sbp),
            median_sbp = median(sbp),
            iqr_sbp = IQR(sbp))

grade4	mean_sbp	sd_sbp	median_sbp	iqr_sbp
1	133	19.1	132	22.0
2	135	19.5	133	27.0
3	139	23.9	136	31.0
4	139	24.8	137	34.5

Q2

Estimate the regression model for sbp with grade categories as explanatory variables. What is the expected sbp with 95% confidence intervals:

for a man in grade 1?
for a man in grade 4?

Compare these estimates to your results for question 1.

Show the code

white <- white |> 
  mutate(grade4 = as.factor(grade4))

lm_model <- lm(sbp ~ grade4, data=white)

marginaleffects::avg_predictions(lm_model, variables = "grade4") |> as_tibble()

grade4	estimate	std.error	statistic	s.value	conf.low	conf.high
1	133	2.271	58.7	Inf	129	138
2	135	0.629	214.2	Inf	133	136
3	139	1.193	116.7	Inf	137	142
4	139	1.583	87.5	Inf	135	142

Show the code

# you can also get direct results fitting a model without intercept
#lm_no_intercept <- lm(sbp ~ -1 + grade4, data=white)
#broom::tidy(lm_no_intercept, conf.int=T)

Coding explanation

The code marginaleffects::avg_predictions generates the prediction of the model by each level of the variable defined in “variables”. Also, if you fit a model without intercept, it will provide the same answers (in this case without other variables)

Q3

Using the output of the model you fitted in question 2, do you have evidence to suggest that systolic blood pressure is associated with occupational grade?

Show the code

empty_model <-lm(sbp ~ 1, data = white)
anova(empty_model, lm_model)

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
1676	740195	NA	NA	NA	NA
1673	733272	3	6923	5.26	0.001

Answer

The anova (F test) indicates that compared to the null model, the new model with 3 more parameters (Df=3) has significantly improved the fit (F=5.27 and p=0.0013)

Q4

Study the association between age at entry (agein) and sbp. Some of the tools you may wish to use are scatter plots, summaries of sbp by categories of agein, and regression models treating agein as linear, quadratic or categorical.

Show the code

white |> 
  ggplot(aes(agein, sbp))+
  geom_point()+
  geom_smooth() # add a LOESS - Locally estimated scatterplot smoothing - curve

Show the code

# recode age

white <- white |> 
  mutate(
    agegroup = case_when(
      agein>40 & agein<=45 ~ "41-45",
      agein<=50 ~ "46-50",
      agein <= 55 ~ "51-55",
      agein <= 60 ~ "56-60",
      agein <= 70 ~ "61-70"
    )
  )

white |> 
  group_by(agegroup) |> 
    summarise(mean_sbp = mean(sbp),
            sd_sbp = sd(sbp),
            median_sbp = median(sbp),
            iqr_sbp = IQR(sbp))

agegroup	mean_sbp	sd_sbp	median_sbp	iqr_sbp
41-45	130	17.2	129	22.0
46-50	132	19.4	130	26.0
51-55	135	19.0	132	25.8
56-60	139	22.2	137	29.0
61-70	146	24.3	144	31.2

Show the code

# Models
#Linear
lm_lin <- lm(sbp ~ agein, data = white)
#marginaleffects::plot_predictions(lm_lin, condition = "agein")
# Quadratic
lm_qd <- lm(sbp ~ agein + I(agein*agein), data = white)

marginaleffects::plot_predictions(lm_qd, condition = "agein")+
  geom_point(data = white,aes(agein,sbp), alpha=0.1, color="blue")+
  labs(title="Prediction - Age squared")

Show the code

# compare models
anova(lm_lin, lm_qd)

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
1675	697619	NA	NA	NA	NA
1674	694908	1	2711	6.53	0.011

Answer

The F-test indicate that including the quadratic term improves the fit.

Q5

Before looking at the data, do you think that age may confound the relationship between grade4 and sbp? Now fit a linear regression model to find out if age is a confounder or not. Is there evidence to suggest that occupational grade is independently associated with systolic blood pressure (i.e. after adjustment for age)? Write a brief conclusion of your findings about the association between occupational grade and systolic blood pressure.

Show the code

lm_model_f <- lm(sbp ~ grade4 + agein +I(agein*agein), data=white)
broom::tidy(lm_model_f, conf.int=T)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	167.975	28.80	5.832	0.000	111.485	224.466
grade42	0.374	2.30	0.163	0.871	-4.135	4.883
grade43	1.482	2.54	0.583	0.560	-3.502	6.466
grade44	-0.211	2.76	-0.076	0.939	-5.617	5.195
agein	-2.000	1.10	-1.813	0.070	-4.164	0.164
I(agein * agein)	0.026	0.01	2.479	0.013	0.005	0.046

Show the code

anova(lm_model_f,
      lm(sbp ~ agein + I(agein*agein), white))

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
1671	694491	NA	NA	NA	NA
1674	694908	-3	-417	0.335	0.8

Show the code

white |> 
  ggplot(aes(grade4,agein, color=grade4))+
  geom_boxplot()

Answer

Our conclusion is that systolic blood pressure appears to be associated with occupational grade but that this is only as a result of lower grade workers (coded 3, 4) being, on average, older. The real association is between age and systolic blood pressure.

Q6

Obtain standardised residuals of your model and use these to check that the model assumptions are reasonable.

Show the code

white <- white |> 
  mutate(std_res = rstandard(lm_model_f))

white |> 
  ggplot(aes(std_res))+
  geom_histogram(color="white")

Show the code

white |> 
  ggplot()+
 geom_qq(aes(sample=std_res)) +
  geom_abline(color = "red") +
  coord_fixed()

Show the code

# you can get a overall of diagnostics plots using performance::check_model
#performance::check_model(lm_model_f) # try yourself

Answer

A histogram of the residuals and the Normal quantile plot both show some positive skew, but not too bad considering the size of the sample. It isn’t necessary to plot both a histogram and a Normal quantile plot but both are shown here for demonstration.

The residuals versus fitted plot shows no obvious relationship between the residuals and the fitted values, and the variance seems to be constant over the fitted values (perhaps some increasing variability but difficult to say). A log-transformation of the outcome variable can remove the skew in the distribution of the residuals - see question 7.

Q7

You probably found that the model residuals that you obtained in question 6 looked a bit skewed. A log-transformation of the outcome variable often removes skew.

Generate a new variable of log-transformed sbp
Estimate the regression model for lsbp with grade categories as explanatory variables, equivalent to the model you fitted in question 2.

Show the code

white <- white |> 
  mutate(log_sbp = log(sbp))
lm_model_log <- lm(log_sbp ~ grade4, data=white)
broom::tidy(lm_model_log, conf.int=T)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	4.882	0.016	299.814	0.000	4.850	4.914
grade42	0.010	0.017	0.601	0.548	-0.023	0.043
grade43	0.040	0.018	2.156	0.031	0.004	0.076
grade44	0.034	0.020	1.703	0.089	-0.005	0.073

Obtain standardised residuals of this model and use these to check that the model assumptions are reasonable. Are the assumptions more appropriate now that you have log-transformed sbp?

Show the code

white <- white |> 
  mutate(std_res_log = rstandard(lm_model_log))

white |> 
  ggplot(aes(std_res_log))+
  geom_histogram(color="white")

Show the code

white |> 
  ggplot()+
 geom_qq(aes(sample=std_res_log)) +
  geom_abline(color = "red") +
  coord_fixed()

Answer

The residuals are more symmetrical now that sbp has been log-transformed. The plot of the residuals versus fitted shows no obvious pattern, and the variance of the residuals does not seem to be very different across grades. The model assumptions are now more appropriate. Note that there are now only 4 different fitted values, because the only explanatory variables in the model are categories of grade. In question 6 the fitted values depended on grade and age and were therefore continuous.

The parameter estimates from a linear regression model with a log-transformed outcome can be back-transformed by taking the exponential of the estimate. Compute the back-transformed estimate for occupational grade 4 compared to occupational grade 1

Show the code

broom::tidy(lm_model_log, conf.int=T, exponentiate=T)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	131.94	0.016	299.814	0.000	127.791	136.22
grade42	1.01	0.017	0.601	0.548	0.977	1.04
grade43	1.04	0.018	2.156	0.031	1.004	1.08
grade44	1.03	0.020	1.703	0.089	0.995	1.07

Answer

The back-transformed estimate for grade 4 compared to grade 1 is 1.034. This means that sbp is expected to be 3.4% higher (or 1.034 times higher) in workers in grade 4 compared to workers in grade 1

\[ log(sbp) = a + \beta_2*grade_2+\beta_3*grade_3+\beta_4*grade_4 \newline \text{for }grade_{4}\text{ the other betas will be 0, so} log(sbp) = a + \beta_4*grade_4 \newline log(sbp) - a = \beta_4*grade_4 \newline a = log(sbp|grade_1) \newline log(sbp) - log(sbp|grade_1) = \beta_4*grade_4 \newline grade_4=1 \newline log(sbp) - log(sbp|grade_1) = \beta_4 \newline log((sbp|grade_4)/(sbp|grade_1)) = \beta_4 \newline (sbp|grade_4)/(sbp|grade_1) = exp(\beta_4) \]