09: Further issues in analysis of cohort studies

Author

Thiago Cerqueira Silva

Show the code

library(haven) 
# create table
library(gtsummary)
# data wrangling and other functions
library(tidyverse)
# if you want to have a output similar to stata
library(tidylog)
#cox
library(survival)
# plots survival
library(ggsurvfit)
# Limit significant digits to 2, remove scientific notation
options(digits = 2, scipen = 9)

diet <- read_dta("datasets/DIETLSH.dta") |> 
  mutate(across(where(is.labelled),as_factor))

diet <- diet |> 
  mutate(hieng = factor(hieng,
                                levels = c(0, 1),
                                labels = c("normal", "high-energy")),
                 fibre_cat = cut_number(fibre, 3))

Q1

Use the data in dietlsh.dta to estimate the effect of hieng on chd when the time scale is set equal to time since entry using stcox.

Show the code

# Create Cox model
cox_mod <- coxph(Surv(time = fup, event=chd) ~ hieng, data = diet)

# Calculating 95% CIs for HRs
tbl_regression(cox_mod, exponentiate = T) |> 
  add_n(location="level") |> 
  add_nevent(location="level")

Characteristic	N	Event N	HR	95% CI	p-value
hieng
normal	155	28	—	—
high-energy	182	18	0.52	0.29, 0.95	0.032
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

Q2

Examine the proportionality assumption using a Nelson-Aalen plot. Does the proportionality assumption appear to be valid?

Show the code

survfit2(Surv(time = fup, event=chd) ~ hieng, data = diet) |> 
  ggsurvfit(
    type = "cumhaz"
  )+
  scale_y_log10()

Answer

On this time scale the effect of hieng appears to be proportional as the two lines are quite parallel.

Coding explanation

Nelson-Aalen plot is the plot in the cumulative hazard. The ggsurvfit function accept multiple statistics: “survival” (KM plot), “risk” (Cumulative incidence), “cumhaz” (NA plot), “cloglog” (complimentary log-log survival). It is necessary to transform the Y axis to the log scale. (The PH assumption is in log)

Q3

Does an interaction test confirm your conclusions?

Show the code

cox.zph(coxph(Surv(time = fup, event=chd) ~ hieng, data = diet))

       chisq df    p
hieng  0.881  1 0.35
GLOBAL 0.881  1 0.35

Show the code

plot(cox.zph(coxph(Surv(time = fup, event=chd) ~ hieng, data = diet)))

Answer

There is little evidence that the effect of hieng is not proportional using time in the study.

Coding explanation

The answer in STATA praticals is using stsplit-spliting the dataset by time since fup- and doing a interaction. In R, we have cox.zph which fits a model with interaction between time and each covariate in the model. And it is possible to visualize the time-varying coefficient

Q4/Q5

Re-examine Question 1 and Question 2 using attained age as the time-scale. Is the proportionality assumption valid? What are the problems with assessing proportionality using graphical methods when using age as the time-scale?

Assess the proportionality assumption by including an interaction between time period and hieng.

Show the code

cox_mod <- coxph(Surv(time = as.numeric(doe), 
                      time2 = as.numeric(dox),
                      origin = as.numeric(dob),
                      event=chd) ~ hieng, data = diet)
tbl_regression(cox_mod, exponentiate = T) |> 
  add_n(location="level") |> 
  add_nevent(location="level")

Characteristic	N	Event N	HR	95% CI	p-value
hieng
normal	155	28	—	—
high-energy	182	18	0.54	0.30, 0.98	0.043
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

NA plots

Show the code

survfit2(Surv(time = as.numeric(doe)/365.25, 
                      time2 = as.numeric(dox)/365.25,
                      origin = as.numeric(dob)/365.25,
                      event=chd) ~ hieng, data = diet) |> 
  ggsurvfit(
    type = "cumhaz"
  )+
  scale_y_log10()

Interaction plots

Show the code

cox.zph(cox_mod)

       chisq df   p
hieng  0.698  1 0.4
GLOBAL 0.698  1 0.4

Show the code

plot(cox.zph(cox_mod))

Answer

The estimated effect of hieng has not really changed but the effect does not appear to be proportional on this scale. However, in using attained age as the time-scale there are few individuals at the beginning of follow-up (the median age for entry into the study was 48.8 years). Therefore when early events occur they are based on few individuals and so the hazard estimates are less reliable here which impacts on the cumulative event rates over the course of follow-up. For this reason Nelson-Aalen plots are most useful when there is no delayed entry e.g. when the time-scale is time since entry into the study.

Q6

Create a categorical variable (fibcat3) containing the fibre content of the diet in thirds. Examine the effect of fibcat3, first using time since entry and then attained age as time-scale.

Show the code

cox_mod_fup <- coxph(Surv(time = fup, 
                      event=chd) ~ fibre_cat, data = diet)
tbl_fup <- tbl_regression(cox_mod_fup, exponentiate = T) |> 
  add_n(location="level") |> 
  add_nevent(location="level")

cox_mod_age <- coxph(Surv(time = as.numeric(doe), 
                      time2 = as.numeric(dox),
                      origin = as.numeric(dob),
                      event=chd) ~ fibre_cat, data = diet)
tbl_age <- tbl_regression(cox_mod_age, exponentiate = T) |> 
  add_n(location="level") |> 
  add_nevent(location="level")

tbl_merge(list(
  tbl_fup,
  tbl_age
  ),
         tab_spanner = c("**Time since entry**", "**Age**") )

Characteristic	Time since entry					Age
Characteristic	N	Event N	HR	95% CI	p-value	N	Event N	HR	95% CI	p-value
fibre_cat
[0.605,1.45]	112	21	—	—		112	21	—	—
(1.45,1.83]	110	17	0.82	0.43, 1.56	0.6	110	17	0.79	0.42, 1.50	0.5
(1.83,5.35]	111	7	0.29	0.12, 0.68	0.005	111	7	0.29	0.12, 0.69	0.005
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

Assess proportionality

Time since entry

Show the code

cox.zph(cox_mod_fup)

          chisq df    p
fibre_cat 0.251  2 0.88
GLOBAL    0.251  2 0.88

Show the code

plot(cox.zph(cox_mod_fup))

Age

Show the code

cox.zph(cox_mod_fup)

          chisq df    p
fibre_cat 0.251  2 0.88
GLOBAL    0.251  2 0.88

Show the code

plot(cox.zph(cox_mod_fup))

Q7

Use the PBC data set: pbc1bas.dta. Set the time and failure variables as in ASME 7, section 3 (i.e. with time in the study as the time scale) and then examine separately the effects of treatment, bilirubin and cirrhosis. What assumptions are you making? Fit a multivariate model with these three variables. What happens to the estimated rate ratios? Is this the most appropriate time scale to use?

Show the code

pbc <- read_dta("datasets/PBC1BAS.DTA")

# Data management
pbc<- pbc |> mutate(death = d,
               treat = factor(treat, levels = c(1, 2), labels = c("placebo", "azath")),
               cenc0 = factor(cenc0, levels = c(0, 1), labels = c("no", "yes")),
               cir0 = factor(cir0, levels = c(0, 1), labels = c("no", "yes")),
               gh0 = factor(gh0, levels = c(0, 1), labels = c("no", "yes")),
               asc0 = factor(asc0, levels = c(0, 1), labels = c("no", "yes"))) |> 
         select(-d)

Show the code

mod_tto <- coxph(Surv(time,death) ~ treat, data = pbc)
mod_tto1 <- coxph(Surv(time,death) ~ treat+logb0, data = pbc)
mod_tto2 <- coxph(Surv(time,death) ~ treat+logb0+cir0, data = pbc)

tbl_tto <- tbl_regression(mod_tto, exponentiate = T) |> 
  add_n(location="level") |> 
  add_nevent(location="level")
tbl_tto1 <- tbl_regression(mod_tto1, exponentiate = T) |> 
  add_n(location="level") |> 
  add_nevent(location="level")
tbl_tto2 <- tbl_regression(mod_tto2, exponentiate = T) |> 
  add_n(location="level") |> 
  add_nevent(location="level")
tbl_merge(list(tbl_tto,tbl_tto1,tbl_tto2),
          tab_spanner = c("**Treat**", "**Treat+bili**","**Treat+bili+cirr**"))

Characteristic	Treat					Treat+bili					Treat+bili+cirr
Characteristic	N	Event N	HR	95% CI	p-value	N	Event N	HR	95% CI	p-value	N	Event N	HR	95% CI	p-value
treat
placebo	90	49	—	—		90	49	—	—		90	49	—	—
azath	94	47	0.86	0.57, 1.28	0.5	94	47	0.65	0.43, 0.99	0.044	94	47	0.63	0.42, 0.96	0.031
Lob Bilirubin at entry						184	96	10.6	6.42, 17.6	<0.001	184	96	11.4	6.68, 19.6	<0.001
cir0
no											131	58	—	—
yes											53	38	2.50	1.64, 3.82	<0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

Answer

We find that there is little evidence for an effect of treatment (HR=0.86 (95% CI 0.57 to 1.28), p=0.45). The assumption is that this effect is constant over the follow-up time period.

Adjusting for a prognostic factor (bilirubin) which was not evenly distributed between the two treatment arms at baseline we obtain, a treatment HR=0.65 (95% CI 0.43 to 0.99). Further adjustment for presence of cirrhosis at entry into the trial has little effect on the estimated effect of treatment HR= 0.63 (0.42 to 0.96)

The underlying assumption is that treatment reduces mortality by about 37% over the entire follow-up period, once the effects of bilirubin and cirrhosis at entry into the trial are taken into account. Time in study (i.e. time since randomisation) would be the appropriate time scale to use as these are data from a randomised controlled trial.