09: Further issues in analysis of cohort studies

Author

Thiago Cerqueira Silva

Show the code
library(haven) 
# create table
library(gtsummary)
# data wrangling and other functions
library(tidyverse)
# if you want to have a output similar to stata
library(tidylog)
#cox
library(survival)
# plots survival
library(ggsurvfit)
# Limit significant digits to 2, remove scientific notation
options(digits = 2, scipen = 9)

diet <- read_dta("datasets/DIETLSH.dta") |> 
  mutate(across(where(is.labelled),as_factor))

diet <- diet |> 
  mutate(hieng = factor(hieng,
                                levels = c(0, 1),
                                labels = c("normal", "high-energy")),
                 fibre_cat = cut_number(fibre, 3))

Q1

Use the data in dietlsh.dta to estimate the effect of hieng on chd when the time scale is set equal to time since entry using stcox.

Show the code
# Create Cox model
cox_mod <- coxph(Surv(time = fup, event=chd) ~ hieng, data = diet)

# Calculating 95% CIs for HRs
tbl_regression(cox_mod, exponentiate = T) |> 
  add_n(location="level") |> 
  add_nevent(location="level")
Characteristic N Event N HR 95% CI p-value
hieng




    normal 155 28
    high-energy 182 18 0.52 0.29, 0.95 0.032
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

Q2

Examine the proportionality assumption using a Nelson-Aalen plot. Does the proportionality assumption appear to be valid?

Show the code
survfit2(Surv(time = fup, event=chd) ~ hieng, data = diet) |> 
  ggsurvfit(
    type = "cumhaz"
  )+
  scale_y_log10()

On this time scale the effect of hieng appears to be proportional as the two lines are quite parallel.

Nelson-Aalen plot is the plot in the cumulative hazard. The ggsurvfit function accept multiple statistics: “survival” (KM plot), “risk” (Cumulative incidence), “cumhaz” (NA plot), “cloglog” (complimentary log-log survival). It is necessary to transform the Y axis to the log scale. (The PH assumption is in log)

Q3

Does an interaction test confirm your conclusions?

Show the code
cox.zph(coxph(Surv(time = fup, event=chd) ~ hieng, data = diet))
       chisq df    p
hieng  0.881  1 0.35
GLOBAL 0.881  1 0.35
Show the code
plot(cox.zph(coxph(Surv(time = fup, event=chd) ~ hieng, data = diet)))

There is little evidence that the effect of hieng is not proportional using time in the study.

The answer in STATA praticals is using stsplit-spliting the dataset by time since fup- and doing a interaction. In R, we have cox.zph which fits a model with interaction between time and each covariate in the model. And it is possible to visualize the time-varying coefficient

Q4/Q5

Re-examine Question 1 and Question 2 using attained age as the time-scale. Is the proportionality assumption valid? What are the problems with assessing proportionality using graphical methods when using age as the time-scale?

Assess the proportionality assumption by including an interaction between time period and hieng.

Show the code
cox_mod <- coxph(Surv(time = as.numeric(doe), 
                      time2 = as.numeric(dox),
                      origin = as.numeric(dob),
                      event=chd) ~ hieng, data = diet)
tbl_regression(cox_mod, exponentiate = T) |> 
  add_n(location="level") |> 
  add_nevent(location="level")
Characteristic N Event N HR 95% CI p-value
hieng




    normal 155 28
    high-energy 182 18 0.54 0.30, 0.98 0.043
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

NA plots

Show the code
survfit2(Surv(time = as.numeric(doe)/365.25, 
                      time2 = as.numeric(dox)/365.25,
                      origin = as.numeric(dob)/365.25,
                      event=chd) ~ hieng, data = diet) |> 
  ggsurvfit(
    type = "cumhaz"
  )+
  scale_y_log10()

Interaction plots

Show the code
cox.zph(cox_mod)
       chisq df   p
hieng  0.698  1 0.4
GLOBAL 0.698  1 0.4
Show the code
plot(cox.zph(cox_mod))

The estimated effect of hieng has not really changed but the effect does not appear to be proportional on this scale. However, in using attained age as the time-scale there are few individuals at the beginning of follow-up (the median age for entry into the study was 48.8 years). Therefore when early events occur they are based on few individuals and so the hazard estimates are less reliable here which impacts on the cumulative event rates over the course of follow-up. For this reason Nelson-Aalen plots are most useful when there is no delayed entry e.g. when the time-scale is time since entry into the study.

Q6

Create a categorical variable (fibcat3) containing the fibre content of the diet in thirds. Examine the effect of fibcat3, first using time since entry and then attained age as time-scale.

Show the code
cox_mod_fup <- coxph(Surv(time = fup, 
                      event=chd) ~ fibre_cat, data = diet)
tbl_fup <- tbl_regression(cox_mod_fup, exponentiate = T) |> 
  add_n(location="level") |> 
  add_nevent(location="level")

cox_mod_age <- coxph(Surv(time = as.numeric(doe), 
                      time2 = as.numeric(dox),
                      origin = as.numeric(dob),
                      event=chd) ~ fibre_cat, data = diet)
tbl_age <- tbl_regression(cox_mod_age, exponentiate = T) |> 
  add_n(location="level") |> 
  add_nevent(location="level")

tbl_merge(list(
  tbl_fup,
  tbl_age
  ),
         tab_spanner = c("**Time since entry**", "**Age**") )
Characteristic
Time since entry
Age
N Event N HR 95% CI p-value N Event N HR 95% CI p-value
fibre_cat









    [0.605,1.45] 112 21
112 21
    (1.45,1.83] 110 17 0.82 0.43, 1.56 0.6 110 17 0.79 0.42, 1.50 0.5
    (1.83,5.35] 111 7 0.29 0.12, 0.68 0.005 111 7 0.29 0.12, 0.69 0.005
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

Assess proportionality

Time since entry

Show the code
cox.zph(cox_mod_fup)
          chisq df    p
fibre_cat 0.251  2 0.88
GLOBAL    0.251  2 0.88
Show the code
plot(cox.zph(cox_mod_fup))

Age

Show the code
cox.zph(cox_mod_fup)
          chisq df    p
fibre_cat 0.251  2 0.88
GLOBAL    0.251  2 0.88
Show the code
plot(cox.zph(cox_mod_fup))

Q7

Use the PBC data set: pbc1bas.dta. Set the time and failure variables as in ASME 7, section 3 (i.e. with time in the study as the time scale) and then examine separately the effects of treatment, bilirubin and cirrhosis. What assumptions are you making? Fit a multivariate model with these three variables. What happens to the estimated rate ratios? Is this the most appropriate time scale to use?

Show the code
pbc <- read_dta("datasets/PBC1BAS.DTA")

# Data management
pbc<- pbc |> mutate(death = d,
               treat = factor(treat, levels = c(1, 2), labels = c("placebo", "azath")),
               cenc0 = factor(cenc0, levels = c(0, 1), labels = c("no", "yes")),
               cir0 = factor(cir0, levels = c(0, 1), labels = c("no", "yes")),
               gh0 = factor(gh0, levels = c(0, 1), labels = c("no", "yes")),
               asc0 = factor(asc0, levels = c(0, 1), labels = c("no", "yes"))) |> 
         select(-d)
Show the code
mod_tto <- coxph(Surv(time,death) ~ treat, data = pbc)
mod_tto1 <- coxph(Surv(time,death) ~ treat+logb0, data = pbc)
mod_tto2 <- coxph(Surv(time,death) ~ treat+logb0+cir0, data = pbc)

tbl_tto <- tbl_regression(mod_tto, exponentiate = T) |> 
  add_n(location="level") |> 
  add_nevent(location="level")
tbl_tto1 <- tbl_regression(mod_tto1, exponentiate = T) |> 
  add_n(location="level") |> 
  add_nevent(location="level")
tbl_tto2 <- tbl_regression(mod_tto2, exponentiate = T) |> 
  add_n(location="level") |> 
  add_nevent(location="level")
tbl_merge(list(tbl_tto,tbl_tto1,tbl_tto2),
          tab_spanner = c("**Treat**", "**Treat+bili**","**Treat+bili+cirr**"))
Characteristic
Treat
Treat+bili
Treat+bili+cirr
N Event N HR 95% CI p-value N Event N HR 95% CI p-value N Event N HR 95% CI p-value
treat














    placebo 90 49
90 49
90 49
    azath 94 47 0.86 0.57, 1.28 0.5 94 47 0.65 0.43, 0.99 0.044 94 47 0.63 0.42, 0.96 0.031
Lob Bilirubin at entry




184 96 10.6 6.42, 17.6 <0.001 184 96 11.4 6.68, 19.6 <0.001
cir0














    no









131 58
    yes









53 38 2.50 1.64, 3.82 <0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

We find that there is little evidence for an effect of treatment (HR=0.86 (95% CI 0.57 to 1.28), p=0.45). The assumption is that this effect is constant over the follow-up time period.

Adjusting for a prognostic factor (bilirubin) which was not evenly distributed between the two treatment arms at baseline we obtain, a treatment HR=0.65 (95% CI 0.43 to 0.99). Further adjustment for presence of cirrhosis at entry into the trial has little effect on the estimated effect of treatment HR= 0.63 (0.42 to 0.96)

The underlying assumption is that treatment reduces mortality by about 37% over the entire follow-up period, once the effects of bilirubin and cirrhosis at entry into the trial are taken into account. Time in study (i.e. time since randomisation) would be the appropriate time scale to use as these are data from a randomised controlled trial.