8: Cox Regression

Author

Thiago Cerqueira Silva

Show the code
# read "dta" files
library(haven) 
# create table
library(gtsummary)
# data wrangling and other functions
library(tidyverse)
# if you want to have a output similar to stata
library(tidylog)
#cox
library(survival)
# plots survival
library(ggsurvfit)
# Limit significant digits to 2, remove scientific notation
options(digits = 2, scipen = 9)

# Import the dataset
trin <- read_dta("datasets/TRINMLSH.DTA")

#This contains data from a cohort study on cardiovascular risk factors and mortality among ~300 men from Trinidad.
# Some data wrangling

trin <- trin |> 
  mutate(
    ethgp = factor(ethgp,
                      levels = c(1:5),
                      labels = c("African", "Indian", "European", "mixed", "Chin/Sem")),
    alc = factor(alc,
                      levels = c(0:3),
                      labels = c("none", "1-4/wk", "5-14/wk", ">=15/wk")),
    smoke3 = case_when(
      smokenum == 0 ~ "non-smok",
      smokenum == 1 ~ "ex-smok",
      !is.na(smokenum) ~ "smoker"
    ),
    smoke3 = fct_relevel(smoke3, "non-smok"),
    smokenum = factor(smokenum,
                      levels = c(0:5),
                      labels = c("non-smok", "ex-smok", "1-9/d", "10-19/d", "20-29/d", ">=30/d")),
    chdstart = factor(chdstart,
                      levels = c(0, 1),
                      labels = c("no", "yes")
    ))

Q1

The variables timein and timeout hold the dates of entry and exit into the study, while death is the indicator for mortality from any cause. Use stset with these variables, setting timein to be the origin (i.e. sort the records according to follow-up time, as in Figure 4).

You don’t need to do any of this (stset) in R. Each regression can have its own time setting!

Q2

Examine the effect of smoking on all-cause mortality using strate (or stptime if you prefer) and stcox. You may wish to recode smokenum into a smaller number of categories, say: 0=non-smoker, 1=ex-smoker, 2=current smoker.

We can now examine the smoking-specific mortality rates (per 1,000 person-years). Let’s first use the classical technique and then let’s use Cox regression.

Show the code
# Calculate rates
pyears(Surv(time = years, event=death) ~ smoke3, data = trin, scale = 1) %>%
  summary(n = F, rate = T, ci.r = T, scale = 1000)

Call: pyears(formula = Surv(time = years, event = death) ~ smoke3, data = trin, scale = 1)

number of observations = 317

smoke3 Events Time Event rate CI (rate)
non-smok 30 984 30 21 - 44
ex-smok 18 462 39 23 - 62
smoker 40 749 53 38 - 73
Show the code
# Note that Surv() can take time data in two different formats: either a combination of data of entry and data of exit, or as a time # difference. In this case, `years` codes this time difference, so we'll use it.

# Create Cox model
cox_smok3 <- coxph(Surv(time = years, event=death) ~ smoke3, data = trin)

# Calculating 95% CIs for HRs
tbl_regression(cox_smok3, exponentiate = T) |> 
  add_n(location="level") |> 
  add_nevent(location="level")
Characteristic N Event N HR 95% CI p-value
smoke3




    non-smok 140 30
    ex-smok 68 18 1.29 0.72, 2.32 0.4
    smoker 109 40 1.75 1.09, 2.81 0.021
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio
  • N is the size of the group
  • Event N is the number of events in that group

Q3

Using the rates and rate ratios in part (2), can you see a trend in the relationship between smoking category and all-cause mortality? How would you assess this more formally? How would you present the results from your analyses in parts (2) and (3)?

Show the code
# Create Cox model
cox_smok3 <- coxph(Surv(time = years, event=death) ~ as.numeric(smoke3), data = trin)

# Calculating 95% CIs for HRs
tbl_regression(cox_smok3, exponentiate = T) |> 
  add_n(location="level") |> 
  add_nevent(location="level")
Characteristic N Event N HR 95% CI p-value
as.numeric(smoke3) 317 88 1.32 1.04, 1.68 0.020
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

You can assess the trend formally by treating the smoke variable as numeric (ordinal in this case), so the model will evaluate if the increase in each point is associated with increase in the hazard.

Q4

Consider the possible confounding effect of current age on the relationship between smoking and mortality. By examining the data (but without performing any further modelling) would you expect a Cox model controlling for current age to give different results to one controlling for time in the study? Now perform a Cox model to investigate this.

Note that in question 2 the Cox model has adjusted for time in the study i.e. everyone is followed up from the time they enter the study and when an event occurs the comparison is with others who have the same amount of follow-up and not with others of the same age. If there are differences in age between the three groups then we would expect an analysis controlling for current age to show confounding (since mortality increases with age). One indicator of this would be the age at entry in the three groups.

Show the code
trin |> 
  tbl_summary(include = ageent, by = smoke3) |> 
  modify_caption("Age by smoke status")
Age by smoke status
Characteristic non-smok
N = 1401
ex-smok
N = 681
smoker
N = 1091
Age in years at first survey 64.52 (62.32, 67.68) 65.32 (62.47, 68.73) 64.40 (62.48, 66.44)
1 Median (Q1, Q3)
Show the code
# Survival object set for current age
# Cox using the timeage as the time 
cox_smok3_age <- coxph(Surv(time=as.numeric(timein),time2=as.numeric(timeout),
                             event = death,
                             origin = as.numeric(timebth)) ~ smoke3, data = trin)

tbl_regression(cox_smok3_age, exponentiate = T) |> 
  add_n(location="level") |> 
  add_nevent(location="level")
Characteristic N Event N HR 95% CI p-value
smoke3




    non-smok 140 30
    ex-smok 68 18 1.28 0.71, 2.29 0.4
    smoker 109 40 1.77 1.10, 2.85 0.018
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

Q5

Using PBC1BAS dataset

Show the code
# some data wrangling
# Read in the pbc1bas.dta dataset 
pbc <- read_dta("datasets/PBC1BAS.DTA")

# Data management
pbc<- pbc |> mutate(death = d,
               treat = factor(treat, levels = c(1, 2), labels = c("placebo", "azath")),
               cenc0 = factor(cenc0, levels = c(0, 1), labels = c("no", "yes")),
               cir0 = factor(cir0, levels = c(0, 1), labels = c("no", "yes")),
               gh0 = factor(gh0, levels = c(0, 1), labels = c("no", "yes")),
               asc0 = factor(asc0, levels = c(0, 1), labels = c("no", "yes"))) |> 
         select(-d)

Use a Poisson model to assess the relationship between treatment and mortality adjusting for baseline bilirubin (logb0). How do the results compare to those from using a Cox model?

Show the code
# Poisson model
mod_pois <- glm(death ~ offset(log(time)) + treat + logb0, family = poisson, data = pbc) 

tbl_pois <- tbl_regression(mod_pois,exponentiate = T)

# Cox model
mod_cox <- coxph(Surv(time,death) ~ treat + logb0, data = pbc)
tbl_cox <- tbl_regression(mod_cox,exponentiate = T)

tbl_merge(list(tbl_pois,tbl_cox))
Characteristic
Table 1
Table 2
IRR 95% CI p-value HR 95% CI p-value
treat





    placebo

    azath 0.73 0.49, 1.10 0.13 0.65 0.43, 0.99 0.044
Lob Bilirubin at entry 7.23 4.64, 11.3 <0.001 10.6 6.42, 17.6 <0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio, IRR = Incidence Rate Ratio
Show the code
tbl_merge(list(tbl_pois,tbl_cox),
          tab_spanner = c("**Poisson**", "**Cox**"))
Characteristic
Poisson
Cox
IRR 95% CI p-value HR 95% CI p-value
treat





    placebo

    azath 0.73 0.49, 1.10 0.13 0.65 0.43, 0.99 0.044
Lob Bilirubin at entry 7.23 4.64, 11.3 <0.001 10.6 6.42, 17.6 <0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio, IRR = Incidence Rate Ratio

The results from the Cox model provide more evidence for an impact of treatment (from the hazard ratios, 95% CIs and the p-values). However, the Cox model adjusts for time in the study as well as baseline bilirubin whereas the Poisson model adjusts for baseline bilirubin only.

Q6

In order to make a like for like comparison with the Cox model, what would be a more appropriate analysis using streg (survSplit in R / other functions (see Epi package)) and what do the results from this analysis indicate?

Show the code
# Split the survival object
pbc_split <- survSplit(Surv(time,death) ~ .,
                       data = pbc,
                       cut = c(2, 4, 6),
                       episode = "period",
                       start = "tstart",
                       end = "tstop")

# Fix age and create follow-up by band
pbc_split <- pbc_split |> 
  mutate(
    age = age + tstart,
    t_fup = tstop - tstart
  )


# Factorise variable and label values
pbc_split$period <- factor(pbc_split$period,
                           levels = c(1, 2, 3, 4),
                           labels = c("0-2y", "2-4y", "4-6y", "6-12y"))
glm(death ~ offset(log(t_fup)) + treat + logb0 + period,   # period goes here
    family = poisson(),
    data = pbc_split) |> 
  tbl_regression(exponentiate = T)
Characteristic IRR 95% CI p-value
treat


    placebo
    azath 0.64 0.42, 0.96 0.032
logb0 11.1 6.78, 18.4 <0.001
period


    0-2y
    2-4y 1.21 0.70, 2.05 0.5
    4-6y 2.46 1.38, 4.28 0.002
    6-12y 4.67 2.39, 8.84 <0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

If you split the time in a more granular way, the results will be even closer!

Show the code
# Split the survival object
pbc_split <- survSplit(Surv(time,death) ~ .,
                       data = pbc,
                       cut = c(2,3,4,5,6,7,8,9),
                       episode = "period",
                       start = "tstart",
                       end = "tstop")

# Fix age and create follow-up by band
pbc_split <- pbc_split |> 
  mutate(
    age = age + tstart,
    t_fup = tstop - tstart
  )


# Factorise variable and label values
pbc_split$period <- as.factor(pbc_split$period)
glm(death ~ offset(log(t_fup)) + treat + logb0 + period,   # period goes here
    family = poisson(),
    data = pbc_split) |> 
  tbl_regression(exponentiate = T, include = treat)
Characteristic IRR 95% CI p-value
treat


    placebo
    azath 0.65 0.43, 0.98 0.040
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

Q7

Use a Cox regression model to estimate the effect of grade (coded: 1=high; 2=low) using first the follow-up time scale. Then check whether current age is a confounder using both Cox and Poisson regression.

Show the code
whitehall <- read_stata("datasets/WHITEHAL.DTA")

# Factorise job grade
whitehall$grade <- factor(whitehall$grade,
                          levels = c(1, 2),
                          labels = c("higher", "lower"))
Show the code
# Cox
cox_time <- coxph(Surv(
  as.numeric(timein),
  as.numeric(timeout),
  event = chd
) ~ grade, data = whitehall)

tbl_time <- tbl_regression(cox_time, exponentiate = T)
# Cox
cox_age <- coxph(Surv(
  as.numeric(timein),
  as.numeric(timeout),
  event = chd,
  origin = as.numeric(timebth)
) ~ grade, data = whitehall)
tbl_age <- tbl_regression(cox_age, exponentiate = T)

tbl_merge(list(tbl_time,tbl_age),
         tab_spanner = c("**Time since followup**", "**Age**") )
Characteristic
Time since followup
Age
HR 95% CI p-value HR 95% CI p-value
grade





    higher

    lower 2.04 1.48, 2.81 <0.001 1.41 1.01, 1.95 0.043
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio
Show the code
# Split the survival object
split_white <- survSplit(Surv(
  as.numeric(timein),
  as.numeric(timeout),
  event = chd,
  origin = as.numeric(timebth)
) ~ .,
                         data = whitehall,
                         cut = c(50*365.25, 60*365.25, 70*365.25,
                                 80*365.25),
                         episode = "age",
start = "tstart",
end = "tstop")

# Factorise variable and label values
split_white$age <- factor(split_white$age,
                          levels = c(1, 2, 3, 4,5),
                          labels = c("<=50", "51-60", "61-70", 
                                     "71-80",
                                     ">80"))

split_white <- split_white  |> 
  mutate(
    pyears=as.numeric(tstop - tstart))

#Fit a Poisson model
glm(chd ~ offset(log(pyears)) + grade + age,   
    family = poisson(),
    data = split_white) |> 
  tbl_regression(exponentiate = T)
Characteristic IRR 95% CI p-value
grade


    higher
    lower 1.45 1.04, 2.01 0.027
age


    <=50
    51-60 7.67 1.63, 137 0.045
    61-70 22.1 4.90, 390 0.002
    71-80 36.6 7.95, 649 <0.001
    >80 87.7 12.9, 1,723 <0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

Optional

Think whether you should test for interaction between grade and current age. If so how would you do this using a Poisson model and what are the problems of assessing such an interaction using a Cox model?

Show the code
#Fit a Poisson model
poisson_without <- glm(chd ~ offset(log(pyears)) + grade + age,   
    family = poisson(),
    data = split_white)
poisson_interaction <- glm(chd ~ offset(log(pyears)) + grade * age,   
    family = poisson(),
    data = split_white)
lmtest::lrtest(poisson_without,poisson_interaction)
#Df LogLik Df Chisq Pr(>Chisq)
6 -748 NA NA NA
10 -741 4 13 0.01

You can also test that with cox regression. If the model is using age as the underlying timescale. The survival packages has inbuilt function for that. The cox.zph function will test proportionality of all the predictors in the model by creating interactions with time using the transformation of time (default is log) specified in the transform option. And you can also plot the effect by time

Show the code
cox.zph(cox_age)
       chisq df     p
grade   6.23  1 0.013
GLOBAL  6.23  1 0.013
Show the code
plot(cox.zph(cox_age))