# read "dta" fileslibrary(haven) # create tablelibrary(gtsummary)# data wrangling and other functionslibrary(tidyverse)# if you want to have a output similar to statalibrary(tidylog)#coxlibrary(survival)# plots survivallibrary(ggsurvfit)# Limit significant digits to 2, remove scientific notationoptions(digits =2, scipen =9)# Import the datasettrin <-read_dta("datasets/TRINMLSH.DTA")#This contains data from a cohort study on cardiovascular risk factors and mortality among ~300 men from Trinidad.# Some data wranglingtrin <- trin |>mutate(ethgp =factor(ethgp,levels =c(1:5),labels =c("African", "Indian", "European", "mixed", "Chin/Sem")),alc =factor(alc,levels =c(0:3),labels =c("none", "1-4/wk", "5-14/wk", ">=15/wk")),smoke3 =case_when( smokenum ==0~"non-smok", smokenum ==1~"ex-smok",!is.na(smokenum) ~"smoker" ),smoke3 =fct_relevel(smoke3, "non-smok"),smokenum =factor(smokenum,levels =c(0:5),labels =c("non-smok", "ex-smok", "1-9/d", "10-19/d", "20-29/d", ">=30/d")),chdstart =factor(chdstart,levels =c(0, 1),labels =c("no", "yes") ))
Q1
The variables timein and timeout hold the dates of entry and exit into the study, while death is the indicator for mortality from any cause. Use stset with these variables, setting timein to be the origin (i.e. sort the records according to follow-up time, as in Figure 4).
You don’t need to do any of this (stset) in R. Each regression can have its own time setting!
Q2
Examine the effect of smoking on all-cause mortality using strate (or stptime if you prefer) and stcox. You may wish to recode smokenum into a smaller number of categories, say: 0=non-smoker, 1=ex-smoker, 2=current smoker.
We can now examine the smoking-specific mortality rates (per 1,000 person-years). Let’s first use the classical technique and then let’s use Cox regression.
Call: pyears(formula = Surv(time = years, event = death) ~ smoke3, data = trin, scale = 1)
number of observations = 317
smoke3
Events
Time
Event rate
CI (rate)
non-smok
30
984
30
21 - 44
ex-smok
18
462
39
23 - 62
smoker
40
749
53
38 - 73
Show the code
# Note that Surv() can take time data in two different formats: either a combination of data of entry and data of exit, or as a time # difference. In this case, `years` codes this time difference, so we'll use it.# Create Cox modelcox_smok3 <-coxph(Surv(time = years, event=death) ~ smoke3, data = trin)# Calculating 95% CIs for HRstbl_regression(cox_smok3, exponentiate = T) |>add_n(location="level") |>add_nevent(location="level")
Characteristic
N
Event N
HR
95% CI
p-value
smoke3
non-smok
140
30
—
—
ex-smok
68
18
1.29
0.72, 2.32
0.4
smoker
109
40
1.75
1.09, 2.81
0.021
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio
N is the size of the group
Event N is the number of events in that group
Q3
Using the rates and rate ratios in part (2), can you see a trend in the relationship between smoking category and all-cause mortality? How would you assess this more formally? How would you present the results from your analyses in parts (2) and (3)?
Show the code
# Create Cox modelcox_smok3 <-coxph(Surv(time = years, event=death) ~as.numeric(smoke3), data = trin)# Calculating 95% CIs for HRstbl_regression(cox_smok3, exponentiate = T) |>add_n(location="level") |>add_nevent(location="level")
Characteristic
N
Event N
HR
95% CI
p-value
as.numeric(smoke3)
317
88
1.32
1.04, 1.68
0.020
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio
Answer
You can assess the trend formally by treating the smoke variable as numeric (ordinal in this case), so the model will evaluate if the increase in each point is associated with increase in the hazard.
Q4
Consider the possible confounding effect of current age on the relationship between smoking and mortality. By examining the data (but without performing any further modelling) would you expect a Cox model controlling for current age to give different results to one controlling for time in the study? Now perform a Cox model to investigate this.
Answer
Note that in question 2 the Cox model has adjusted for time in the study i.e. everyone is followed up from the time they enter the study and when an event occurs the comparison is with others who have the same amount of follow-up and not with others of the same age. If there are differences in age between the three groups then we would expect an analysis controlling for current age to show confounding (since mortality increases with age). One indicator of this would be the age at entry in the three groups.
Show the code
trin |>tbl_summary(include = ageent, by = smoke3) |>modify_caption("Age by smoke status")
Age by smoke status
Characteristic
non-smok N = 1401
ex-smok N = 681
smoker N = 1091
Age in years at first survey
64.52 (62.32, 67.68)
65.32 (62.47, 68.73)
64.40 (62.48, 66.44)
1 Median (Q1, Q3)
Show the code
# Survival object set for current age# Cox using the timeage as the time cox_smok3_age <-coxph(Surv(time=as.numeric(timein),time2=as.numeric(timeout),event = death,origin =as.numeric(timebth)) ~ smoke3, data = trin)tbl_regression(cox_smok3_age, exponentiate = T) |>add_n(location="level") |>add_nevent(location="level")
Characteristic
N
Event N
HR
95% CI
p-value
smoke3
non-smok
140
30
—
—
ex-smok
68
18
1.28
0.71, 2.29
0.4
smoker
109
40
1.77
1.10, 2.85
0.018
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio
Use a Poisson model to assess the relationship between treatment and mortality adjusting for baseline bilirubin (logb0). How do the results compare to those from using a Cox model?
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio, IRR = Incidence Rate Ratio
Answer
The results from the Cox model provide more evidence for an impact of treatment (from the hazard ratios, 95% CIs and the p-values). However, the Cox model adjusts for time in the study as well as baseline bilirubin whereas the Poisson model adjusts for baseline bilirubin only.
Q6
In order to make a like for like comparison with the Cox model, what would be a more appropriate analysis using streg (survSplit in R / other functions (see Epi package)) and what do the results from this analysis indicate?
Show the code
# Split the survival objectpbc_split <-survSplit(Surv(time,death) ~ .,data = pbc,cut =c(2, 4, 6),episode ="period",start ="tstart",end ="tstop")# Fix age and create follow-up by bandpbc_split <- pbc_split |>mutate(age = age + tstart,t_fup = tstop - tstart )# Factorise variable and label valuespbc_split$period <-factor(pbc_split$period,levels =c(1, 2, 3, 4),labels =c("0-2y", "2-4y", "4-6y", "6-12y"))glm(death ~offset(log(t_fup)) + treat + logb0 + period, # period goes herefamily =poisson(),data = pbc_split) |>tbl_regression(exponentiate = T)
Characteristic
IRR
95% CI
p-value
treat
placebo
—
—
azath
0.64
0.42, 0.96
0.032
logb0
11.1
6.78, 18.4
<0.001
period
0-2y
—
—
2-4y
1.21
0.70, 2.05
0.5
4-6y
2.46
1.38, 4.28
0.002
6-12y
4.67
2.39, 8.84
<0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio
If you split the time in a more granular way, the results will be even closer!
Show the code
# Split the survival objectpbc_split <-survSplit(Surv(time,death) ~ .,data = pbc,cut =c(2,3,4,5,6,7,8,9),episode ="period",start ="tstart",end ="tstop")# Fix age and create follow-up by bandpbc_split <- pbc_split |>mutate(age = age + tstart,t_fup = tstop - tstart )# Factorise variable and label valuespbc_split$period <-as.factor(pbc_split$period)glm(death ~offset(log(t_fup)) + treat + logb0 + period, # period goes herefamily =poisson(),data = pbc_split) |>tbl_regression(exponentiate = T, include = treat)
Characteristic
IRR
95% CI
p-value
treat
placebo
—
—
azath
0.65
0.43, 0.98
0.040
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio
Q7
Use a Cox regression model to estimate the effect of grade (coded: 1=high; 2=low) using first the follow-up time scale. Then check whether current age is a confounder using both Cox and Poisson regression.
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio
Optional
Think whether you should test for interaction between grade and current age. If so how would you do this using a Poisson model and what are the problems of assessing such an interaction using a Cox model?
Show the code
#Fit a Poisson modelpoisson_without <-glm(chd ~offset(log(pyears)) + grade + age, family =poisson(),data = split_white)poisson_interaction <-glm(chd ~offset(log(pyears)) + grade * age, family =poisson(),data = split_white)lmtest::lrtest(poisson_without,poisson_interaction)
#Df
LogLik
Df
Chisq
Pr(>Chisq)
6
-748
NA
NA
NA
10
-741
4
13
0.01
You can also test that with cox regression. If the model is using age as the underlying timescale. The survival packages has inbuilt function for that. The cox.zph function will test proportionality of all the predictors in the model by creating interactions with time using the transformation of time (default is log) specified in the transform option. And you can also plot the effect by time