06: Stratifying on time for cohort studies

Author

Thiago Cerqueira Silva

Show the code
# read "dta" files
library(haven) 
# create table
library(gtsummary)
# data wrangling and other functions
library(tidyverse)
# if you want to have a output similar to stata
library(tidylog)
#cox
library(survival)
# plots survival
library(ggsurvfit)
# Limit significant digits to 2, remove scientific notation
options(digits = 2, scipen = 9)

Q1

Open a log file, change to the directory where your ASME data have been copied and then read the Whitehall data, whitehal.dta. In this practical we are interested in the effect of job grade on CHD mortality.

Show the code
whitehall <- read_stata("datasets/WHITEHAL.DTA")

# data wrangling
whitehall <- whitehall |> 
  mutate(
    id = as.integer(id),
    all = as.integer(all), # better keep numeric (not a problem in this case)
    chd = as.integer(chd),
    grade4 = case_when( # recoding as character using case_when, most models will recognize characters as factors with some exemptions (mgcv doesnt)
      grade4 == 1 ~ "admin",
      grade4 == 2 ~ "profess",
      grade4 == 3 ~ "clerical",
      grade4 == 4 ~ "other"
    ),
    smok = case_when(
      smok == 1 ~ "never",
      smok == 2 ~ "ex",
      smok == 3 ~ "1-14/day",
      smok == 4 ~ "15-24/day",
      smok == 5 ~ "25+/day"
    ),
    grade = case_when(
      grade == 1 ~ "higher",
      grade == 2 ~ "lower"),
    cholgrp = as.factor(cholgrp),
    sbpgrp = as.factor(sbpgrp))

Q2

Compute the rates for the two grades of employment categories using the commands below. pyears()

In days

Show the code
# Calculate rates
pyears(Surv(
  time = as.numeric(timein),
  time2 = as.numeric(timeout),
  event = chd
) ~ grade, data = whitehall, scale = 1) |>
  summary(n = F, rate = T, ci.r = T, scale = 1000)

Call: pyears(formula = Surv(time = as.numeric(timein), time2 = as.numeric(timeout), event = chd) ~ grade, data = whitehall, scale = 1)

number of observations = 1677

grade Events Time Event rate CI (rate)
higher 90 7429107 0.012 0.0097 - 0.015
lower 64 2653754 0.024 0.0186 - 0.031

In years

Show the code
# Calculate rates
pyears(Surv(
  time = as.numeric(timein),
  time2 = as.numeric(timeout),
  event = chd
) ~ grade, data = whitehall) |>
  summary(n = F, rate = T, ci.r = T, scale = 1000)

Call: pyears(formula = Surv(time = as.numeric(timein), time2 = as.numeric(timeout), event = chd) ~ grade, data = whitehall)

number of observations = 1677

grade Events Time Event rate CI (rate)
higher 90 20340 4.4 3.6 - 5.4
lower 64 7266 8.8 6.8 - 11.2

Calculate rates stratified by exposure (the two grades of employment: grade).

You create a survival object with Surv(); it contains duration of follow-up and status at end of follow-up. You dont need to setup it globally as in STATA. You have freedom to use different timescales for each model

You then calculate stratified rates pyears(): in the formula, you use a Surv object, and then the stratification; you then pipe this into summary()

The argument “scale” from pyears() is to scaling the results, for example if your time is in days and want to report in years use scale = 365.25 (the default value)

Calculate the RR by hand (from the two rates shown on the screen).

\(\frac{8.8}{4.4}=2\)

Q3

Now, set again the time and outcome variables with stset but this time, use the time of birth as the origin, i.e. organise the data according to current age.

Show the code
# Calculate rates
pyears(Surv(
  time = as.numeric(timein),
  time2 = as.numeric(timeout),
  origin = as.numeric(timebth),
  event = chd
) ~ grade, data = whitehall) |> 
  summary(n = F, rate = T, ci.r = T, scale = 1000)

Call: pyears(formula = Surv(time = as.numeric(timein), time2 = as.numeric(timeout), origin = as.numeric(timebth), event = chd) ~ grade, data = whitehall)

number of observations = 1677

grade Events Time Event rate CI (rate)
higher 90 20340 4.4 3.6 - 5.4
lower 64 7266 8.8 6.8 - 11.2
Show the code
# I dont know any easy alternative to see the summary (similar to STATA)
summary(Surv(as.numeric(whitehall$timein) / 365.25,
  as.numeric(whitehall$timeout) / 365.25,
  origin = as.numeric(whitehall$timebth) / 365.25,
  whitehall$chd
))
     start         stop        status    
 Min.   :40   Min.   :44   Min.   :0.00  
 1st Qu.:47   1st Qu.:64   1st Qu.:0.00  
 Median :52   Median :68   Median :0.00  
 Mean   :52   Mean   :69   Mean   :0.09  
 3rd Qu.:57   3rd Qu.:73   3rd Qu.:0.00  
 Max.   :69   Max.   :86   Max.   :1.00  

Q4, Q5 and Q6

Next we need to split the individual follow-up times into intervals specific to different agebands.

Use the stsplit command to create 5-years groups of current age between age 50 and 80 and 10-year groups for the youngest and oldest groups.

Show the code
# dividing by 365.25 to transform in years instead days
# the variables used in time, time2,origin will be removed from the new dataset
white_split <- survSplit(
  Surv(
    time = as.numeric(timein) / 365.25,
    time2 = as.numeric(timeout) / 365.25,
    event = chd,
    origin = as.numeric(timebth) / 365.25
  ) ~ .,
  data = whitehall,
  cut = c(40, seq(50, 80, 5), 90),
  episode = "ageband"
)

white_split |>  filter(id == "5001") |> 
  select(id,
         tstart, tstop, ageband) |> 
  left_join(whitehall |> 
              select(id,timein,timeout,timebth) |> 
              mutate(age_enter = (timein-timebth)/365.25)) |> 
  gt::gt()
id tstart tstop ageband Date of entry (days) Date of exit (days) Date of birth (days) age_enter
5001 47 50 2 1967-09-13 1987-01-30 1920-03-20 47.4825448643908
5001 50 55 3 1967-09-13 1987-01-30 1920-03-20 47.4825448643908
5001 55 60 4 1967-09-13 1987-01-30 1920-03-20 47.4825448643908
5001 60 65 5 1967-09-13 1987-01-30 1920-03-20 47.4825448643908
5001 65 67 6 1967-09-13 1987-01-30 1920-03-20 47.4825448643908

Q7

Note that there is no change in the information about length of follow-up

Show the code
# Stratify by grade
# use scale =1 because the time is already in years
pyears(Surv(tstart, tstop,chd) ~ grade,
       data = white_split,
       scale=1) %>%
  summary(n = F, rate = T, ci.r = T, scale = 1000)

Call: pyears(formula = Surv(tstart, tstop, chd) ~ grade, data = white_split, scale = 1)

number of observations = 6920

grade Events Time Event rate CI (rate)
higher 90 20340 4.4 3.6 - 5.4
lower 64 7266 8.8 6.8 - 11.2

Q8

Tabulate by agebands now

Show the code
pyears(Surv(tstart, tstop,chd) ~ ageband,
       data = white_split,
       scale=1) |> 
  summary(n = T, rate = T, ci.r = T, scale = 1000) #use n=T to see the number of individuals who were followed in each ageband
Call: pyears(formula = Surv(tstart, tstop, chd) ~ ageband, data = white_split, 
    scale = 1)

number of observations = 6920

 ageband    N     Events   Time   Event rate     CI (rate)    
--------- ------ -------- ------ ------------ --------------- 
    2       722      1     3138      0.32      0.0081 -  1.8 
    3      1078      9     4427      2.03      0.9296 -  3.9 
    4      1400     17     6003      2.83      1.6496 -  4.5 
    5      1500     38     6166      6.16      4.3614 -  8.5 
    6      1149     41     4398      9.32      6.6898 - 12.6 
    7       686     29     2442     11.87      7.9517 - 17.1 
    8       311     15      912     16.45      9.2047 - 27.1 
    9        74      4      118     33.78      9.2032 - 86.5 

Q9

Calculate the ageband-specific RRs for Low grade versus High grade using stmh and assess whether the effect of grade is the same across all the strata. Is it necessary to report the stratum specific estimates? Do you think that age confounds the relationship between job grade and CHD mortality?

Show the code
pyears(Surv(tstart, tstop,chd) ~ ageband+grade,
       data = white_split,
       scale=1)|> 
  summary(n = T, rate = T, ci.r = T, scale = 1000) 
Call: pyears(formula = Surv(tstart, tstop, chd) ~ ageband + grade, 
    data = white_split, scale = 1)

number of observations = 6920

---------- 
    N     
  Events  
   Time   
Event rate
CI (rate) 
---------- 

--------------------------------------- 
             grade                      
ageband     higher           lower      
------- --------------- --------------- 
              603             119       
                0               1       
   2        2667.9           470.4      
               0.0             2.1      
          0.000-  1.383   0.054- 11.843 
 
              871             207       
                6               3       
   3        3650.2           776.8      
               1.6             3.9      
          0.603-  3.578   0.796- 11.287 
 
             1074             326       
               12               5       
   4        4737.2          1266.0      
               2.5             3.9      
          1.309-  4.425   1.282-  9.216 
 
             1080             420       
               21              17       
   5        4493.7          1672.0      
               4.7            10.2      
          2.893-  7.144   5.923- 16.279 
 
              779             370       
               21              20       
   6        2900.6          1497.5      
               7.2            13.4      
          4.482- 11.067   8.158- 20.626 
 
              425             261       
               20               9       
   7        1407.6          1034.9      
              14.2             8.7      
          8.679- 21.944   3.977- 16.509 
 
              163             148       
                9               6       
   8         440.4           471.7      
              20.4            12.7      
          9.344- 38.793   4.668- 27.688 
 
               31              43       
                1               3       
   9          42.2            76.2      
              23.7            39.4      
          0.600-132.056   8.116-115.009 
--------------------------------------- 
Show the code
# To get the RR of grade adjusted for ageband, we need additional steps

dt_rates <- pyears(Surv(tstart, tstop,chd) ~ ageband+grade,
       data = white_split,
       scale=1,
       data.frame = T) 
# ageband is coded as numeric in the output
dt_rates$data$ageband <- as.factor(dt_rates$data$ageband)
dt_mh <- dt_rates$data |> arrange(desc(grade))

Painful to do MH in R

Show the code
dat.tab04 <- sapply(2:length(unique(dt_mh$ageband)), function(x) 
   as.matrix(dt_mh[dt_mh$ageband == x,c(5,3)], ncol = 2, byrow = TRUE), 
   simplify = "array")
epiR::epi.2by2(dat = dat.tab04, method = "cohort.time", digits = 2, 
   conf.level = 0.95, interpret = FALSE, 
   outcome = "as.columns")
             Outcome +     Time at risk                 Inc rate *
Exposed +           61 7189.35230511529        0.85 (0.65 to 1.09)
Exposed -           89 20297.5953348253        0.44 (0.35 to 0.54)
Total              150 27486.9476399405        0.55 (0.46 to 0.64)

Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc rate ratio (crude)                         1.94 (1.37, 2.71)
Inc rate ratio (M-H)                           1.38 (1.01, 1.90)
Inc rate ratio (crude:M-H)                     1.40
Attrib rate in the exposed (crude) *           0.41 (0.178, 0.642)
Attrib rate in the exposed (M-H) *             0.22 (-0.016, 0.465)
Attrib rate (crude:M-H)                        1.83
-------------------------------------------------------------------
 Wald confidence limits
 M-H: Mantel-Haenszel; CI: confidence interval
 * Outcomes per 100 units of population time at risk 

Model based is way better

Show the code
mod1 <-  glm(event ~ grade + ageband + offset(log(pyears)),
      family = poisson,
      data= dt_rates$data) 
mod_int <-  glm(event ~ grade*ageband + offset(log(pyears)),
      family = poisson,
      data= dt_rates$data) 
tbl_regression(mod1,exponentiate = T)
Characteristic IRR 95% CI p-value
grade


    higher
    lower 1.41 1.01, 1.96 0.040
ageband


    2
    3 6.32 1.19, 117 0.080
    4 8.68 1.78, 156 0.036
    5 18.5 4.01, 328 0.004
    6 27.2 5.92, 483 0.001
    7 33.7 7.18, 601 <0.001
    8 45.2 9.10, 819 <0.001
    9 88.9 13.1, 1,746 <0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

The Poisson regression usually is used for model count data, but we can also model rates with it. To do that, we need to transform the count in rates, and the offset variable will inform us about the “size” of exposure of each individual/unit. The regression coefficient for an offset variable is constrained to be 1, thus allowing our model to represent rates rather than counts. When you fit a model with an offset, the exponentiated regression coefficient of a predictor variable tells us how much the expected rate changes multiplicatively for a one unit increase in the predictor variable. When using an offset, the assumption is made that doubling the unit size (measurement time, etc.) will lead to a doubling of the count outcome. If this assumption is not appropriate, controlling for the unit size as a covariate instead of an offset may be more appropriate. A likelihood ratio test could be used to determine if the fit of these two models is significantly different.

Why the offset is in log? Consider the rate \(\lambda\)

\[\lambda=\frac{\mu_{i}}{t_{i}}\]

\[ log(\lambda)=\alpha +\beta x \]

The model will be \[ log(\mu)-log(t)=\alpha +\beta x \]

\[ log(\mu)=\alpha +\beta x +log(t) \]

Check interaction

Show the code
lmtest::lrtest(mod1,mod_int)
#Df LogLik Df Chisq Pr(>Chisq)
9 -35 NA NA NA
16 -29 7 13 0.06

Different answers from STATA output. Here, we are modelling it using Poisson and testing interaction using a likelihood test. Instead, in the Stata solutions it is with stratified rate ratios (Mantel-Haenszel)

All subsequent questions will be answered using model based / likelihood ratio tests for interactions

Q10

Using strate and stmh examine (a) the effect of smoking on CHD mortality, (b) whether smoking confounds the relationship between job grade and CHD mortality. You may wish to recode the smoking variable into three categories (i) never, (ii) ex, (iii) current smokers

Show the code
# Recode
white_split <- white_split |> 
  mutate(smok3 = as.factor(case_when(smok == "never" ~ "never",
                                     smok == "ex" ~ "ex",
                                     smok == "1-14/day" ~ "current",
                                     smok == "15-24/day" ~ "current",
                                     smok == "25+/day" ~ "current")),
         smok3 = fct_relevel(smok3, "never", "ex", "current"))# Order level



pyears(Surv(tstart, tstop,chd) ~ smok3+grade,
       data = white_split,
       scale=1)|> 
  summary(n = T, rate = T, ci.r = T, scale = 1000) 
Call: pyears(formula = Surv(tstart, tstop, chd) ~ smok3 + grade, data = white_split, 
    scale = 1)

number of observations = 6920

---------- 
    N     
  Events  
   Time   
Event rate
CI (rate) 
---------- 

------------------------------- 
           grade                
 smok3    higher       lower    
------- ----------- ----------- 
           1094         245     
              6           5     
 never    4603.9       988.2    
             1.3         5.1    
         0.48- 2.84  1.64-11.81 
 
           2116         595     
             38          17     
  ex      8462.5      2308.1    
             4.5         7.4    
         3.18- 6.16  4.29-11.79 
 
           1816        1054     
             46          42     
current   7273.5      3969.3    
             6.3        10.6    
         4.63- 8.44  7.63-14.30 
------------------------------- 
Show the code
# To get the RR of grade adjusted for ageband, we need additional steps

dt_rates <- pyears(Surv(tstart, tstop,chd) ~ smok3+grade,
       data = white_split,
       scale=1,
       data.frame = T) 
dt_mh <- dt_rates$data 
Show the code
mod1 <-  glm(event ~ grade + smok3 + offset(log(pyears)),
      family = poisson,
      data= dt_rates$data) 
mod_int <-  glm(event ~ grade*smok3 + offset(log(pyears)),
      family = poisson,
      data= dt_rates$data) 
tbl_regression(mod1,exponentiate = T)
Characteristic IRR 95% CI p-value
grade


    higher
    lower 1.76 1.27, 2.43 <0.001
smok3


    never
    ex 2.53 1.38, 5.10 0.005
    current 3.56 1.98, 7.08 <0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

Check interaction

Show the code
lmtest::lrtest(mod1,mod_int)
#Df LogLik Df Chisq Pr(>Chisq)
4 -15 NA NA NA
6 -14 2 1.7 0.42

Q11

Examine the effect of job grade on CHD mortality adjusted for ageband and smoking simultaneously. What do you conclude about the effect of job grade on CHD mortality having adjusted for both of these factors together?

Show the code
pyears(Surv(tstart, tstop,chd) ~ smok3+grade+ageband,
       data = white_split,
       scale=1)|> 
  summary(n = T, rate = T, ci.r = T, scale = 1000) 
Call: pyears(formula = Surv(tstart, tstop, chd) ~ smok3 + grade + ageband, 
    data = white_split, scale = 1)

number of observations = 6920

---------- 
    N     
  Events  
   Time   
Event rate
CI (rate) 
---------- 

: ageband=2 
--------------------------------------- 
             grade                      
 smok3      higher           lower      
------- --------------- --------------- 
              162              21       
                0               0       
 never       848.37           85.47     
               0.00            0.00     
          0.000-  4.348   0.000- 43.161 
 
              221              35       
                0               0       
  ex         907.18          140.02     
               0.00            0.00     
          0.000-  4.066   0.000- 26.346 
 
              220              63       
                0               1       
current      912.38          244.96     
               0.00            4.08     
          0.000-  4.043   0.103- 22.745 
--------------------------------------- 
: ageband=3 
--------------------------------------- 
             grade                      
 smok3      higher           lower      
------- --------------- --------------- 
              215              29       
                0               0       
 never       935.63          124.46     
               0.00            0.00     
          0.000-  3.943   0.000- 29.640 
 
              341              62       
                1               1       
  ex        1412.29          227.72     
               0.71            4.39     
          0.018-  3.945   0.111- 24.467 
 
              315             116       
                5               2       
current     1302.28          424.60     
               3.84            4.71     
          1.247-  8.960   0.570- 17.015 
--------------------------------------- 
: ageband=4 
--------------------------------------- 
             grade                      
 smok3      higher           lower      
------- --------------- --------------- 
              247              41       
                1               1       
 never      1118.70          166.52     
               0.89            6.01     
          0.023-  4.980   0.152- 33.459 
 
              442             102       
                5               1       
  ex        1909.13          398.85     
               2.62            2.51     
          0.850-  6.112   0.063- 13.969 
 
              385             183       
                6               3       
current     1709.37          700.66     
               3.51            4.28     
          1.288-  7.640   0.883- 12.513 
--------------------------------------- 
: ageband=5 
--------------------------------------- 
             grade                      
 smok3      higher           lower      
------- --------------- --------------- 
              231              53       
                4               0       
 never       900.08          218.29     
               4.44            0.00     
          1.211- 11.378   0.000- 16.899 
 
              461             129       
                6               3       
  ex        1979.09          530.70     
               3.03            5.65     
          1.113-  6.599   1.166- 16.520 
 
              388             238       
               11              14       
current     1614.51          923.05     
               6.81           15.17     
          3.401- 12.191   8.292- 25.448 
--------------------------------------- 
: ageband=6 
--------------------------------------- 
             grade                      
 smok3      higher           lower      
------- --------------- --------------- 
              140              46       
                0               0       
 never       500.40          187.88     
               0.00            0.00     
          0.000-  7.372   0.000- 19.634 
 
              355             118       
               12               9       
  ex        1347.42          486.21     
               8.91           18.51     
          4.602- 15.557   8.464- 35.139 
 
              284             206       
                9              11       
current     1052.77          823.42     
               8.55           13.36     
          3.909- 16.228   6.669- 23.903 
--------------------------------------- 
: ageband=7 
--------------------------------------- 
             grade                      
 smok3      higher           lower      
------- --------------- --------------- 
               71              30       
                1               2       
 never       231.88          126.92     
               4.31           15.76     
          0.109- 24.029   1.908- 56.924 
 
              201              88       
               10               2       
  ex         671.00          342.64     
              14.90            5.84     
          7.147- 27.407   0.707- 21.086 
 
              153             143       
                9               5       
current      504.72          565.31     
              17.83            8.84     
          8.154- 33.850   2.872- 20.640 
--------------------------------------- 
: ageband=8 
--------------------------------------- 
             grade                      
 smok3      higher           lower      
------- --------------- --------------- 
               24              19       
                0               1       
 never        63.93           72.80     
               0.00           13.74     
          0.000- 57.706   0.348- 76.532 
 
               81              47       
                4               0       
  ex         215.82          152.47     
              18.53            0.00     
          5.050- 47.455   0.000- 24.193 
 
               58              82       
                5               5       
current      160.67          246.39     
              31.12           20.29     
         10.104- 72.623   6.589- 47.357 
--------------------------------------- 
: ageband=9 
--------------------------------------- 
             grade                      
 smok3      higher           lower      
------- --------------- --------------- 
                4               6       
                0               1       
 never         4.87            5.87     
               0.00          170.29     
          0.000-756.995   4.311-948.797 
 
               14              14       
                0               1       
  ex          20.55           29.48     
               0.00           33.92     
          0.000-179.535   0.859-188.976 
 
               13              23       
                1               1       
current       16.77           40.88     
              59.62           24.46     
          1.510-332.204   0.619-136.307 
--------------------------------------- 
Show the code
dt_rates <- pyears(Surv(tstart, tstop,chd) ~ smok3+grade+ageband,
       data = white_split,
       scale=1,
       data.frame = T)
# ageband is coded as numeric in the output
dt_rates$data$ageband <- as.factor(dt_rates$data$ageband)
mod1 <-  glm(event ~ grade+smok3 + ageband+offset(log(pyears)),
      family = poisson,
      data= dt_rates$data) 
tbl_regression(mod1,exponentiate = T)
Characteristic IRR 95% CI p-value
grade


    higher
    lower 1.27 0.91, 1.78 0.2
smok3


    never
    ex 2.11 1.15, 4.26 0.024
    current 3.12 1.73, 6.20 <0.001
ageband


    2
    3 6.09 1.15, 112 0.086
    4 8.26 1.70, 149 0.040
    5 17.3 3.76, 308 0.005
    6 25.3 5.50, 449 0.001
    7 31.3 6.67, 558 <0.001
    8 42.4 8.53, 768 <0.001
    9 80.8 11.9, 1,587 <0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio
Show the code
mod_int <-  glm(event ~ grade*smok3 + ageband+offset(log(pyears)),
      family = poisson,
      data= dt_rates$data) 
lmtest::lrtest(mod1,mod_int)
#Df LogLik Df Chisq Pr(>Chisq)
11 -75 NA NA NA
13 -74 2 1.2 0.53

There is no evidence of effect modification between grade and smoking. \(\chi_{2}^2 = 1.25\) and p=0.53

Q12

The file whchd.dta holds the Whitehall data already expanded by current age and period (in 5-years periods) and has been stset for the analysis of CHD events. It also holds a variable, called rate, giving the corresponding age and period specific CHD mortality rates for England and Wales (per 1,000,000 person years). Compute the CHD SMR for the cohort of Whitehall civil servants

Show the code
# way more complicated than STATA (using this format of data)
# the stranges names from stata dont work verwy well in R - use clean_names to fix
whcd <- haven::read_dta("datasets/WHCHD.DTA") |> janitor::clean_names()
# get the expected number of cases - adjusted by period, ageband and grade

expectdt <- whcd |> group_by(ageband, grade, period) |> 
  reframe(a=mean(rate)/1e6,
          totalpy=sum(t-t0)) |> 
  reframe(e=a*totalpy) |> 
  reframe(sum(e))


observedt <- whcd |> 
  reframe(sum(chd, na.rm=T))

popEpi::poisson.ci(147,227)
x pt rate lower upper conf.level
147 227 0.65 0.55 0.76 0.95