# read "dta" fileslibrary(haven) # create tablelibrary(gtsummary)# data wrangling and other functionslibrary(tidyverse)# if you want to have a output similar to statalibrary(tidylog)#coxlibrary(survival)# plots survivallibrary(ggsurvfit)# Limit significant digits to 2, remove scientific notationoptions(digits =2, scipen =9)
Q1
Open a log file, change to the directory where your ASME data have been copied and then read the Whitehall data, whitehal.dta. In this practical we are interested in the effect of job grade on CHD mortality.
Show the code
whitehall <-read_stata("datasets/WHITEHAL.DTA")# data wranglingwhitehall <- whitehall |>mutate(id =as.integer(id),all =as.integer(all), # better keep numeric (not a problem in this case)chd =as.integer(chd),grade4 =case_when( # recoding as character using case_when, most models will recognize characters as factors with some exemptions (mgcv doesnt) grade4 ==1~"admin", grade4 ==2~"profess", grade4 ==3~"clerical", grade4 ==4~"other" ),smok =case_when( smok ==1~"never", smok ==2~"ex", smok ==3~"1-14/day", smok ==4~"15-24/day", smok ==5~"25+/day" ),grade =case_when( grade ==1~"higher", grade ==2~"lower"),cholgrp =as.factor(cholgrp),sbpgrp =as.factor(sbpgrp))
Q2
Compute the rates for the two grades of employment categories using the commands below. pyears()
Calculate rates stratified by exposure (the two grades of employment: grade).
You create a survival object with Surv(); it contains duration of follow-up and status at end of follow-up. You dont need to setup it globally as in STATA. You have freedom to use different timescales for each model
You then calculate stratified rates pyears(): in the formula, you use a Surv object, and then the stratification; you then pipe this into summary()
The argument “scale” from pyears() is to scaling the results, for example if your time is in days and want to report in years use scale = 365.25 (the default value)
Calculate the RR by hand (from the two rates shown on the screen).
Answer
\(\frac{8.8}{4.4}=2\)
Q3
Now, set again the time and outcome variables with stset but this time, use the time of birth as the origin, i.e. organise the data according to current age.
# I dont know any easy alternative to see the summary (similar to STATA)summary(Surv(as.numeric(whitehall$timein) /365.25,as.numeric(whitehall$timeout) /365.25,origin =as.numeric(whitehall$timebth) /365.25, whitehall$chd))
start stop status
Min. :40 Min. :44 Min. :0.00
1st Qu.:47 1st Qu.:64 1st Qu.:0.00
Median :52 Median :68 Median :0.00
Mean :52 Mean :69 Mean :0.09
3rd Qu.:57 3rd Qu.:73 3rd Qu.:0.00
Max. :69 Max. :86 Max. :1.00
Q4, Q5 and Q6
Next we need to split the individual follow-up times into intervals specific to different agebands.
Use the stsplit command to create 5-years groups of current age between age 50 and 80 and 10-year groups for the youngest and oldest groups.
Show the code
# dividing by 365.25 to transform in years instead days# the variables used in time, time2,origin will be removed from the new datasetwhite_split <-survSplit(Surv(time =as.numeric(timein) /365.25,time2 =as.numeric(timeout) /365.25,event = chd,origin =as.numeric(timebth) /365.25 ) ~ .,data = whitehall,cut =c(40, seq(50, 80, 5), 90),episode ="ageband")white_split |>filter(id =="5001") |>select(id, tstart, tstop, ageband) |>left_join(whitehall |>select(id,timein,timeout,timebth) |>mutate(age_enter = (timein-timebth)/365.25)) |> gt::gt()
id
tstart
tstop
ageband
Date of entry (days)
Date of exit (days)
Date of birth (days)
age_enter
5001
47
50
2
1967-09-13
1987-01-30
1920-03-20
47.4825448643908
5001
50
55
3
1967-09-13
1987-01-30
1920-03-20
47.4825448643908
5001
55
60
4
1967-09-13
1987-01-30
1920-03-20
47.4825448643908
5001
60
65
5
1967-09-13
1987-01-30
1920-03-20
47.4825448643908
5001
65
67
6
1967-09-13
1987-01-30
1920-03-20
47.4825448643908
Q7
Note that there is no change in the information about length of follow-up
Show the code
# Stratify by grade# use scale =1 because the time is already in yearspyears(Surv(tstart, tstop,chd) ~ grade,data = white_split,scale=1) %>%summary(n = F, rate = T, ci.r = T, scale =1000)
pyears(Surv(tstart, tstop,chd) ~ ageband,data = white_split,scale=1) |>summary(n = T, rate = T, ci.r = T, scale =1000) #use n=T to see the number of individuals who were followed in each ageband
Calculate the ageband-specific RRs for Low grade versus High grade using stmh and assess whether the effect of grade is the same across all the strata. Is it necessary to report the stratum specific estimates? Do you think that age confounds the relationship between job grade and CHD mortality?
# To get the RR of grade adjusted for ageband, we need additional stepsdt_rates <-pyears(Surv(tstart, tstop,chd) ~ ageband+grade,data = white_split,scale=1,data.frame = T) # ageband is coded as numeric in the outputdt_rates$data$ageband <-as.factor(dt_rates$data$ageband)dt_mh <- dt_rates$data |>arrange(desc(grade))
Outcome + Time at risk Inc rate *
Exposed + 61 7189.35230511529 0.85 (0.65 to 1.09)
Exposed - 89 20297.5953348253 0.44 (0.35 to 0.54)
Total 150 27486.9476399405 0.55 (0.46 to 0.64)
Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc rate ratio (crude) 1.94 (1.37, 2.71)
Inc rate ratio (M-H) 1.38 (1.01, 1.90)
Inc rate ratio (crude:M-H) 1.40
Attrib rate in the exposed (crude) * 0.41 (0.178, 0.642)
Attrib rate in the exposed (M-H) * 0.22 (-0.016, 0.465)
Attrib rate (crude:M-H) 1.83
-------------------------------------------------------------------
Wald confidence limits
M-H: Mantel-Haenszel; CI: confidence interval
* Outcomes per 100 units of population time at risk
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio
Explanation - Offset
The Poisson regression usually is used for model count data, but we can also model rates with it. To do that, we need to transform the count in rates, and the offset variable will inform us about the “size” of exposure of each individual/unit. The regression coefficient for an offset variable is constrained to be 1, thus allowing our model to represent rates rather than counts. When you fit a model with an offset, the exponentiated regression coefficient of a predictor variable tells us how much the expected rate changes multiplicatively for a one unit increase in the predictor variable. When using an offset, the assumption is made that doubling the unit size (measurement time, etc.) will lead to a doubling of the count outcome. If this assumption is not appropriate, controlling for the unit size as a covariate instead of an offset may be more appropriate. A likelihood ratio test could be used to determine if the fit of these two models is significantly different.
Why the offset is in log? Consider the rate \(\lambda\)
\[\lambda=\frac{\mu_{i}}{t_{i}}\]
\[ log(\lambda)=\alpha +\beta x \]
The model will be \[ log(\mu)-log(t)=\alpha +\beta x \]
\[ log(\mu)=\alpha +\beta x +log(t) \]
Check interaction
Show the code
lmtest::lrtest(mod1,mod_int)
#Df
LogLik
Df
Chisq
Pr(>Chisq)
9
-35
NA
NA
NA
16
-29
7
13
0.06
Different answers from STATA output. Here, we are modelling it using Poisson and testing interaction using a likelihood test. Instead, in the Stata solutions it is with stratified rate ratios (Mantel-Haenszel)
All subsequent questions will be answered using model based / likelihood ratio tests for interactions
Q10
Using strate and stmh examine (a) the effect of smoking on CHD mortality, (b) whether smoking confounds the relationship between job grade and CHD mortality. You may wish to recode the smoking variable into three categories (i) never, (ii) ex, (iii) current smokers
Call: pyears(formula = Surv(tstart, tstop, chd) ~ smok3 + grade, data = white_split,
scale = 1)
number of observations = 6920
----------
N
Events
Time
Event rate
CI (rate)
----------
-------------------------------
grade
smok3 higher lower
------- ----------- -----------
1094 245
6 5
never 4603.9 988.2
1.3 5.1
0.48- 2.84 1.64-11.81
2116 595
38 17
ex 8462.5 2308.1
4.5 7.4
3.18- 6.16 4.29-11.79
1816 1054
46 42
current 7273.5 3969.3
6.3 10.6
4.63- 8.44 7.63-14.30
-------------------------------
Show the code
# To get the RR of grade adjusted for ageband, we need additional stepsdt_rates <-pyears(Surv(tstart, tstop,chd) ~ smok3+grade,data = white_split,scale=1,data.frame = T) dt_mh <- dt_rates$data
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio
Check interaction
Show the code
lmtest::lrtest(mod1,mod_int)
#Df
LogLik
Df
Chisq
Pr(>Chisq)
4
-15
NA
NA
NA
6
-14
2
1.7
0.42
Q11
Examine the effect of job grade on CHD mortality adjusted for ageband and smoking simultaneously. What do you conclude about the effect of job grade on CHD mortality having adjusted for both of these factors together?
There is no evidence of effect modification between grade and smoking. \(\chi_{2}^2 = 1.25\) and p=0.53
Q12
The file whchd.dta holds the Whitehall data already expanded by current age and period (in 5-years periods) and has been stset for the analysis of CHD events. It also holds a variable, called rate, giving the corresponding age and period specific CHD mortality rates for England and Wales (per 1,000,000 person years). Compute the CHD SMR for the cohort of Whitehall civil servants
Show the code
# way more complicated than STATA (using this format of data)# the stranges names from stata dont work verwy well in R - use clean_names to fixwhcd <- haven::read_dta("datasets/WHCHD.DTA") |> janitor::clean_names()# get the expected number of cases - adjusted by period, ageband and gradeexpectdt <- whcd |>group_by(ageband, grade, period) |>reframe(a=mean(rate)/1e6,totalpy=sum(t-t0)) |>reframe(e=a*totalpy) |>reframe(sum(e))observedt <- whcd |>reframe(sum(chd, na.rm=T))popEpi::poisson.ci(147,227)