07: Poisson regression for cohort studies

Author

Thiago Cerqueira Silva

Show the code
# read "dta" files
library(haven) 
# create table
library(gtsummary)
# data wrangling and other functions
library(tidyverse)
# if you want to have a output similar to stata
library(tidylog)
#cox
library(survival)
# plots survival
library(ggsurvfit)
# Limit significant digits to 2, remove scientific notation
options(digits = 2, scipen = 9)

The following questions refer to the dataset ondrate.dta. This contains data on 1536 individuals living in an onchocercal zone in Nigeria, who were followed over a period up to 3 years to study the incidence of optic nerve disease (OND) and the extent to which onchocerciasis is a risk factor for OND.

Q1

Use the desc-STATA / glimpse() and list(STATA) / str() commands to find out what information the dataset contains.

Show the code
ondrate <- read_stata("datasets/ondrate.dta")

glimpse(ondrate)
Rows: 1,536
Columns: 8
$ id      <dbl> 18, 23, 30, 39, 42, 43, 48, 56, 62, 63, 72, 73, 75, 78, 79, 90…
$ disc2   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0,…
$ age     <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 3, 1, 1, 1, 1…
$ sex     <dbl> 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 1, 1, 2, 1,…
$ mfpermg <dbl+lbl> 2, 0, 2, 2, 1, 0, 1, 0, 2, 1, 2, 2, 2, 1, 2, 1, 2, 0, 0, 0…
$ pyrs    <dbl> 2.7, 1.9, 2.7, 2.7, 2.7, 2.8, 2.7, 2.7, 2.7, 1.9, 2.7, 2.7, 2.…
$ start   <date> 1983-01-01, 1983-01-01, 1983-01-01, 1983-01-01, 1983-01-01, 1…
$ end     <date> 1985-09-27, 1984-11-28, 1985-09-27, 1985-09-27, 1985-09-27, 1…
Show the code
ondrate <- read_stata("datasets/ondrate.dta") |> 
  mutate(across(where(is.labelled),as_factor))
# alternative
#str(ondrate)

Age is coded as numeric, but it is a categorical variable in the original dataset. We can convert all variables with labels (STATA) to factor. The code mutate(across(where(is.labelled),as_factor)) checks if a column has labels, if it has, convert to factor

Q2

Use the strate(STATA)/ pyears() command to examine the patterns of incidence of OND by age, sex and microfilarial load (a measure of the intensity of onchocercal infection). Don’t forget that you will need to issue the following command before you can use any st commands:

Show the code
# Rates by age
# The default value of scale in
pyears(Surv(time = as.numeric(start), 
                         time2 = as.numeric(end), 
                         event = disc2) ~ age, ondrate, scale = 365.25) %>% summary(n = F, rate = T, ci.r = T, scale = 1000)

Call: pyears(formula = Surv(time = as.numeric(start), time2 = as.numeric(end), event = disc2) ~ age, data = ondrate, scale = 365.25)

number of observations = 1536

age Events Time Event rate CI (rate)
<35 19 2352 8.1 4.9 - 13
35-44 16 812 19.7 11.3 - 32
45-54 20 453 44.2 27.0 - 68
55+ 16 285 56.1 32.0 - 91
Show the code
# Rates by sex
pyears(Surv(time = as.numeric(start), 
                         time2 = as.numeric(end), 
                         event = disc2) ~ sex, ondrate)  |>  summary(n = F, rate = T, ci.r = T, scale = 1000)

Call: pyears(formula = Surv(time = as.numeric(start), time2 = as.numeric(end), event = disc2) ~ sex, data = ondrate)

number of observations = 1536

sex Events Time Event rate CI (rate)
1 38 1858 20 14 - 28
2 33 2044 16 11 - 23
Show the code
# Rates by microfilarial load
pyears(Surv(time = as.numeric(start), 
                         time2 = as.numeric(end), 
                         event = disc2) ~ mfpermg, ondrate)  |>  summary(n = F, rate = T, ci.r = T, scale = 1000)

Call: pyears(formula = Surv(time = as.numeric(start), time2 = as.numeric(end), event = disc2) ~ mfpermg, data = ondrate)

number of observations = 1536

mfpermg Events Time Event rate CI (rate)
0 14 1157 12 6.6 - 20
1-9.99 18 1511 12 7.1 - 19
10+ 39 1233 32 22.5 - 43

Q3

Using the streg(STATA) / glm command and the dist(exp)(STATA)/family=poisson option, fit simple Poisson regression models to estimate (separately) the crude effects of age, sex and microfilarial load on OND incidence. Compare the rate ratio estimates obtained from streg (STATA)/glm (Poisson regression) with those obtained using stmh.

(As explained in Pratical 6 - Q9, it is painfull to do mantel haenszel rate ratio in R and there is no gain in conduct MH instead Poisson regression)

Show the code
# Age
glm(disc2 ~ age + offset(log(pyrs)),
    family = poisson,
    data = ondrate)  |> 
  tbl_regression(exponentiate=T) |> 
  modify_caption(caption = "Age")
Age
Characteristic IRR 95% CI p-value
Age at start


    <35
    35-44 2.44 1.24, 4.74 0.009
    45-54 5.47 2.91, 10.3 <0.001
    55+ 6.94 3.52, 13.5 <0.001
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio
Show the code
# Sex
glm(disc2 ~ sex + offset(log(pyrs)),
    family = poisson,
    data = ondrate)  |> 
  tbl_regression(exponentiate=T) |> 
  modify_caption(caption = "Sex")
Sex
Characteristic IRR 95% CI p-value
Sex: 1=male, 2=female 0.79 0.49, 1.26 0.3
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio
Show the code
# Microfilarial load
glm(disc2 ~ mfpermg + offset(log(pyrs)),
    family = poisson,
    data = ondrate)  |> 
  tbl_regression(exponentiate=T) |> 
  modify_caption(caption = "Microfilarial load")
Microfilarial load
Characteristic IRR 95% CI p-value
Microfilariae per mg


    0
    1-9.99 0.98 0.49, 2.01 >0.9
    10+ 2.61 1.45, 4.98 0.002
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

Q4

Are either age or sex associated with microfilarial load (MfL)? Which, if either, are likely to confound the association between microfilarial load and incidence of OND?

Show the code
ondrate |> 
  tbl_summary(by = mfpermg,
              include = c(age,sex)) |> 
  add_p()
Characteristic 0
N = 4711
1-9.99
N = 5881
10+
N = 4771
p-value2
Age at start


0.019
    <35 315 (67%) 342 (58%) 268 (56%)
    35-44 78 (17%) 136 (23%) 106 (22%)
    45-54 48 (10%) 70 (12%) 60 (13%)
    55+ 30 (6.4%) 40 (6.8%) 43 (9.0%)
Sex: 1=male, 2=female


0.12
    1 211 (45%) 278 (47%) 245 (51%)
    2 260 (55%) 310 (53%) 232 (49%)
1 n (%)
2 Pearson’s Chi-squared test

MfL of males and females is similar. Age is potential confounder, sex is unlikely to be

Q5

Use Poisson regression to obtain rate ratio estimates for microfilarial load adjusted for (i) age, (ii) sex, (iii) both age and sex. Is there any indication of confounding?

Show the code
# Age
tbl_age <- glm(disc2 ~ mfpermg + age + offset(log(pyrs)),
    family = poisson,
    data = ondrate)  |> 
  tbl_regression(exponentiate=T) 

# Sex
tbl_sex <- glm(disc2 ~ mfpermg+ sex + offset(log(pyrs)),
    family = poisson,
    data = ondrate)  |> 
  tbl_regression(exponentiate=T) 

# Microfilarial load
tbl_both <- glm(disc2 ~ mfpermg +sex+age+ offset(log(pyrs)),
    family = poisson,
    data = ondrate)  |> 
  tbl_regression(exponentiate=T) 


tbl_merge(list(
  tbl_age,
  tbl_sex,
  tbl_both
  ),
         tab_spanner = c("**Adjusted for Age**", "**Sex**",
                         "**Both**") )
Characteristic
Adjusted for Age
Sex
Both
IRR 95% CI p-value IRR 95% CI p-value IRR 95% CI p-value
Microfilariae per mg








    0


    1-9.99 0.91 0.45, 1.87 0.8 0.98 0.49, 2.01 >0.9 0.92 0.46, 1.88 0.8
    10+ 2.28 1.26, 4.35 0.008 2.59 1.44, 4.94 0.002 2.28 1.27, 4.36 0.008
Age at start








    <35




    35-44 2.37 1.20, 4.61 0.011


2.34 1.19, 4.57 0.012
    45-54 5.27 2.80, 9.95 <0.001


5.24 2.78, 9.89 <0.001
    55+ 6.37 3.23, 12.4 <0.001


6.32 3.20, 12.3 <0.001
Sex: 1=male, 2=female


0.82 0.51, 1.31 0.4 0.88 0.55, 1.40 0.6
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

Compared with the crude rate ratios, the age-adjusted rate ratios for MfL groups 1-9.99 and 10+ are somewhat reduced, indicating some confounding as expected (those with high MfL tend to be older and at greater risk because of their older age).

Adjusting for sex has little impact on the mf load rate ratios, indicating little or no confounding (again as expected – little difference in mf load between the sexes). Adjusting for sex in addition to age changes the mf load rate ratios only slightly. There is no evidence (Wald test p-value of 0.587) that sex is associated with OND.

Q6

Poisson regression models assume, unless you include interaction terms, that effects of different variables combine multiplicatively. Is there any evidence of an interaction between age and microfilarial load? Between sex and microfilarial load?

Sex

Show the code
tbl_no_int <- glm(disc2 ~ mfpermg +sex+ offset(log(pyrs)),
    family = poisson,
    data = ondrate) 
tbl_w_int <- glm(disc2 ~ mfpermg*sex+ offset(log(pyrs)),
    family = poisson,
    data = ondrate) 
lmtest::lrtest(tbl_no_int,tbl_w_int)
#Df LogLik Df Chisq Pr(>Chisq)
4 -276 NA NA NA
6 -275 2 0.31 0.86

Age

Show the code
tbl_no_int <- glm(disc2 ~ mfpermg +age+ offset(log(pyrs)),
    family = poisson,
    data = ondrate) 
tbl_w_int <- glm(disc2 ~ mfpermg*age+ offset(log(pyrs)),
    family = poisson,
    data = ondrate) 
lmtest::lrtest(tbl_no_int,tbl_w_int)
#Df LogLik Df Chisq Pr(>Chisq)
6 -256 NA NA NA
12 -251 6 9.1 0.17

The likelihood ratio test for sex produces p=0.856, so there is no evidence of an interaction between mf load and sex. i.e. there is no evidence that the effect of MfL on OND is different for males and females. Similarly, there is no evidence of an interaction between age and MfL (p=0.17), however it is important to note the number of parameters (6) used in this LRT. You could try testing age as ordinal variable (also no evidence of interaction, p = 0.14)

Q7

Fitting the age variable included in the dataset as a factor introduces three parameters into the model. Would it be reasonable to treat the age variable as continuous and thus reduce the number of parameters in the model? How would you assess this?

Show the code
tbl_ordinal <- glm(disc2 ~ as.numeric(age)+ offset(log(pyrs)),
    family = poisson,
    data = ondrate) 
tbl_cat <- glm(disc2 ~ age+ offset(log(pyrs)),
    family = poisson,
    data = ondrate) 
lmtest::lrtest(tbl_ordinal,tbl_cat)
#Df LogLik Df Chisq Pr(>Chisq)
2 -264 NA NA NA
4 -263 2 2.1 0.34

The likelihood ratio test shows no evidence against (or in favor) of using age as ordinal (linear)

Q8

Open whitehal.dta and use Poisson regression to repeat the analyses in question 11 of practical 5, i.e. examine the effect of job grade on CHD mortality adjusted for ageband and smoking status simultaneously. How does the result obtained from Poisson regression compare to that obtained using the stmh command?

Q11 of Session 6 already has it