library(haven) # create tablelibrary(gtsummary)# data wrangling and other functionslibrary(tidyverse)# if you want to have a output similar to statalibrary(tidylog)#coxlibrary(survival)# plots survivallibrary(ggsurvfit)# Limit significant digits to 2, remove scientific notationoptions(digits =2, scipen =9)papua <-read_dta("datasets/pngnew.dta") |>mutate(across(where(is.labelled),as_factor))
In the first part of the practical we examine data from a pneumococcal vaccine trial performed in Papua New Guinea (pngnew.dta). The data are arranged by episode, so that each child in the trial may have several records, one for each episode of clinical pneumonia (without laboratory confirmation of pneumococcal infection) and, usually, one for the period from their last episode until they exited the trial. Thus, for each child we have data on the number of episodes they experienced (the numerator) and the length of time for which they were followed up (the denominator). We can thus calculate rates and rate ratios and an appropriate statistical model is a Poisson regression model with a random effect to account for between child variability in susceptibility/exposure.
Note
These data are from a pneumococcal vaccine trial performed in Papua New Guinea, assessing the vaccine efficacy in preventing clinical episodes of pneumonia among children.
Each child might have more than one record, because each record represents an episode of pneumonia (or the last period of follow-up, without pneumonia).
*Outcome* : any indicates whether the current period of observation ended in an episode or not
# A tibble: 5 × 10
id sex vacc any anyprev timein timeout dob datevac
<dbl> <fct> <fct> <dbl> <dbl> <date> <date> <date> <date>
1 2921 male placebo 1 0 1981-10-26 1982-04-07 1976-12-08 1981-10-26
2 2921 male placebo 1 1 1982-04-07 1982-10-18 1976-12-08 1981-10-26
3 2921 male placebo 1 2 1982-10-18 1983-05-04 1976-12-08 1981-10-26
4 2921 male placebo 1 3 1983-05-04 1983-05-25 1976-12-08 1981-10-26
5 2921 male placebo 0 4 1983-05-25 1983-11-30 1976-12-08 1981-10-26
# ℹ 1 more variable: pyrs <dbl>
Q2
Individual children can have multiple records in the database, depending on how many episodes of clinical pneumonia they experienced. Use the following commands to examine the number of children in each vaccine group and the number of episodes per child
Compared to STATA, you dont need to worry about the dataset being modified unintentionally (R>Stata). The coding is grouping the individuals by id and doing operations in each small dataset (all rows from each id). sum(any) sums all values in the “any” variable, max(as.numeric(vacc)) is a workaround of recovering the group, as each participant is either vaccinated or unvaccinated. This approach wouldn’t work if the participant could be vaccinated after enter the study
Determine how many episodes there were in total in each vaccine group
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio
Show the code
glm(event~vacc +offset(log(pyears)), family=poisson,data=dt_pois) |>tbl_regression(exponentiate = F) |>modify_column_unhide(column = std.error) #trick to show SE
Characteristic
log(IRR)
SE
95% CI
p-value
vacc
placebo
—
—
—
vaccine
-0.10
0.042
-0.18, -0.02
0.017
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio, SE = Standard Error
Q5
Refit the Poisson regression model, this time asking Stata to compute robust standard errors adjusted for clustering. What impact does taking account of the clustering have on your interpretation of the vaccine’s effect?
We have multiple options to calculate cluster robust SE in R.
Show the code
# Using the miceadds package (not my preferred option)# pois_rob <- miceadds::glm.cluster(papua,# any ~ vacc + offset(log(pyrs)),# cluster = "id",# family = "poisson")# summary(pois_rob)
Show the code
# using sandwich and lmtestlibrary(sandwich)library(lmtest)pois_no_cluster <-glm(any ~ vacc +offset(log(pyrs)),family ="poisson",data = papua)coeftest(pois_no_cluster, vcov = vcovCL, cluster =~ id)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.00973 0.04172 -0.23 0.816
vaccvaccine -0.10117 0.06087 -1.66 0.097 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Show the code
# using parameters::model_parameters (requires sandwich package)parameters::model_parameters( pois_no_cluster,vcov ="vcovCL",vcov_args =list(cluster=papua$id),exponentiate=T)
Parameter | IRR | SE | 95% CI | z | p
-----------------------------------------------------------
(Intercept) | 0.99 | 0.04 | [0.91, 1.07] | -0.23 | 0.816
vacc [vaccine] | 0.90 | 0.06 | [0.80, 1.02] | -1.66 | 0.097
Answer
The SE increased from 0.042 to 0.061 (log IRR)
Q6
Now fit a random effects model to take account of within child clustering.
Show the code
mod_gamma <-coxph(Surv(pyrs, any) ~ vacc +frailty(id, distribution ="gamma"), data = papua)tbl_regression(mod_gamma,exponentiate = T)
Characteristic
HR
95% CI
p-value
vacc
placebo
—
—
vaccine
0.88
0.77, 1.00
0.046
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio
Coding explanation
We are using a Cox regression with a fraitly term. In the stata exercise, it has been calculated using a parametric model
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio
Answer
Controlling for age and sex results in an adjusted rate ratio of 0.91, compared with an unadjusted rate ratio of 0.88. The evidence for a vaccine effect has weakened a little (P-value is now 0.10). Given that this was a fairly large RCT we would not expect strong confounding by age or sex (this can be checked by cross-tabulating age and sex at enrolment against vaccine group).
Q10
Open the data set (hhtb.dta) and explore the dataset (glimpse) or click on it in the environment tab.
Each record represents an individual household contact of a TB case. The variable id indicates which household contacts belong to the same household (and hence are contacts of the same index case). Use the following commands to examine the distribution of contacts per household by HIV status of the index case.
# same answer from stata# hhtb |> # group_by(id) |> # count(hiv) |> # filter(!is.na(hiv)) |># ungroup() |> # mutate(n = factor(n, sort(unique(n)), ordered = T)) |> # tbl_summary(include = n, by = hiv)
Answer
The answer in STATA show you the table. I would rather see the distribution with a histogram. Here is showing the distribution of number of cases (mantoux=1) per household and HIV status.
Q12
Perform a similar analysis to examine the distribution of contacts per household by duration of cough of the index case. Why are only 58 households shown?
Show the code
hhtb |>group_by(id) |>count(cough) |>filter(!is.na(cough)) |>ungroup() |>mutate(n =factor(n, sort(unique(n)), ordered = T)) |>tbl_summary(include = n, by = cough)
Characteristic
1 N = 321
2 N = 261
n
1
5 (16%)
2 (7.7%)
2
3 (9.4%)
5 (19%)
3
5 (16%)
7 (27%)
4
5 (16%)
4 (15%)
5
4 (13%)
2 (7.7%)
6
2 (6.3%)
4 (15%)
8
3 (9.4%)
1 (3.8%)
9
2 (6.3%)
1 (3.8%)
10
1 (3.1%)
0 (0%)
11
1 (3.1%)
0 (0%)
13
1 (3.1%)
0 (0%)
1 n (%)
Q13
For household contacts, examine the distribution of tuberculin positivity by the duration of cough in the index case. Ignoring any clustering, what is the estimated odds ratio and associated 95% c.i.? What would you conclude about the association between duration of cough in the index case and tuberculin positivity in household contacts on the basis of this analysis (if you were unaware of the clustering issue)?
Show the code
table(hhtb$mantoux,hhtb$cough) |>fisher.test()
Fisher's Exact Test for Count Data
data: table(hhtb$mantoux, hhtb$cough)
p-value = 0.04
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.0 3.1
sample estimates:
odds ratio
1.8
Outcome + Outcome - Total Inc risk *
Exposed + 72 33 105 68.57 (58.78 to 77.28)
Exposed - 82 67 149 55.03 (46.68 to 63.18)
Total 154 100 254 60.63 (54.33 to 66.68)
Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio 1.25 (1.03, 1.51)
Inc odds ratio 1.78 (1.06, 3.01)
Attrib risk in the exposed * 13.54 (1.59, 25.48)
Attrib fraction in the exposed (%) 19.74 (2.51, 33.93)
Attrib risk in the population * 5.60 (-4.40, 15.59)
Attrib fraction in the population (%) 9.23 (0.49, 17.20)
-------------------------------------------------------------------
Uncorrected chi2 test that OR = 1: chi2(1) = 4.729 Pr>chi2 = 0.030
Fisher exact test that OR = 1: Pr>chi2 = 0.037
Wald confidence limits
CI: confidence interval
* Outcomes per 100 population units
Show the code
mod <-glm(mantoux~ cough, family=binomial, data=hhtb)parameters::model_parameters(mod, exponentiate=T)
Parameter | Odds Ratio | SE | 95% CI | z | p
--------------------------------------------------------------
(Intercept) | 0.64 | 0.25 | [0.30, 1.36] | -1.16 | 0.247
cough | 1.78 | 0.48 | [1.06, 3.03] | 2.16 | 0.030
Coding explanation
We can get odds ratios through multiple ways. None of them matched with STATA solution (Why use MH when there is no stratification?!)
Q14
Using the logit command, obtain a 95% confidence interval for the odds ratio based on the robust standard error taking account of clustering. Are your conclusions about the association between duration of cough in the index case and tuberculin positivity in household contacts different from question 13 above?
Parameter | Odds Ratio | SE | 95% CI | z | p
--------------------------------------------------------------
(Intercept) | 0.64 | 0.32 | [0.24, 1.73] | -0.88 | 0.377
cough | 1.78 | 0.62 | [0.90, 3.52] | 1.67 | 0.096
Answer
The confidence interval has widened to (0.90, 3.52) and we should be (even) more sceptical than before about the strength of evidence for an association (P having increased from 0.03 to 0.10). After taking clustering into account, there is only weak evidence against the null hypothesis.
Q15
Now estimate the odds ratio and obtain confidence intervals based on the GEE approach with robust standard errors and assuming an exchangeable correlation matrix. How do your results change from those in question 14?
Show the code
library(geepack)# observations from the same id should be contiguousdep_gee <-geeglm(mantoux ~ cough,data =drop_na(hhtb, cough), #geeglm doesnt accept missing valuesid = id, family ="binomial",corstr ="exchangeable")parameters::model_parameters(dep_gee, exponentiate=T)
Parameter | Odds Ratio | SE | 95% CI | Chi2(1) | p
----------------------------------------------------------------
(Intercept) | 0.58 | 0.30 | [0.21, 1.60] | 1.10 | 0.294
cough | 1.88 | 0.65 | [0.95, 3.71] | 3.27 | 0.071
Answer
The point estimate of the odds ratio for duration of cough is 1.87 with 95% c.i. (0.97, 3.60), P=0.07. The odds ratio is slightly larger than that obtained using only robust standard errors (1.78). The standard error of the log odds ratio is very similar to that obtained from the previous “robust” analysis (0.35 ~ 0.35).
The PI and CI in Stata is different: 1.88 with 95% c.i. (0.94, 3.73) (I dont know why)
Q16
Use the xtlogit command again, but this time with the re option, to fit a random effects model for tuberculin positivity in terms of the duration of cough in the index case. Is there evidence from this model of between-household variation (within-household correlation)? Use the quadchk,nooutput command to check the reliability of the estimates from the random effects model.
Show the code
tb_mixed <-glmer(mantoux ~ cough + (1|id),data = hhtb,family ="binomial")tbl_laplace <- tb_mixed |>tbl_regression(exponentiate=T)# You can get using adptative Gaussian quadrature using thistb_mixed <-glmer(mantoux ~ cough + (1|id),data = hhtb,family ="binomial",nAGQ =12)tbl_quadrad <- tb_mixed |>tbl_regression(exponentiate=T)tbl_merge(list(tbl_laplace, tbl_quadrad),tab_spanner =c("Laplace","Adaptative"))
Characteristic
Laplace
Adaptative
OR
95% CI
p-value
OR
95% CI
p-value
Duration of cough
2.15
0.96, 4.80
0.061
2.19
0.94, 5.08
0.069
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Show the code
#m2 <- GLMMadaptive::mixed_model(fixed = mantoux ~ cough, random = ~ 1 | id, data = hhtb,# family = binomial())
Answer
The estimated odds ratio is 2.19 (95% c.i. 0.94, 5.08; P=0.07). The estimate of the odds ratio is larger than that obtained from the GEE analysis (1.88).
Ps.:The default of glmer function uses a Laplace approximation and not the adaptative Gaussian quadrature
Coding explanation
I dont think there is a similar function as “quadchk” from Stata in R. You need to refit the model with different quadrature points (nAGQ)
Q17
Use a random effects model to investigate the association between duration of cough in the index case and odds of tuberculin positivity, controlling for the effect of index cases’ HIV status and the age of the household contact, taking account of within-household clustering. Write down some sentences to summarise and interpret your results.
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Answer
In this model we have controlled for both a cluster (household) level (HIV status of the index case) and an individual level covariate (age of the contact). “After controlling for the HIV status of the index case and the age of the contact, the OR for the association between duration of cough of 2+ months and mantoux positivity is 1.85. Compared to the OR from the unadjusted random effects model (2.19) the OR is somewhat reduced suggesting some counfounding by one or both of these variables.”
Q18
Use GEE to investigate the association between duration of cough in the index case and odds of tuberculin positivity, controlling for the effect of index cases’ HIV status and the age of the household contact, taking account of within-household clustering. What do you notice about the ORs compared to those obtained with the random effects model.
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Answer
All of the ORs are closer to 1 than those when using random effects, as expected with a population average model.
Show the code
anova(logit_gee, logit_gee_noage)
Analysis of 'Wald statistic' Table
Model 1 mantoux ~ cough + hiv + agegrp
Model 2 mantoux ~ cough + hiv
Df X2 P(>|Chi|)
1 2 31 0.00000019 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Answer
The null hypothesis is that after taking account of the index case’s HIV status and cough duration there is no association between the age of the household contact and their Mantoux test status. The alternative hypothesis is that after taking account of the index case’s HIV status and cough duration there is an association between the age of the household contact and their Mantoux test status. There is strong evidence against the null. Note that the test has 2 df because we are testing simultaneously that 2 ORs are both equal to 1.