You have been asked to design an (unmatched) cluster-randomised trial to measure the impact of immunisation with a newly developed pneumococcal vaccine on all-cause child mortality in a rural African population. Villages are to be randomly allocated to the intervention or control arms. In the intervention arm, all infants will receive the pneumococcal vaccine in a 3 dose regimen before the age of 6 months, and will then be followed up for two years (to age 30 months) to record mortality. In the control arm, infants will receive a placebo vaccine and will be followed up in the same way.
Current child mortality in the age-range 6-29 months is estimated to be 35 per 1000 child-years, and you estimate that this is likely to vary between 25 and 45 per 1000 child-years in most villages. You wish to have 80% power of detecting a 20% reduction in all-cause mortality due to the pneumococcal vaccine. Each year, around 100 infants reach the age of six months in each village, and you plan to recruit infants to the trial for three years.
Estimate roughly how many villages you will need to recruit to each arm of the trial. What is the design effect?
Show the code
library(haven) # create tablelibrary(gtsummary)# data wrangling and other functionslibrary(tidyverse)# if you want to have a output similar to statalibrary(tidylog)#coxlibrary(survival)# Limit significant digits to 3, remove scientific notationoptions(digits =3, scipen =9)
Show the code
t <-3m <-100g <-2lc <-0.035le <-0.035*0.8alpha <-0.05power <-0.8CV <-0.14 T <- t * m * g IFt <- ((lc +le)/ T + (CV^2) * (lc^2+ le^2)) f <- (qnorm(1- alpha/2) +qnorm(power))^2 sample <- (IFt * f)/(lc - le)^2+1
Q2
This question uses data from the Gambian National Impregnated Bednet Programme, datasets GAMINDIV.DTA and GAMVILL.DTA. The data are from a cross-sectional survey of children aged 1-4 years conducted at the end of the transmission season in a sample of villages from each treatment arm. See course manual for further details.
Estimate the overall parasite prevalence for each group: (i) by summing over all children (GAMINDIV.DTA); and (ii) by averaging village prevalences (GAMVILL.DTA). Is there any difference between these two estimates?
Test the hypothesis that net impregnation reduces the prevalence of malaria parasites using the standard chi-squared test, ignoring the randomisation by village (use GAMINDIV.DTA).
Pearson's Chi-squared test with Yates' continuity correction
data: tab1
X-squared = 6, df = 1, p-value = 0.01
Test the same null hypothesis, but allowing for randomisation by village, by using (i) a t-test on village prevalences; (ii) a Wilcoxon rank sum test on village prevalences.
Show the code
t.test(rate~group, data=dt_cluster,var.equal = T)
Two Sample t-test
data: rate by group
t = -2, df = 32, p-value = 0.06
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
-0.23729 0.00673
sample estimates:
mean in group 1 mean in group 2
0.277 0.392
Wilcoxon rank sum test
data: rate by group
W = 96, p-value = 0.09
alternative hypothesis: true location shift is not equal to 0
Coding explanation
The default value for t.test in R is considering unequal variances (while in stata is considering equal variances), that’s why we need to include “var.equal=T” to get the same answers. Similarly, for the Wilcoxon, by default it applies a continuity correction in the normal approximation for the p-value (not by default in stata)
Compare the results you obtain from (d) with your unadjusted result from (c). How does the adjustment for clustering change your interpretation of the effect of bednet impregnation on parasite status?
Answer
The standard \(\chi^2\) test shows a highly significant effect of treatment, and is invalid! The other tests, which allow for the clustered design, give larger P-values indicating only weak evidence that the treatment has an effect on parasite status.
OPTIONAL: Obtain a 95% confidence interval for the ratio of parasite prevalence in the intervention and control arms.
Show the code
#get the risks# dt_ind |> group_by(group) |> # count(para) |> # mutate(risk=n/sum(n)) |> # filter(para==1) |> # ungroup() |> # mutate(rr = risk[group=="1"]/risk[group=="2"])# Risk in each armr0 <-0.367r1 <-0.298# Risk ratio RR <- r1/r0logRR <-log(RR)# Standard deviations of rates in each group (by cluster )#dt_cluster |> group_by(group) |> summarise(sd(rate)) - get the SD s1 <-0.15s0 <-0.196# Numbers of clustersc1 <-17c0 <-17# Variances of risksvar_r1 <- (s1^2)/c1var_r0 <- (s0^2)/c0# Variance of the log RRvar_logRR <- var_r1/(r1^2) + var_r0/(r0^2)# Confidence intervalslower_CI <-exp(logRR -1.96*sqrt(var_logRR))upper_CI <-exp(logRR +1.96*sqrt(var_logRR))data.frame(rr= RR, lcl = lower_CI, ucl = upper_CI)
rr
lcl
ucl
0.812
0.573
1.15
Using GAMINDIV.DTA, reanalyse the data from this trial using GEE and random effects logistic regression. How do your results compare with those from (d)?
Show the code
library(geepack)library(lme4)dt_ind <- dt_ind |>mutate(para_bin = (if_else(para==1,1,0)),vid_fac =as.factor(vid),group_bin =as.factor(if_else(group==2,0,1))) dt_ind_order <- dt_ind |>arrange(vid_fac) # geeglm (geepack) assumes that the observations from the same cluster appears contiguousmod_unorder <-geeglm(para_bin~group_bin,id = vid_fac,data=dt_ind,corstr ="exchangeable",family =binomial(link="logit"))mod_order <-geeglm(para_bin~group_bin,id = vid_fac,data=(dt_ind_order),corstr ="exchangeable",family =binomial(link="logit"))#broom::tidy(mod_unorder, exponentiate=T) #check that it gives the wrong answerbroom::tidy(mod_order, exponentiate = T) |> gt::gt()
term
estimate
std.error
statistic
p.value
(Intercept)
0.635
0.196
5.37
0.0205
group_bin1
0.610
0.264
3.50
0.0614
Show the code
# geeM doesnt require ordering# mod <- geeM::geem(para_bin~group_bin,# id = vid_fac,# data=as.data.frame(dt_ind),# corstr = "exchangeable",# family = binomial(link="logit"))# summary(mod)mod <-glmer(para_bin~group_bin + (1|vid),data=dt_ind,family ="binomial")broom.mixed::tidy(mod, exponentiate=T, effects="fixed") |> gt::gt()
effect
term
estimate
std.error
statistic
p.value
fixed
(Intercept)
0.593
0.124
-2.50
0.0124
fixed
group_bin1
0.584
0.174
-1.81
0.0705
Q3
Dataset MZTRIAL.DTA contains individual-level data from the Mwanza trial of the impact of improved STD treatment services on HIV incidence. There were 12 communities in 6 matched pairs, and a random cohort of around 1000 adults aged 15-54 years was followed up in each community. The dataset contains data on HIV seroconversion among those individuals who were seronegative at baseline and who were successfully followed up after two years. We are going to analyse the results of the trial with and without adjustment for confounding variables (see Section 6).
Using the following commands, first use logistic regression to adjust the results for possible confounding variables, including age, sex, matched pair and the baseline HIV prevalence in each community:
model = glm(hiv ~ agegp + sex + pair + hivbase)predict(model, type="response")
Show the code
mod_hiv_prob <-glm(hiv ~ agegp + sex + pair + hivbase,family = binomial,data = dt_mzt)dt_mzt$fitted_hiv_prob <-predict(mod_hiv_prob, type ="response")
The predict takes a model and give the prediction using a new dataset. In the case of no “newdata” being declared, it will use the dataset used for fitting the model. The type="response" is related to give the prediction in the response scale, i.e., in the probability scale
Now use collapse to sum up the observed and fitted values for each of the 12 communities:
Unadjusted analysis: Compute the unadjusted RR for each matched pair (obtained using hiv and n from (b)). Carry out a paired t-test on the log(RR). Using the mean of the log(RR), find the geometric mean (remember: t test of the log-transformed data is a test of differences between geometric means) of the RR over all matched pairs, and obtain a 95% confidence interval for this. Adjusted analysis: Repeat the above calculations on the adjusted RR for each matched pair (obtained using hiv and fv from (b)).