All code solutions are hidden by default, you can see the underlying code clicking in “Show the code”
Q1
If you were to investigate the association between HIV status and (lifetime) number of sexual partners (npa) you could treat npa as a factor with 4 levels. Using the terminology in section 5 (pages 10-12) of the lecture notes, write down a model which could be fitted using logistic regression and explain what your terms are.
Suppose you wanted to explore the joint effect of npa with years of schooling (ed2), treating the latter as a 2 level variable (none vs some). Write down a model which includes npa, ed2 and their interaction and describe what the terms in your model would be.
# Data wrangling# read "dta" fileslibrary(haven) # create tablelibrary(gtsummary)# data wrangling and other functionslibrary(tidyverse)# if you want to have a output similar to statalibrary(tidylog)# Limit significant digits to 2, remove scientific notationoptions(digits =2, scipen =9)
Show the code
# make sure the dataset is in your working directory# Data importmwanza <-read_dta("datasets/MWANZA.DTA")# Data tidying# Recode missing values# look at the documentation of across https://dplyr.tidyverse.org/reference/across.htmlmwanza <- mwanza |>mutate(across(c( ud, rel, bld, npa, pa1, eth, inj, msta, skin, fsex, usedc),~na_if(.x,9) ) )# Create a vector of categorical variable namesmwanza <- mwanza |>mutate(across(c( ud, rel, bld, npa, pa1, eth, inj, msta, skin, fsex, usedc),~na_if(.x,9) ) )# Make them all categorical# every variable except id mwanza <- mwanza |>mutate(across(-c(idno), as.factor ) )# Create a new variable, relevel and labelmwanza <- mwanza |>mutate(age2 =as.factor(case_when( age1 =="1"| age1 =="2"~"15-24", age1 =="3"| age1 =="4"~"25-34", age1 =="5"| age1 =="6"~"35+")),age2 =fct_relevel(age2, "15-24"))
2a
Obtain a frequency table of npa. What is the most commonly occurring number of lifetime sexual partners? Where npa is missing (9) recode to STATA’s own missing value code (.) using the command mvdecode or recode. Form a cross-tabulation of number of lifetime sexual partners with HIV status.
Show the code
mwanza |>tbl_summary(include =c(npa))
Characteristic
N = 7631
npa
1
200 (27%)
2
369 (50%)
3
123 (17%)
4
43 (5.9%)
Unknown
28
1 n (%)
What is the most common number of lifetime sexual partners?
Cross-tabulate number of lifetime sexual partners with HIV status.
Show the code
mwanza |>tbl_summary(include =c(npa), by = case)
Characteristic
0 N = 5741
1 N = 1891
npa
1
173 (31%)
27 (15%)
2
277 (50%)
92 (50%)
3
83 (15%)
40 (22%)
4
19 (3.4%)
24 (13%)
Unknown
22
6
1 n (%)
2b
Fit a logistic model to estimate the strength of association between npa and HIV status, treating npa as a factor. Is there evidence for an association between npa and HIV? What do you conclude?
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Is there evidence of association?
Answer
There is strong evidence of an association between the number of lifetime partners and being HIV positive.
2c
A more convenient way to fit this model is to use the most prevalent group as a baseline. Which group was used in (b)? Why does it make sense to use the most prevalent group as baseline? Refit the model using the most prevalent group as baseline.
Show the code
# Relevel the factormwanza <- mwanza |># the first argument of fct_relevel is the variable and the second argument the level you # want to be the reference levelmutate(npa =fct_relevel(npa,"2")) # Logistic regression (unchanged)glm(case ~ npa,family = binomial,data = mwanza) |>tbl_regression(exponentiate = T) |>add_global_p(keep=T)
Characteristic
OR
95% CI
p-value
npa
<0.001
2
—
—
1
0.47
0.29, 0.74
0.002
3
1.45
0.92, 2.26
0.10
4
3.80
2.00, 7.34
<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Answer
If we use the most prevalent group as the baseline, then the SEs for the \(\beta\)s (log(OR)) will be smaller.
2d
Amend your model to control for the confounding effect of age treated as a factor (age1).
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Show the code
# if you want to see the more details about the overall signficance of each variable# anova(glm(case ~ npa + age1,# family = binomial,# data = mwanza))
What is your conclusion?
Answer
Adding age1 to the model changes the odds ratios for npa (0.51, 1.30, 4.75) showing the confounding effect of age. After adjusting for age, there is strong evidence that npa is associated with being a case (LRT gives X32=35.44, p<0.001).
2e. Summary table
Summarise the results of the analyses conducted above in a table - show the distribution of number of partners in cases and controls, odds ratios (unadjusted and age-adjusted effect of number of lifetime partners in a table), 95% CIs, p-values.
Show the code
# you can do all three itens in almost one line of code (in R)library(finalfit)explanatory =c("npa", "age1")dependent ="case"mwanza |>finalfit(dependent, explanatory) -> table1knitr::kable(table1, row.names=FALSE, align=c("l", "l", "r", "r", "r"))
Dependent: case
0
1
OR (univariable)
OR (multivariable)
npa
2
277 (75.1)
92 (24.9)
-
-
1
173 (86.5)
27 (13.5)
0.47 (0.29-0.74, p=0.002)
0.51 (0.31-0.81, p=0.006)
3
83 (67.5)
40 (32.5)
1.45 (0.92-2.26, p=0.101)
1.30 (0.82-2.04, p=0.264)
4
19 (44.2)
24 (55.8)
3.80 (2.00-7.34, p<0.001)
4.75 (2.43-9.47, p<0.001)
age1
1
96 (88.1)
13 (11.9)
-
-
2
108 (65.5)
57 (34.5)
3.90 (2.07-7.84, p<0.001)
3.27 (1.69-6.69, p=0.001)
3
84 (68.3)
39 (31.7)
3.43 (1.75-7.08, p<0.001)
2.50 (1.24-5.28, p=0.013)
4
85 (72.0)
33 (28.0)
2.87 (1.45-5.98, p=0.003)
1.93 (0.93-4.15, p=0.082)
5
107 (78.1)
30 (21.9)
2.07 (1.04-4.32, p=0.044)
1.26 (0.61-2.73, p=0.539)
6
94 (84.7)
17 (15.3)
1.34 (0.62-2.95, p=0.465)
0.81 (0.35-1.87, p=0.612)
Q3
3a
The risk of HIV associated with npa is confounded by attending school (using ed2)
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Answer
Educational status does not appear to be a strong confounder of the relationship between npa and HIV-infection, after adjusting for age1 as the adjusted ORs are 0.50, 1.24, and 4.38, in similar to the adjusted ORs reported in 2
3b.
There is any evidence the risk of HIV associated with npa differs according to whether the women had attended school.
Show the code
# Model with interactionlogit_inter <-glm(case ~ npa * ed2 + age1,family = binomial,data = mwanza)# Model without interactionlogit_without <-glm(case ~ npa + ed2 + age1,family = binomial,data = mwanza)# Likelihood ratio testlmtest::lrtest(logit_without, logit_inter)
#Df
LogLik
Df
Chisq
Pr(>Chisq)
10
-374
NA
NA
NA
13
-373
3
0.5
0.92
Show the code
# Note that ANOVA gives you the same χ statistic and df#anova(logit_without, logit_inter)
Answer
LRT gives \(\chi_{3}^2\)=0.50, p=0.92 suggesting data are compatible with no interaction between lifetime partners and educational status on HIV status.
Q4
We will now look at the potential interaction effect between age1 and npa. This is a more complex situation because both have more than 2 levels. Ignore ed2 to keep the model simpler.
4a
Fit a model including npa, age1 and their interaction, all terms treated as categorical variables.
What happens when you carry out a LR test of the interaction term?
Show the code
# You will get multiple warnings of non convergence / fitted probabilities of 0/1glm(case ~ npa * age1,family = binomial,data = mwanza)|>tbl_regression(exponentiate = T) |>add_global_p(keep=T)
Characteristic
OR
95% CI
p-value
npa
0.025
2
—
—
1
0.44
0.10, 1.78
0.2
3
6.40
1.19, 36.7
0.030
4
0.00
>0.9
age1
<0.001
1
—
—
2
3.98
1.52, 12.6
0.009
3
3.20
1.14, 10.5
0.037
4
2.23
0.77, 7.44
0.2
5
1.60
0.55, 5.32
0.4
6
0.70
0.20, 2.60
0.6
npa * age1
0.2
1 * 2
1.06
0.20, 5.62
>0.9
3 * 2
0.22
0.03, 1.43
0.11
4 * 2
97,773
0.00,
>0.9
1 * 3
0.62
0.08, 4.12
0.6
3 * 3
0.18
0.03, 1.20
0.076
4 * 3
1,217,553
0.00,
>0.9
1 * 4
1.62
0.24, 10.5
0.6
3 * 4
0.18
0.02, 1.33
0.095
4 * 4
523,548
0.00,
>0.9
1 * 5
0.70
0.07, 5.21
0.7
3 * 5
0.10
0.01, 0.81
0.035
4 * 5
1,095,798
0.00,
>0.9
1 * 6
5.19
0.77, 35.7
0.088
3 * 6
0.20
0.01, 2.81
0.3
4 * 6
478,324
0.00,
>0.9
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Tabulate the number of cases in each grouping of npa vs age1 (see note below).
Show the code
mwanza |>tbl_summary(include =c(age1), by = npa)
Characteristic
2 N = 3691
1 N = 2001
3 N = 1231
4 N = 431
age1
1
37 (10%)
62 (31%)
8 (6.5%)
1 (2.3%)
2
86 (23%)
40 (20%)
28 (23%)
3 (7.0%)
3
57 (15%)
25 (13%)
33 (27%)
6 (14%)
4
58 (16%)
20 (10%)
24 (20%)
10 (23%)
5
70 (19%)
28 (14%)
22 (18%)
13 (30%)
6
61 (17%)
25 (13%)
8 (6.5%)
10 (23%)
1 n (%)
What is the problem and how can it be resolved?
Answer
This problem arises because there are no cases in the highest sex partner group and the youngest age group. Sparse data (bias), other names is complete/quasi separation problem.
4b
In order to test for interaction we need to combine npa group 4 with npa group 3. Generate a new variable (eg npa2) with this regrouping. Refit the interaction model using npa2 in place of npa. Is there evidence of any interaction?
Show the code
# Create a new variable, relevel and labelmwanza <- mwanza |>mutate(partners =case_when(npa =="1"~"<=1", npa =="2"~"2-4", npa =="3"| npa =="4"~">=5"),partners =fct_relevel(partners,"2-4") )# Check it worked wellmwanza |>tbl_summary(include =c(age1), by = partners)
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Show the code
# Model with interactionlogit_inter <-glm(case ~ partners * age1,family = binomial,data = mwanza)# Model without interactionlogit_without <-glm(case ~ partners + age1,family = binomial,data = mwanza)# Likelihood ratio testlmtest::lrtest(logit_without, logit_inter)
#Df
LogLik
Df
Chisq
Pr(>Chisq)
8
-385
NA
NA
NA
18
-380
10
9.4
0.49
Answer
Using partners variable in the analysis and conducting the interaction and no interaction model, the data are compatible with null hypothesis of no interaction: LRT gives \(\chi_{10}^2\)=9.43, p=0.49.
Q5
What other possible workarounds can you come up with for the issue identified in 4a.?
Answer
One alternative is to treat both npa and age1 as trend terms (ordinal values) for the interaction term
Q6
Estimates of Partners by age group with the interaction
Show the code
library(marginaleffects)mod <-glm(case ~ partners * age1,family = binomial,data = mwanza)knitr::kable(avg_comparisons(mod, variables ="partners",by ="age1",comparison ="lnor", # do the comparison in Log Oddstransform ="exp")#back transform to odds ratio )