2: Review of Logistic Regression

Author

Thiago Cerqueira Silva

All code solutions are hidden by default, you can see the underlying code clicking in “Show the code”

Q1

If you were to investigate the association between HIV status and (lifetime) number of sexual partners (npa) you could treat npa as a factor with 4 levels. Using the terminology in section 5 (pages 10-12) of the lecture notes, write down a model which could be fitted using logistic regression and explain what your terms are.

Suppose you wanted to explore the joint effect of npa with years of schooling (ed2), treating the latter as a 2 level variable (none vs some). Write down a model which includes npa, ed2 and their interaction and describe what the terms in your model would be.

Answer

\[\log\left[ \frac { P( \operatorname{case} = \operatorname{1} ) }{ 1 - P( \operatorname{case} = \operatorname{1} ) } \right] = \alpha + \beta_{1}(\operatorname{npa})\] \[\log\left[ \frac { P( \operatorname{case} = \operatorname{1} ) }{ 1 - P( \operatorname{case} = \operatorname{1} ) } \right] = \alpha + \beta_{1}(\operatorname{npa})+\beta_{2}(\operatorname{ed2})\]

Q2

Computer praticals

Show the code

# Data wrangling
# read "dta" files
library(haven) 
# create table
library(gtsummary)
# data wrangling and other functions
library(tidyverse)
# if you want to have a output similar to stata
library(tidylog)

# Limit significant digits to 2, remove scientific notation
options(digits = 2, scipen = 9)

Show the code

# make sure the dataset is in your working directory
# Data import
mwanza <- read_dta("datasets/MWANZA.DTA")
# Data tidying
# Recode missing values
# look at the documentation of across https://dplyr.tidyverse.org/reference/across.html
mwanza <- mwanza |> 
  mutate(
    across(
      c(
        ud,
        rel,
        bld,
        npa,
        pa1,
        eth,
        inj,
        msta,
        skin,
        fsex,
        usedc),
      ~na_if(.x,9)
      )
    )


# Create a vector of categorical variable names
mwanza <- mwanza |> 
  mutate(
    across(
      c(
        ud,
        rel,
        bld,
        npa,
        pa1,
        eth,
        inj,
        msta,
        skin,
        fsex,
        usedc),
      ~na_if(.x,9)
      )
    )


# Make them all categorical
# every variable except id 
mwanza <- mwanza |> 
  mutate(
    across(
      -c(idno),
      as.factor
      )
    )

# Create a new variable, relevel and label
mwanza <- mwanza |> 
  mutate(age2 =  as.factor(
    case_when(
      age1 == "1" | age1 == "2"  ~ "15-24",
      age1 == "3" | age1 == "4" ~ "25-34",
      age1 == "5" | age1 == "6" ~ "35+")),
    age2 = fct_relevel(age2, "15-24"))

2a

Obtain a frequency table of npa. What is the most commonly occurring number of lifetime sexual partners? Where npa is missing (9) recode to STATA’s own missing value code (.) using the command mvdecode or recode. Form a cross-tabulation of number of lifetime sexual partners with HIV status.

Show the code

mwanza |> tbl_summary(include = c(npa))

Characteristic	N = 763¹
npa
1	200 (27%)
2	369 (50%)
3	123 (17%)
4	43 (5.9%)
Unknown	28
¹ n (%)

What is the most common number of lifetime sexual partners?

Cross-tabulate number of lifetime sexual partners with HIV status.

Show the code

mwanza |> tbl_summary(include = c(npa), by = case)

Characteristic	0 N = 574¹	1 N = 189¹
npa
1	173 (31%)	27 (15%)
2	277 (50%)	92 (50%)
3	83 (15%)	40 (22%)
4	19 (3.4%)	24 (13%)
Unknown	22	6
¹ n (%)

2b

Fit a logistic model to estimate the strength of association between npa and HIV status, treating npa as a factor. Is there evidence for an association between npa and HIV? What do you conclude?

Show the code

glm(case ~ npa,
    family = binomial,
    data = mwanza) |> 
  tbl_regression(exponentiate = T) |> 
  add_global_p(keep=T)

Characteristic	OR	95% CI	p-value
npa			<0.001
1	—	—
2	2.13	1.35, 3.46	0.002
3	3.09	1.78, 5.42	<0.001
4	8.09	3.95, 17.0	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Is there evidence of association?

Answer

There is strong evidence of an association between the number of lifetime partners and being HIV positive.

2c

A more convenient way to fit this model is to use the most prevalent group as a baseline. Which group was used in (b)? Why does it make sense to use the most prevalent group as baseline? Refit the model using the most prevalent group as baseline.

Show the code

# Relevel the factor
mwanza <- mwanza |> 
  # the first argument of fct_relevel is the variable and the second argument the level you 
  # want to be the reference level
  mutate(npa = fct_relevel(npa,"2")) 

# Logistic regression (unchanged)
glm(case ~ npa,
    family = binomial,
    data = mwanza) |> 
  tbl_regression(exponentiate = T) |> 
  add_global_p(keep=T)

Characteristic	OR	95% CI	p-value
npa			<0.001
2	—	—
1	0.47	0.29, 0.74	0.002
3	1.45	0.92, 2.26	0.10
4	3.80	2.00, 7.34	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Answer

If we use the most prevalent group as the baseline, then the SEs for the \(\beta\)s (log(OR)) will be smaller.

2d

Amend your model to control for the confounding effect of age treated as a factor (age1).

Show the code

glm(case ~ npa + age1,
    family = binomial,
    data = mwanza) |> 
  tbl_regression(exponentiate = T) |> 
  add_global_p(keep=T)

Characteristic	OR	95% CI	p-value
npa			<0.001
2	—	—
1	0.51	0.31, 0.81	0.006
3	1.30	0.82, 2.04	0.3
4	4.75	2.43, 9.47	<0.001
age1			<0.001
1	—	—
2	3.27	1.69, 6.69	<0.001
3	2.50	1.24, 5.28	0.013
4	1.93	0.93, 4.15	0.082
5	1.26	0.61, 2.73	0.5
6	0.81	0.35, 1.87	0.6
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Show the code

# if you want to see the more details about the overall signficance of each variable
# anova(glm(case ~ npa + age1,
#     family = binomial,
#     data = mwanza))

What is your conclusion?

Answer

Adding age1 to the model changes the odds ratios for npa (0.51, 1.30, 4.75) showing the confounding effect of age. After adjusting for age, there is strong evidence that npa is associated with being a case (LRT gives X32=35.44, p<0.001).

2e. Summary table

Summarise the results of the analyses conducted above in a table - show the distribution of number of partners in cases and controls, odds ratios (unadjusted and age-adjusted effect of number of lifetime partners in a table), 95% CIs, p-values.

Show the code

# you can do all three itens in almost one line of code (in R)

library(finalfit)

explanatory = c("npa", "age1")
dependent = "case"
mwanza |> 
    finalfit(dependent, explanatory) -> table1
knitr::kable(table1, row.names=FALSE, align=c("l", "l", "r", "r", "r"))

Dependent: case		0	1	OR (univariable)	OR (multivariable)
npa	2	277 (75.1)	92 (24.9)	-	-
	1	173 (86.5)	27 (13.5)	0.47 (0.29-0.74, p=0.002)	0.51 (0.31-0.81, p=0.006)
	3	83 (67.5)	40 (32.5)	1.45 (0.92-2.26, p=0.101)	1.30 (0.82-2.04, p=0.264)
	4	19 (44.2)	24 (55.8)	3.80 (2.00-7.34, p<0.001)	4.75 (2.43-9.47, p<0.001)
age1	1	96 (88.1)	13 (11.9)	-	-
	2	108 (65.5)	57 (34.5)	3.90 (2.07-7.84, p<0.001)	3.27 (1.69-6.69, p=0.001)
	3	84 (68.3)	39 (31.7)	3.43 (1.75-7.08, p<0.001)	2.50 (1.24-5.28, p=0.013)
	4	85 (72.0)	33 (28.0)	2.87 (1.45-5.98, p=0.003)	1.93 (0.93-4.15, p=0.082)
	5	107 (78.1)	30 (21.9)	2.07 (1.04-4.32, p=0.044)	1.26 (0.61-2.73, p=0.539)
	6	94 (84.7)	17 (15.3)	1.34 (0.62-2.95, p=0.465)	0.81 (0.35-1.87, p=0.612)

Q3

3a

The risk of HIV associated with npa is confounded by attending school (using ed2)

Show the code

glm(case ~ npa + age1 + ed2,
    family = binomial,
    data = mwanza) |> 
  tbl_regression(exponentiate = T) |> 
  add_global_p(keep=T)

Characteristic	OR	95% CI	p-value
npa			<0.001
2	—	—
1	0.50	0.30, 0.81	0.006
3	1.24	0.78, 1.96	0.4
4	4.38	2.22, 8.82	<0.001
age1			0.003
1	—	—
2	3.20	1.66, 6.56	<0.001
3	2.62	1.30, 5.56	0.009
4	2.28	1.09, 4.97	0.031
5	1.59	0.75, 3.49	0.2
6	1.26	0.52, 3.06	0.6
ed2			0.001
0	—	—
1	1.97	1.30, 3.04	0.002
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Answer

Educational status does not appear to be a strong confounder of the relationship between npa and HIV-infection, after adjusting for age1 as the adjusted ORs are 0.50, 1.24, and 4.38, in similar to the adjusted ORs reported in 2

3b.

There is any evidence the risk of HIV associated with npa differs according to whether the women had attended school.

Show the code

# Model with interaction
logit_inter <- glm(case ~ npa * ed2 + age1,
    family = binomial,
    data = mwanza)

# Model without interaction
logit_without <- glm(case ~ npa + ed2 + age1,
    family = binomial,
    data = mwanza)

# Likelihood ratio test
lmtest::lrtest(logit_without, logit_inter)

#Df	LogLik	Df	Chisq	Pr(>Chisq)
10	-374	NA	NA	NA
13	-373	3	0.5	0.92

Show the code

# Note that ANOVA gives you the same χ statistic and df
#anova(logit_without, logit_inter)

Answer

LRT gives \(\chi_{3}^2\)=0.50, p=0.92 suggesting data are compatible with no interaction between lifetime partners and educational status on HIV status.

Q4

We will now look at the potential interaction effect between age1 and npa. This is a more complex situation because both have more than 2 levels. Ignore ed2 to keep the model simpler.

4a

Fit a model including npa, age1 and their interaction, all terms treated as categorical variables.

What happens when you carry out a LR test of the interaction term?

Show the code

# You will get multiple warnings of non convergence / fitted probabilities of 0/1
glm(case ~ npa * age1,
    family = binomial,
    data = mwanza)|> 
  tbl_regression(exponentiate = T) |> 
  add_global_p(keep=T)

Characteristic	OR	95% CI	p-value
npa			0.025
2	—	—
1	0.44	0.10, 1.78	0.2
3	6.40	1.19, 36.7	0.030
4	0.00		>0.9
age1			<0.001
1	—	—
2	3.98	1.52, 12.6	0.009
3	3.20	1.14, 10.5	0.037
4	2.23	0.77, 7.44	0.2
5	1.60	0.55, 5.32	0.4
6	0.70	0.20, 2.60	0.6
npa * age1			0.2
1 * 2	1.06	0.20, 5.62	>0.9
3 * 2	0.22	0.03, 1.43	0.11
4 * 2	97,773	0.00,	>0.9
1 * 3	0.62	0.08, 4.12	0.6
3 * 3	0.18	0.03, 1.20	0.076
4 * 3	1,217,553	0.00,	>0.9
1 * 4	1.62	0.24, 10.5	0.6
3 * 4	0.18	0.02, 1.33	0.095
4 * 4	523,548	0.00,	>0.9
1 * 5	0.70	0.07, 5.21	0.7
3 * 5	0.10	0.01, 0.81	0.035
4 * 5	1,095,798	0.00,	>0.9
1 * 6	5.19	0.77, 35.7	0.088
3 * 6	0.20	0.01, 2.81	0.3
4 * 6	478,324	0.00,	>0.9
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Tabulate the number of cases in each grouping of npa vs age1 (see note below).

Show the code

mwanza |> tbl_summary(include = c(age1), by = npa)

Characteristic	2 N = 369¹	1 N = 200¹	3 N = 123¹	4 N = 43¹
age1
1	37 (10%)	62 (31%)	8 (6.5%)	1 (2.3%)
2	86 (23%)	40 (20%)	28 (23%)	3 (7.0%)
3	57 (15%)	25 (13%)	33 (27%)	6 (14%)
4	58 (16%)	20 (10%)	24 (20%)	10 (23%)
5	70 (19%)	28 (14%)	22 (18%)	13 (30%)
6	61 (17%)	25 (13%)	8 (6.5%)	10 (23%)
¹ n (%)

What is the problem and how can it be resolved?

Answer

This problem arises because there are no cases in the highest sex partner group and the youngest age group. Sparse data (bias), other names is complete/quasi separation problem.

4b

In order to test for interaction we need to combine npa group 4 with npa group 3. Generate a new variable (eg npa2) with this regrouping. Refit the interaction model using npa2 in place of npa. Is there evidence of any interaction?

Show the code

# Create a new variable, relevel and label
mwanza <- mwanza |> 
  mutate(partners = case_when(npa == "1" ~ "<=1",
              npa == "2" ~ "2-4",
              npa == "3" | npa == "4" ~ ">=5"),
    partners = fct_relevel(partners,"2-4")
  )

# Check it worked well
mwanza |> tbl_summary(include = c(age1), by = partners)

Characteristic	2-4 N = 369¹	<=1 N = 200¹	>=5 N = 166¹
age1
1	37 (10%)	62 (31%)	9 (5.4%)
2	86 (23%)	40 (20%)	31 (19%)
3	57 (15%)	25 (13%)	39 (23%)
4	58 (16%)	20 (10%)	34 (20%)
5	70 (19%)	28 (14%)	35 (21%)
6	61 (17%)	25 (13%)	18 (11%)
¹ n (%)

Show the code

glm(case ~ partners * age1,
    family = binomial,
    data = mwanza)|> 
  tbl_regression(exponentiate = T) |> 
  add_global_p(keep=T)

Characteristic	OR	95% CI	p-value
partners			0.018
2-4	—	—
<=1	0.44	0.10, 1.78	0.2
>=5	5.12	0.99, 27.1	0.048
age1			<0.001
1	—	—
2	3.98	1.52, 12.6	0.009
3	3.20	1.14, 10.5	0.037
4	2.23	0.77, 7.44	0.2
5	1.60	0.55, 5.32	0.4
6	0.70	0.20, 2.60	0.6
partners * age1			0.5
<=1 * 2	1.06	0.20, 5.62	>0.9
>=5 * 2	0.26	0.04, 1.62	0.14
<=1 * 3	0.62	0.08, 4.12	0.6
>=5 * 3	0.30	0.05, 1.90	0.2
<=1 * 4	1.62	0.24, 10.5	0.6
>=5 * 4	0.35	0.05, 2.25	0.3
<=1 * 5	0.70	0.07, 5.21	0.7
>=5 * 5	0.41	0.06, 2.65	0.3
<=1 * 6	5.19	0.77, 35.7	0.088
>=5 * 6	0.51	0.06, 4.33	0.5
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Show the code

# Model with interaction
logit_inter <- glm(case ~ partners * age1,
    family = binomial,
    data = mwanza)

# Model without interaction
logit_without <- glm(case ~ partners + age1,
    family = binomial,
    data = mwanza)

# Likelihood ratio test
lmtest::lrtest(logit_without, logit_inter)

#Df	LogLik	Df	Chisq	Pr(>Chisq)
8	-385	NA	NA	NA
18	-380	10	9.4	0.49

Answer

Using partners variable in the analysis and conducting the interaction and no interaction model, the data are compatible with null hypothesis of no interaction: LRT gives \(\chi_{10}^2\)=9.43, p=0.49.

Q5

What other possible workarounds can you come up with for the issue identified in 4a.?

Answer

One alternative is to treat both npa and age1 as trend terms (ordinal values) for the interaction term

Q6

Estimates of Partners by age group with the interaction

Show the code

library(marginaleffects)
mod <- glm(case ~ partners * age1,
    family = binomial,
    data = mwanza)

knitr::kable(avg_comparisons(mod, variables = "partners",
                by = "age1",
                comparison = "lnor", # do the comparison in Log Odds
                transform = "exp")#back transform to odds ratio
      )

term	contrast	age1	estimate	p.value	s.value	conf.low	conf.high
partners	ln(odds(<=1) / odds(2-4))	1	0.44	0.25	2.02	0.11	1.8
partners	ln(odds(>=5) / odds(2-4))	1	5.12	0.05	4.39	1.02	25.8
partners	ln(odds(<=1) / odds(2-4))	2	0.47	0.08	3.61	0.20	1.1
partners	ln(odds(>=5) / odds(2-4))	2	1.32	0.51	0.97	0.58	3.0
partners	ln(odds(<=1) / odds(2-4))	3	0.27	0.05	4.19	0.07	1.0
partners	ln(odds(>=5) / odds(2-4))	3	1.55	0.31	1.69	0.67	3.6
partners	ln(odds(<=1) / odds(2-4))	4	0.72	0.60	0.74	0.21	2.5
partners	ln(odds(>=5) / odds(2-4))	4	1.77	0.22	2.21	0.72	4.4
partners	ln(odds(<=1) / odds(2-4))	5	0.31	0.14	2.87	0.07	1.4
partners	ln(odds(>=5) / odds(2-4))	5	2.09	0.11	3.14	0.84	5.2
partners	ln(odds(<=1) / odds(2-4))	6	2.29	0.21	2.26	0.63	8.3
partners	ln(odds(>=5) / odds(2-4))	6	2.62	0.18	2.51	0.65	10.6