2: Review of Logistic Regression

Author

Thiago Cerqueira Silva

All code solutions are hidden by default, you can see the underlying code clicking in “Show the code”

Q1

If you were to investigate the association between HIV status and (lifetime) number of sexual partners (npa) you could treat npa as a factor with 4 levels. Using the terminology in section 5 (pages 10-12) of the lecture notes, write down a model which could be fitted using logistic regression and explain what your terms are.

Suppose you wanted to explore the joint effect of npa with years of schooling (ed2), treating the latter as a 2 level variable (none vs some). Write down a model which includes npa, ed2 and their interaction and describe what the terms in your model would be.

\[\log\left[ \frac { P( \operatorname{case} = \operatorname{1} ) }{ 1 - P( \operatorname{case} = \operatorname{1} ) } \right] = \alpha + \beta_{1}(\operatorname{npa})\] \[\log\left[ \frac { P( \operatorname{case} = \operatorname{1} ) }{ 1 - P( \operatorname{case} = \operatorname{1} ) } \right] = \alpha + \beta_{1}(\operatorname{npa})+\beta_{2}(\operatorname{ed2})\]

Q2

Computer praticals

Show the code
# Data wrangling
# read "dta" files
library(haven) 
# create table
library(gtsummary)
# data wrangling and other functions
library(tidyverse)
# if you want to have a output similar to stata
library(tidylog)

# Limit significant digits to 2, remove scientific notation
options(digits = 2, scipen = 9)
Show the code
# make sure the dataset is in your working directory
# Data import
mwanza <- read_dta("datasets/MWANZA.DTA")
# Data tidying
# Recode missing values
# look at the documentation of across https://dplyr.tidyverse.org/reference/across.html
mwanza <- mwanza |> 
  mutate(
    across(
      c(
        ud,
        rel,
        bld,
        npa,
        pa1,
        eth,
        inj,
        msta,
        skin,
        fsex,
        usedc),
      ~na_if(.x,9)
      )
    )


# Create a vector of categorical variable names
mwanza <- mwanza |> 
  mutate(
    across(
      c(
        ud,
        rel,
        bld,
        npa,
        pa1,
        eth,
        inj,
        msta,
        skin,
        fsex,
        usedc),
      ~na_if(.x,9)
      )
    )


# Make them all categorical
# every variable except id 
mwanza <- mwanza |> 
  mutate(
    across(
      -c(idno),
      as.factor
      )
    )

# Create a new variable, relevel and label
mwanza <- mwanza |> 
  mutate(age2 =  as.factor(
    case_when(
      age1 == "1" | age1 == "2"  ~ "15-24",
      age1 == "3" | age1 == "4" ~ "25-34",
      age1 == "5" | age1 == "6" ~ "35+")),
    age2 = fct_relevel(age2, "15-24"))

2a

  1. Obtain a frequency table of npa. What is the most commonly occurring number of lifetime sexual partners? Where npa is missing (9) recode to STATA’s own missing value code (.) using the command mvdecode or recode. Form a cross-tabulation of number of lifetime sexual partners with HIV status.
Show the code
mwanza |> tbl_summary(include = c(npa))
Characteristic N = 7631
npa
    1 200 (27%)
    2 369 (50%)
    3 123 (17%)
    4 43 (5.9%)
    Unknown 28
1 n (%)

What is the most common number of lifetime sexual partners?

Cross-tabulate number of lifetime sexual partners with HIV status.

Show the code
mwanza |> tbl_summary(include = c(npa), by = case)
Characteristic 0
N = 5741
1
N = 1891
npa

    1 173 (31%) 27 (15%)
    2 277 (50%) 92 (50%)
    3 83 (15%) 40 (22%)
    4 19 (3.4%) 24 (13%)
    Unknown 22 6
1 n (%)

2b

  1. Fit a logistic model to estimate the strength of association between npa and HIV status, treating npa as a factor. Is there evidence for an association between npa and HIV? What do you conclude?
Show the code
glm(case ~ npa,
    family = binomial,
    data = mwanza) |> 
  tbl_regression(exponentiate = T) |> 
  add_global_p(keep=T)
Characteristic OR 95% CI p-value
npa

<0.001
    1
    2 2.13 1.35, 3.46 0.002
    3 3.09 1.78, 5.42 <0.001
    4 8.09 3.95, 17.0 <0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Is there evidence of association?

There is strong evidence of an association between the number of lifetime partners and being HIV positive.

2c

A more convenient way to fit this model is to use the most prevalent group as a baseline. Which group was used in (b)? Why does it make sense to use the most prevalent group as baseline? Refit the model using the most prevalent group as baseline.

Show the code
# Relevel the factor
mwanza <- mwanza |> 
  # the first argument of fct_relevel is the variable and the second argument the level you 
  # want to be the reference level
  mutate(npa = fct_relevel(npa,"2")) 

# Logistic regression (unchanged)
glm(case ~ npa,
    family = binomial,
    data = mwanza) |> 
  tbl_regression(exponentiate = T) |> 
  add_global_p(keep=T)
Characteristic OR 95% CI p-value
npa

<0.001
    2
    1 0.47 0.29, 0.74 0.002
    3 1.45 0.92, 2.26 0.10
    4 3.80 2.00, 7.34 <0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

If we use the most prevalent group as the baseline, then the SEs for the \(\beta\)s (log(OR)) will be smaller.

2d

Amend your model to control for the confounding effect of age treated as a factor (age1).

Show the code
glm(case ~ npa + age1,
    family = binomial,
    data = mwanza) |> 
  tbl_regression(exponentiate = T) |> 
  add_global_p(keep=T)
Characteristic OR 95% CI p-value
npa

<0.001
    2
    1 0.51 0.31, 0.81 0.006
    3 1.30 0.82, 2.04 0.3
    4 4.75 2.43, 9.47 <0.001
age1

<0.001
    1
    2 3.27 1.69, 6.69 <0.001
    3 2.50 1.24, 5.28 0.013
    4 1.93 0.93, 4.15 0.082
    5 1.26 0.61, 2.73 0.5
    6 0.81 0.35, 1.87 0.6
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Show the code
# if you want to see the more details about the overall signficance of each variable
# anova(glm(case ~ npa + age1,
#     family = binomial,
#     data = mwanza))

What is your conclusion?

Adding age1 to the model changes the odds ratios for npa (0.51, 1.30, 4.75) showing the confounding effect of age. After adjusting for age, there is strong evidence that npa is associated with being a case (LRT gives X32=35.44, p<0.001).

2e. Summary table

Summarise the results of the analyses conducted above in a table - show the distribution of number of partners in cases and controls, odds ratios (unadjusted and age-adjusted effect of number of lifetime partners in a table), 95% CIs, p-values.

Show the code
# you can do all three itens in almost one line of code (in R)

library(finalfit)

explanatory = c("npa", "age1")
dependent = "case"
mwanza |> 
    finalfit(dependent, explanatory) -> table1
knitr::kable(table1, row.names=FALSE, align=c("l", "l", "r", "r", "r"))
Dependent: case 0 1 OR (univariable) OR (multivariable)
npa 2 277 (75.1) 92 (24.9) - -
1 173 (86.5) 27 (13.5) 0.47 (0.29-0.74, p=0.002) 0.51 (0.31-0.81, p=0.006)
3 83 (67.5) 40 (32.5) 1.45 (0.92-2.26, p=0.101) 1.30 (0.82-2.04, p=0.264)
4 19 (44.2) 24 (55.8) 3.80 (2.00-7.34, p<0.001) 4.75 (2.43-9.47, p<0.001)
age1 1 96 (88.1) 13 (11.9) - -
2 108 (65.5) 57 (34.5) 3.90 (2.07-7.84, p<0.001) 3.27 (1.69-6.69, p=0.001)
3 84 (68.3) 39 (31.7) 3.43 (1.75-7.08, p<0.001) 2.50 (1.24-5.28, p=0.013)
4 85 (72.0) 33 (28.0) 2.87 (1.45-5.98, p=0.003) 1.93 (0.93-4.15, p=0.082)
5 107 (78.1) 30 (21.9) 2.07 (1.04-4.32, p=0.044) 1.26 (0.61-2.73, p=0.539)
6 94 (84.7) 17 (15.3) 1.34 (0.62-2.95, p=0.465) 0.81 (0.35-1.87, p=0.612)

Q3

3a

The risk of HIV associated with npa is confounded by attending school (using ed2)

Show the code
glm(case ~ npa + age1 + ed2,
    family = binomial,
    data = mwanza) |> 
  tbl_regression(exponentiate = T) |> 
  add_global_p(keep=T)
Characteristic OR 95% CI p-value
npa

<0.001
    2
    1 0.50 0.30, 0.81 0.006
    3 1.24 0.78, 1.96 0.4
    4 4.38 2.22, 8.82 <0.001
age1

0.003
    1
    2 3.20 1.66, 6.56 <0.001
    3 2.62 1.30, 5.56 0.009
    4 2.28 1.09, 4.97 0.031
    5 1.59 0.75, 3.49 0.2
    6 1.26 0.52, 3.06 0.6
ed2

0.001
    0
    1 1.97 1.30, 3.04 0.002
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Educational status does not appear to be a strong confounder of the relationship between npa and HIV-infection, after adjusting for age1 as the adjusted ORs are 0.50, 1.24, and 4.38, in similar to the adjusted ORs reported in 2

3b.

There is any evidence the risk of HIV associated with npa differs according to whether the women had attended school.

Show the code
# Model with interaction
logit_inter <- glm(case ~ npa * ed2 + age1,
    family = binomial,
    data = mwanza)

# Model without interaction
logit_without <- glm(case ~ npa + ed2 + age1,
    family = binomial,
    data = mwanza)

# Likelihood ratio test
lmtest::lrtest(logit_without, logit_inter)
#Df LogLik Df Chisq Pr(>Chisq)
10 -374 NA NA NA
13 -373 3 0.5 0.92
Show the code
# Note that ANOVA gives you the same χ statistic and df
#anova(logit_without, logit_inter)

LRT gives \(\chi_{3}^2\)=0.50, p=0.92 suggesting data are compatible with no interaction between lifetime partners and educational status on HIV status.

Q4

We will now look at the potential interaction effect between age1 and npa. This is a more complex situation because both have more than 2 levels. Ignore ed2 to keep the model simpler.

4a

Fit a model including npa, age1 and their interaction, all terms treated as categorical variables.

  • What happens when you carry out a LR test of the interaction term?
Show the code
# You will get multiple warnings of non convergence / fitted probabilities of 0/1
glm(case ~ npa * age1,
    family = binomial,
    data = mwanza)|> 
  tbl_regression(exponentiate = T) |> 
  add_global_p(keep=T)
Characteristic OR 95% CI p-value
npa

0.025
    2
    1 0.44 0.10, 1.78 0.2
    3 6.40 1.19, 36.7 0.030
    4 0.00
>0.9
age1

<0.001
    1
    2 3.98 1.52, 12.6 0.009
    3 3.20 1.14, 10.5 0.037
    4 2.23 0.77, 7.44 0.2
    5 1.60 0.55, 5.32 0.4
    6 0.70 0.20, 2.60 0.6
npa * age1

0.2
    1 * 2 1.06 0.20, 5.62 >0.9
    3 * 2 0.22 0.03, 1.43 0.11
    4 * 2 97,773 0.00,
>0.9
    1 * 3 0.62 0.08, 4.12 0.6
    3 * 3 0.18 0.03, 1.20 0.076
    4 * 3 1,217,553 0.00,
>0.9
    1 * 4 1.62 0.24, 10.5 0.6
    3 * 4 0.18 0.02, 1.33 0.095
    4 * 4 523,548 0.00,
>0.9
    1 * 5 0.70 0.07, 5.21 0.7
    3 * 5 0.10 0.01, 0.81 0.035
    4 * 5 1,095,798 0.00,
>0.9
    1 * 6 5.19 0.77, 35.7 0.088
    3 * 6 0.20 0.01, 2.81 0.3
    4 * 6 478,324 0.00,
>0.9
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
  • Tabulate the number of cases in each grouping of npa vs age1 (see note below).
Show the code
mwanza |> tbl_summary(include = c(age1), by = npa)
Characteristic 2
N = 3691
1
N = 2001
3
N = 1231
4
N = 431
age1



    1 37 (10%) 62 (31%) 8 (6.5%) 1 (2.3%)
    2 86 (23%) 40 (20%) 28 (23%) 3 (7.0%)
    3 57 (15%) 25 (13%) 33 (27%) 6 (14%)
    4 58 (16%) 20 (10%) 24 (20%) 10 (23%)
    5 70 (19%) 28 (14%) 22 (18%) 13 (30%)
    6 61 (17%) 25 (13%) 8 (6.5%) 10 (23%)
1 n (%)
  • What is the problem and how can it be resolved?

This problem arises because there are no cases in the highest sex partner group and the youngest age group. Sparse data (bias), other names is complete/quasi separation problem.

4b

In order to test for interaction we need to combine npa group 4 with npa group 3. Generate a new variable (eg npa2) with this regrouping. Refit the interaction model using npa2 in place of npa. Is there evidence of any interaction?

Show the code
# Create a new variable, relevel and label
mwanza <- mwanza |> 
  mutate(partners = case_when(npa == "1" ~ "<=1",
              npa == "2" ~ "2-4",
              npa == "3" | npa == "4" ~ ">=5"),
    partners = fct_relevel(partners,"2-4")
  )

# Check it worked well
mwanza |> tbl_summary(include = c(age1), by = partners)
Characteristic 2-4
N = 3691
<=1
N = 2001
>=5
N = 1661
age1


    1 37 (10%) 62 (31%) 9 (5.4%)
    2 86 (23%) 40 (20%) 31 (19%)
    3 57 (15%) 25 (13%) 39 (23%)
    4 58 (16%) 20 (10%) 34 (20%)
    5 70 (19%) 28 (14%) 35 (21%)
    6 61 (17%) 25 (13%) 18 (11%)
1 n (%)
Show the code
glm(case ~ partners * age1,
    family = binomial,
    data = mwanza)|> 
  tbl_regression(exponentiate = T) |> 
  add_global_p(keep=T)
Characteristic OR 95% CI p-value
partners

0.018
    2-4
    <=1 0.44 0.10, 1.78 0.2
    >=5 5.12 0.99, 27.1 0.048
age1

<0.001
    1
    2 3.98 1.52, 12.6 0.009
    3 3.20 1.14, 10.5 0.037
    4 2.23 0.77, 7.44 0.2
    5 1.60 0.55, 5.32 0.4
    6 0.70 0.20, 2.60 0.6
partners * age1

0.5
    <=1 * 2 1.06 0.20, 5.62 >0.9
    >=5 * 2 0.26 0.04, 1.62 0.14
    <=1 * 3 0.62 0.08, 4.12 0.6
    >=5 * 3 0.30 0.05, 1.90 0.2
    <=1 * 4 1.62 0.24, 10.5 0.6
    >=5 * 4 0.35 0.05, 2.25 0.3
    <=1 * 5 0.70 0.07, 5.21 0.7
    >=5 * 5 0.41 0.06, 2.65 0.3
    <=1 * 6 5.19 0.77, 35.7 0.088
    >=5 * 6 0.51 0.06, 4.33 0.5
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
Show the code
# Model with interaction
logit_inter <- glm(case ~ partners * age1,
    family = binomial,
    data = mwanza)

# Model without interaction
logit_without <- glm(case ~ partners + age1,
    family = binomial,
    data = mwanza)

# Likelihood ratio test
lmtest::lrtest(logit_without, logit_inter)
#Df LogLik Df Chisq Pr(>Chisq)
8 -385 NA NA NA
18 -380 10 9.4 0.49

Using partners variable in the analysis and conducting the interaction and no interaction model, the data are compatible with null hypothesis of no interaction: LRT gives \(\chi_{10}^2\)=9.43, p=0.49.

Q5

What other possible workarounds can you come up with for the issue identified in 4a.?

One alternative is to treat both npa and age1 as trend terms (ordinal values) for the interaction term

Q6

Estimates of Partners by age group with the interaction

Show the code
library(marginaleffects)
mod <- glm(case ~ partners * age1,
    family = binomial,
    data = mwanza)

knitr::kable(avg_comparisons(mod, variables = "partners",
                by = "age1",
                comparison = "lnor", # do the comparison in Log Odds
                transform = "exp")#back transform to odds ratio
      )
term contrast age1 estimate p.value s.value conf.low conf.high
partners ln(odds(<=1) / odds(2-4)) 1 0.44 0.25 2.02 0.11 1.8
partners ln(odds(>=5) / odds(2-4)) 1 5.12 0.05 4.39 1.02 25.8
partners ln(odds(<=1) / odds(2-4)) 2 0.47 0.08 3.61 0.20 1.1
partners ln(odds(>=5) / odds(2-4)) 2 1.32 0.51 0.97 0.58 3.0
partners ln(odds(<=1) / odds(2-4)) 3 0.27 0.05 4.19 0.07 1.0
partners ln(odds(>=5) / odds(2-4)) 3 1.55 0.31 1.69 0.67 3.6
partners ln(odds(<=1) / odds(2-4)) 4 0.72 0.60 0.74 0.21 2.5
partners ln(odds(>=5) / odds(2-4)) 4 1.77 0.22 2.21 0.72 4.4
partners ln(odds(<=1) / odds(2-4)) 5 0.31 0.14 2.87 0.07 1.4
partners ln(odds(>=5) / odds(2-4)) 5 2.09 0.11 3.14 0.84 5.2
partners ln(odds(<=1) / odds(2-4)) 6 2.29 0.21 2.26 0.63 8.3
partners ln(odds(>=5) / odds(2-4)) 6 2.62 0.18 2.51 0.65 10.6