Sequential Regression: Tonsil Size and Streptococcus

Fahrmeir and Tutz (2001) Examples 3.6 and 3.7

Published

April 15, 2026

Overview

This script reproduces Examples 3.6 and 3.7 from Fahrmeir and Tutz (2001). We analyse the relationship between tonsil size of children and their Streptococcus pyogenes carrier status.

The response has three ordered categories:

present
enlarged
great_enlarged.

There is a good physical interpretation of the ordering, where a normal tonsil must first become enlarged before it can become greatly enlarged, and so we are motivated to try a sequential regression model.

Data and Packages

Code

library(conflicted)
library(dplyr)
library(Fahrmeir)
library(MASS)
library(tidyr)
library(VGAM)

Loading required package: stats4

Loading required package: splines

Run ?tonsil for information about the data. This dataset is very simple, since it only contains one predictor. We load it and preview the contents to see how it’s structured.

Code

data(tonsil)
tonsil_tbl <- as_tibble(tonsil)
tonsil_tbl

# A tibble: 6 × 3
  Streptococcus.p  Size     n
  <fct>           <int> <int>
1 carriers            1    19
2 carriers            2    29
3 carriers            3    24
4 noncarriers         1   497
5 noncarriers         2   560
6 noncarriers         3   269

Data Reshaping

The tonsil data are in long format: one row per carrier-status and size category. For vglm, we want one row per carrier-status and one column per size count, referred to as wide format. We can reshape using tidyr::pivot_wider(). We also rename the size category to be more descriptive and convert it to a factor with labels.

Code

tonsil_wide <- tonsil_tbl |>
  mutate(
    Size = factor(
      Size,
      levels = c(1, 2, 3),
      labels = c("present", "enlarged", "great_enlarged")
    )
  ) |>
  dplyr::select(Streptococcus.p, Size, n) |>
  pivot_wider(names_from = Size, values_from = n)

tonsil_wide

# A tibble: 2 × 4
  Streptococcus.p present enlarged great_enlarged
  <fct>             <int>    <int>          <int>
1 carriers             19       29             24
2 noncarriers         497      560            269

What each step is doing

dplyr::mutate(...): creates or updates columns; here it rewrites Size as a labelled factor.
factor(...): converts numeric codes (1, 2, 3) into ordered category labels used in output.
dplyr::select(Streptococcus.p, Size, n): keeps only the columns needed for reshaping; in this case we want all three.
tidyr::pivot_wider(names_from = Size, values_from = n): converts long data into wide data, so each size category becomes its own count column. The name_from argument says we want the content of the Size column to become our column names, and the values_from argument says we want the content of the n column to become the cell values.
|>: passes the result of one step to the next, making the workflow readable top-to-bottom.

Understanding Factor Coding

As in the other scripts, we want to ensure effect coding (contr.sum) is used for factors. This gives coefficients that are directly comparable to the textbook output on p.97.

Code

contrasts(tonsil_wide$Streptococcus.p)

            noncarriers
carriers              0
noncarriers           1

Code

options(contrasts = c("contr.sum", "contr.poly"))
contrasts(tonsil_wide$Streptococcus.p)

            [,1]
carriers       1
noncarriers   -1

Model 1: Cumulative Link Model (Proportional Odds)

We first fit a cumulative logit model.

Code

fit_cumulative <- vglm(
  cbind(present, enlarged, great_enlarged) ~ Streptococcus.p,
  family = cumulative(parallel = TRUE ~ Streptococcus.p - 1),
  data = tonsil_wide
)

summary(fit_cumulative)


Call:
vglm(formula = cbind(present, enlarged, great_enlarged) ~ Streptococcus.p, 
    family = cumulative(parallel = TRUE ~ Streptococcus.p - 1), 
    data = tonsil_wide)

Coefficients: 
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept):1     -0.8098     0.1165  -6.952 3.61e-12 ***
(Intercept):2      1.0614     0.1182   8.983  < 2e-16 ***
Streptococcus.p1  -0.3013     0.1125  -2.677  0.00742 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2])

Residual deviance: 0.3022 on 1 degrees of freedom

Log-likelihood: -11.9142 on 1 degrees of freedom

Number of Fisher scoring iterations: 3 


Exponentiated coefficients:
Streptococcus.p1 
       0.7398377

Caution

Here, the parallel = TRUE ~ Streptococcus.p - 1 argument specifies that the effect of Streptococcus.p is the same across all levels, and the - 1 removes the intercept from the model, which sets the intercept to be different, since the intercepts here are exactly the \(\theta\)s.

Note that the \(p\)-values do not line up with those in Table 3.6 of Fahrmeir and Tutz (2001) for a completely unknown reason. If you manage to figure out why, please let me know.

Model 2: Extended Cumulative Model

Here we don’t set parallel = TRUE and allow category-specific effects.

Code

fit_cumulative_extended <- vglm(
  cbind(present, enlarged, great_enlarged) ~ Streptococcus.p,
  family = cumulative(),
  data = tonsil_wide
)

Warning in vglm.fitter(x = x, y = y, w = w, offset = offset, Xm2 = Xm2, :
iterations terminated because half-step sizes are very small

Warning in vglm.fitter(x = x, y = y, w = w, offset = offset, Xm2 = Xm2, : some
quantities such as z, residuals, SEs may be inaccurate due to convergence at a
half-step

Code

summary(fit_cumulative_extended)


Call:
vglm(formula = cbind(present, enlarged, great_enlarged) ~ Streptococcus.p, 
    family = cumulative(), data = tonsil_wide)

Coefficients: 
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept):1       -0.7687     0.1367  -5.625 1.86e-08 ***
(Intercept):2        1.0308     0.1296   7.955 1.79e-15 ***
Streptococcus.p1:1  -0.2571     0.1367  -1.881  0.05994 .  
Streptococcus.p1:2  -0.3377     0.1296  -2.606  0.00916 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2])

Residual deviance: -1.472e-13 on 0 degrees of freedom

Log-likelihood: -11.7631 on 0 degrees of freedom

Number of Fisher scoring iterations: 5 


Exponentiated coefficients:
Streptococcus.p1:1 Streptococcus.p1:2 
         0.7732821          0.7134337

Model 3: Sequential Model

Here we fit a sequential model using VGAM::sratio().

Make sure you can write down the mathematical form of the model. Expand for hint.

HINT: it should have the form: \[\begin{equation*} \Pr(Y = r | Y \geq r, x) = \ldots \end{equation*}\]

Check Fahrmeir and Tutz (2001) page 94 for answer.

Code

fit_sequential <- vglm(
  cbind(present, enlarged, great_enlarged) ~ Streptococcus.p,
  family = sratio(parallel = TRUE ~ Streptococcus.p - 1),
  data = tonsil_wide
)

summary(fit_sequential)


Call:
vglm(formula = cbind(present, enlarged, great_enlarged) ~ Streptococcus.p, 
    family = sratio(parallel = TRUE ~ Streptococcus.p - 1), data = tonsil_wide)

Coefficients: 
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept):1    -0.77525    0.10608  -7.308 2.72e-13 ***
(Intercept):2     0.46795    0.11156   4.195 2.73e-05 ***
Streptococcus.p1 -0.26423    0.09887  -2.672  0.00753 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Names of linear predictors: logitlink(P[Y=1|Y>=1]), logitlink(P[Y=2|Y>=2])

Residual deviance: 0.0057 on 1 degrees of freedom

Log-likelihood: -11.7659 on 1 degrees of freedom

Number of Fisher scoring iterations: 3 


Exponentiated coefficients:
Streptococcus.p1 
       0.7677964

Again, the \(p\)-values do not line up with those in Table 3.6 for an unknown reason, but the deviances do match. We can see that the deviance for the sequential model is much smaller than that of the cumulative model which suggests a much better fit to the data.

Alternative: MASS::polr() Implementation

Here we fit the same cumulative model using MASS::polr() to verify our \(p\)-values. We convert our data back to long format.

Code

polr_dat <- tonsil_wide |>
  pivot_longer(
    cols = c(present, enlarged, great_enlarged),
    names_to = "size",
    values_to = "weight"
  ) |>
  mutate(
    size = ordered(size, levels = c("present", "enlarged", "great_enlarged"))
  )

polr_dat

# A tibble: 6 × 3
  Streptococcus.p size           weight
  <fct>           <ord>           <int>
1 carriers        present            19
2 carriers        enlarged           29
3 carriers        great_enlarged     24
4 noncarriers     present           497
5 noncarriers     enlarged          560
6 noncarriers     great_enlarged    269

What each step is doing

tidyr::pivot_longer(...): turns the wide count columns back into long format (one row per category).
cols = c(present, enlarged, great_enlarged): chooses which columns to stack into one variable.
names_to = "size": stores the former column names as category labels in a new size column.
values_to = "weight": stores the corresponding counts in a new weight column used as case weights.
ordered(...): makes size an ordered factor so MASS::polr() respects the category order.
|>: pipes each intermediate result to the next transformation step.

Code

polr_obj <- polr(
  formula = size ~ Streptococcus.p,
  data = polr_dat,
  weights = weight,
  Hess = TRUE
)

summary(polr_obj)

Call:
polr(formula = size ~ Streptococcus.p, data = polr_dat, weights = weight, 
    Hess = TRUE)

Coefficients:
                  Value Std. Error t value
Streptococcus.p1 0.3013     0.1137    2.65

Intercepts:
                        Value   Std. Error t value
present|enlarged        -0.8098  0.1176    -6.8853
enlarged|great_enlarged  1.0614  0.1193     8.8993

Residual Deviance: 2955.495 
AIC: 2961.495

The MASS::polr() standard errors agree with the cumulative VGAM::vglm() fit.

Package Citations and Session Information

Session Information

Code

sessionInfo()

R version 4.4.2 (2024-10-31)
Platform: x86_64-apple-darwin20
Running under: macOS Sequoia 15.7.7

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Australia/Melbourne
tzcode source: internal

attached base packages:
[1] splines   stats4    stats     graphics  grDevices datasets  utils    
[8] methods   base     

other attached packages:
[1] VGAM_1.1-14        tidyr_1.3.2        MASS_7.3-61        Fahrmeir_2016.5.31
[5] dplyr_1.2.0        conflicted_1.2.0  

loaded via a namespace (and not attached):
 [1] vctrs_0.7.1       cli_3.6.5         knitr_1.51        rlang_1.1.7      
 [5] xfun_0.56         purrr_1.2.1       renv_1.1.7        generics_0.1.4   
 [9] jsonlite_2.0.0    glue_1.8.0        htmltools_0.5.9   rmarkdown_2.30   
[13] evaluate_1.0.5    tibble_3.3.1      fastmap_1.2.0     yaml_2.3.12      
[17] lifecycle_1.0.5   memoise_2.0.1     compiler_4.4.2    pkgconfig_2.0.3  
[21] rstudioapi_0.18.0 digest_0.6.39     R6_2.6.1          utf8_1.2.6       
[25] tidyselect_1.2.1  pillar_1.11.1     magrittr_2.0.4    withr_3.0.2      
[29] tools_4.4.2       cachem_1.1.0

Code

packages <- c("conflicted", "dplyr", "Fahrmeir", "MASS", "tidyr", "VGAM")
for (pkg in packages) {
  cit <- citation(pkg)

  # Check if multiple citations
  if (length(cit) > 1) {
    cat("- **", pkg, "**:\n", sep = "")
    for (i in seq_along(cit)) {
      cit_text <- format(cit[i], style = "text")
      cat("    - ", cit_text, "\n\n", sep = "")
    }
  } else {
    cit_text <- format(cit, style = "text")
    cat("- **", pkg, "**: ", cit_text, "\n\n", sep = "")
  }
}

conflicted: Wickham H (2023). conflicted: An Alternative Conflict Resolution Strategy. R package version 1.2.0, https://github.com/r-lib/conflicted, https://conflicted.r-lib.org/.
dplyr: Wickham H, François R, Henry L, Müller K, Vaughan D (2026). dplyr: A Grammar of Data Manipulation. R package version 1.2.0, https://github.com/tidyverse/dplyr, https://dplyr.tidyverse.org.
Fahrmeir: Halvorsen cbKB (2016). Fahrmeir: Data from the Book “Multivariate Statistical Modelling Based on Generalized Linear Models”, First Edition, by Ludwig Fahrmeir and Gerhard Tutz. R package version 2016.5.31.
MASS: Venables WN, Ripley BD (2002). Modern Applied Statistics with S, Fourth edition. Springer, New York. ISBN 0-387-95457-0, https://www.stats.ox.ac.uk/pub/MASS4/.
tidyr: Wickham H, Vaughan D, Girlich M (2025). tidyr: Tidy Messy Data. R package version 1.3.2, https://github.com/tidyverse/tidyr, https://tidyr.tidyverse.org.
VGAM:
- Yee TW (2015). Vector Generalized Linear and Additive Models: With an Implementation in R. Springer, New York, USA.
- Yee TW, Wild CJ (1996). “Vector Generalized Additive Models.” Journal of Royal Statistical Society, Series B, 58(3), 481-493.
- Yee TW (2010). “The VGAM Package for Categorical Data Analysis.” Journal of Statistical Software, 32(10), 1-34. doi:10.18637/jss.v032.i10 https://doi.org/10.18637/jss.v032.i10.
- Yee TW, Hadi AF (2014). “Row-column interaction models, with an R implementation.” Computational Statistics, 29(6), 1427-1445.
- Yee TW (2025). VGAM: Vector Generalized Linear and Additive Models. R package version 1.1-14, https://CRAN.R-project.org/package=VGAM.
- Yee TW (2013). “Two-parameter reduced-rank vector generalized linear models.” Computational Statistics and Data Analysis, 71, 889-902.
- Yee TW, Stoklosa J, Huggins RM (2015). “The VGAM Package for Capture-Recapture Data Using the Conditional Likelihood.” Journal of Statistical Software, 65(5), 1-33. doi:10.18637/jss.v065.i05 https://doi.org/10.18637/jss.v065.i05.
- Yee TW (2020). “The VGAM package for negative binomial regression.” Australian and New Zealand Journal of Statistics, 62(1), 116-131.
- Yee TW (2022). “On the Hauck-Donner effect in Wald tests: Detection, tipping points and parameter space characterization.” Journal of the American Statistical Association, 117(540), 1763-1774.
- Yee TW, Ma C (2024). “Generally altered, inflated, truncated and deflated regression.” Statistical Science, 39(4), 568-588.
- Yee TW, Frigau L, Ma C (2025). “Heaping and seeping, GAITD regression and doubly constrained reduced-rank vector generalized linear models.” Annals of Applied Statistics, (in press), 1-25.

Code

sessionInfo()

R version 4.4.2 (2024-10-31)
Platform: x86_64-apple-darwin20
Running under: macOS Sequoia 15.7.7

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Australia/Melbourne
tzcode source: internal

attached base packages:
[1] splines   stats4    stats     graphics  grDevices datasets  utils    
[8] methods   base     

other attached packages:
[1] VGAM_1.1-14        tidyr_1.3.2        MASS_7.3-61        Fahrmeir_2016.5.31
[5] dplyr_1.2.0        conflicted_1.2.0  

loaded via a namespace (and not attached):
 [1] vctrs_0.7.1       cli_3.6.5         knitr_1.51        rlang_1.1.7      
 [5] xfun_0.56         purrr_1.2.1       renv_1.1.7        generics_0.1.4   
 [9] jsonlite_2.0.0    glue_1.8.0        htmltools_0.5.9   rmarkdown_2.30   
[13] evaluate_1.0.5    tibble_3.3.1      fastmap_1.2.0     yaml_2.3.12      
[17] lifecycle_1.0.5   memoise_2.0.1     compiler_4.4.2    pkgconfig_2.0.3  
[21] rstudioapi_0.18.0 digest_0.6.39     R6_2.6.1          utf8_1.2.6       
[25] tidyselect_1.2.1  pillar_1.11.1     magrittr_2.0.4    withr_3.0.2      
[29] tools_4.4.2       cachem_1.1.0

References

Fahrmeir, Ludwig, and Gerhard Tutz. 2001. Multivariate Statistical Modelling Based on Generalized Linear Models. Second Edition. Springer Series in Statistics. Springer. https://doi.org/10.1007/978-1-4757-3454-6.