Sequential Regression: Tonsil Size and Streptococcus

Fahrmeir and Tutz (2001) Examples 3.6 and 3.7

Published

March 12, 2026

Overview

This script reproduces Examples 3.6 and 3.7 from Fahrmeir and Tutz (2001). We analyse the relationship between tonsil size of children and their Streptococcus pyogenes carrier status.

The response has three ordered categories:

  1. present
  2. enlarged
  3. great_enlarged.

There is a good physical interpretation of the ordering, where a normal tonsil must first become enlarged before it can become greatly enlarged, and so we are motivated to try a sequential regression model.

Data and Packages

Code
library(conflicted)
library(dplyr)
library(Fahrmeir)
library(MASS)
library(tidyr)
library(VGAM)
Loading required package: stats4
Loading required package: splines

Run ?tonsil for information about the data. This dataset is very simple, since it only contains one predictor. We load it and preview the contents to see how it’s structured.

Code
data(tonsil)
tonsil_tbl <- as_tibble(tonsil)
tonsil_tbl
# A tibble: 6 × 3
  Streptococcus.p  Size     n
  <fct>           <int> <int>
1 carriers            1    19
2 carriers            2    29
3 carriers            3    24
4 noncarriers         1   497
5 noncarriers         2   560
6 noncarriers         3   269

Data Reshaping

The tonsil data are in long format: one row per carrier-status and size category. For vglm, we want one row per carrier-status and one column per size count, referred to as wide format. We can reshape using tidyr::pivot_wider(). We also rename the size category to be more descriptive and convert it to a factor with labels.

Code
tonsil_wide <- tonsil_tbl |>
    mutate(
        Size = factor(
            Size,
            levels = c(1, 2, 3),
            labels = c("present", "enlarged", "great_enlarged")
        )
    ) |>
    dplyr::select(Streptococcus.p, Size, n) |>
    pivot_wider(names_from = Size, values_from = n)

tonsil_wide
# A tibble: 2 × 4
  Streptococcus.p present enlarged great_enlarged
  <fct>             <int>    <int>          <int>
1 carriers             19       29             24
2 noncarriers         497      560            269
  • mutate(...): creates or updates columns; here it rewrites Size as a labelled factor.
  • factor(...): converts numeric codes (1, 2, 3) into ordered category labels used in output.
  • dplyr::select(Streptococcus.p, Size, n): keeps only the columns needed for reshaping.
  • pivot_wider(names_from = Size, values_from = n): converts long data into wide data, so each size category becomes its own count column.
  • |>: passes the result of one step to the next, making the workflow readable top-to-bottom.

Understanding Factor Coding

As in the other scripts, we want to ensure effect coding (contr.sum) is used for factors. This gives coefficients that are directly comparable to the textbook output.

Code
contrasts(tonsil_wide$Streptococcus.p)
            noncarriers
carriers              0
noncarriers           1
Code
options(contrasts = c("contr.sum", "contr.poly"))
contrasts(tonsil_wide$Streptococcus.p)
            [,1]
carriers       1
noncarriers   -1

Model 2: Extended Cumulative Model

Here we don’t set parallel = TRUE and allow category-specific effects.

Code
fit_cumulative_extended <- vglm(
    cbind(present, enlarged, great_enlarged) ~ Streptococcus.p,
    family = cumulative(),
    data = tonsil_wide
)
Warning in vglm.fitter(x = x, y = y, w = w, offset = offset, Xm2 = Xm2, :
iterations terminated because half-step sizes are very small
Warning in vglm.fitter(x = x, y = y, w = w, offset = offset, Xm2 = Xm2, : some
quantities such as z, residuals, SEs may be inaccurate due to convergence at a
half-step
Code
summary(fit_cumulative_extended)

Call:
vglm(formula = cbind(present, enlarged, great_enlarged) ~ Streptococcus.p, 
    family = cumulative(), data = tonsil_wide)

Coefficients: 
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept):1       -0.7687     0.1367  -5.625 1.86e-08 ***
(Intercept):2        1.0308     0.1296   7.955 1.79e-15 ***
Streptococcus.p1:1  -0.2571     0.1367  -1.881  0.05994 .  
Streptococcus.p1:2  -0.3377     0.1296  -2.606  0.00916 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2])

Residual deviance: -1.392e-13 on 0 degrees of freedom

Log-likelihood: -11.7631 on 0 degrees of freedom

Number of Fisher scoring iterations: 6 


Exponentiated coefficients:
Streptococcus.p1:1 Streptococcus.p1:2 
         0.7732821          0.7134337 

Model 3: Sequential Model

Here we fit a sequential model using VGAM::sratio().

HINT: it should have the form: \[\begin{equation*} \Pr(Y = r | Y \geq r, x) = \ldots \end{equation*}\]

Check Fahrmeir and Tutz (2001) page 94 for answer.

Code
fit_sequential <- vglm(
    cbind(present, enlarged, great_enlarged) ~ Streptococcus.p,
    family = sratio(parallel = TRUE ~ Streptococcus.p - 1),
    data = tonsil_wide
)

summary(fit_sequential)

Call:
vglm(formula = cbind(present, enlarged, great_enlarged) ~ Streptococcus.p, 
    family = sratio(parallel = TRUE ~ Streptococcus.p - 1), data = tonsil_wide)

Coefficients: 
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept):1    -0.77525    0.10608  -7.308 2.72e-13 ***
(Intercept):2     0.46795    0.11156   4.195 2.73e-05 ***
Streptococcus.p1 -0.26423    0.09887  -2.672  0.00753 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Names of linear predictors: logitlink(P[Y=1|Y>=1]), logitlink(P[Y=2|Y>=2])

Residual deviance: 0.0057 on 1 degrees of freedom

Log-likelihood: -11.7659 on 1 degrees of freedom

Number of Fisher scoring iterations: 3 


Exponentiated coefficients:
Streptococcus.p1 
       0.7677964 

Again, the \(p\)-values do not line up with those in Table 3.6 for an unknown reason, but the deviances do match. We can see that the deviance for the sequential model is much smaller than that of the cumulative model which suggests a much better fit to the data.

Here we fit the same cumulative model using MASS::polr() to verify our \(p\)-values. We convert our data back to long format.

Code
polr_dat <- tonsil_wide |>
    pivot_longer(
        cols = c(present, enlarged, great_enlarged),
        names_to = "size",
        values_to = "weight"
    ) |>
    mutate(
        size = ordered(size, levels = c("present", "enlarged", "great_enlarged"))
    )

polr_dat
# A tibble: 6 × 3
  Streptococcus.p size           weight
  <fct>           <ord>           <int>
1 carriers        present            19
2 carriers        enlarged           29
3 carriers        great_enlarged     24
4 noncarriers     present           497
5 noncarriers     enlarged          560
6 noncarriers     great_enlarged    269
  • pivot_longer(...): turns the wide count columns back into long format (one row per category).
  • cols = c(present, enlarged, great_enlarged): chooses which columns to stack into one variable.
  • names_to = "size": stores the former column names as category labels in a new size column.
  • values_to = "weight": stores the corresponding counts in a new weight column used as case weights.
  • ordered(...): makes size an ordered factor so polr() respects the category order.
  • |>: pipes each intermediate result to the next transformation step.
Code
polr_obj <- polr(
    formula = size ~ Streptococcus.p,
    data = polr_dat,
    weights = weight,
    Hess = TRUE
)

summary(polr_obj)
Call:
polr(formula = size ~ Streptococcus.p, data = polr_dat, weights = weight, 
    Hess = TRUE)

Coefficients:
                  Value Std. Error t value
Streptococcus.p1 0.3013     0.1137    2.65

Intercepts:
                        Value   Std. Error t value
present|enlarged        -0.8098  0.1176    -6.8853
enlarged|great_enlarged  1.0614  0.1193     8.8993

Residual Deviance: 2955.495 
AIC: 2961.495 

The polr() standard errors agree with the cumulative vglm() fit.

Reproducibility

Code
packages <- c("conflicted", "dplyr", "Fahrmeir", "MASS", "tidyr", "VGAM")
for (pkg in packages) {
    cit <- citation(pkg)

    # Check if multiple citations
    if (length(cit) > 1) {
        cat("- **", pkg, "**:\n", sep = "")
        for (i in seq_along(cit)) {
            cit_text <- format(cit[i], style = "text")
            cat("    - ", cit_text, "\n\n", sep = "")
        }
    } else {
        cit_text <- format(cit, style = "text")
        cat("- **", pkg, "**: ", cit_text, "\n\n", sep = "")
    }
}
  • conflicted: Wickham H (2023). conflicted: An Alternative Conflict Resolution Strategy. R package version 1.2.0, https://github.com/r-lib/conflicted, https://conflicted.r-lib.org/.

  • dplyr: Wickham H, François R, Henry L, Müller K, Vaughan D (2026). dplyr: A Grammar of Data Manipulation. R package version 1.2.0, https://github.com/tidyverse/dplyr, https://dplyr.tidyverse.org.

  • Fahrmeir: Halvorsen cbKB (2016). Fahrmeir: Data from the Book “Multivariate Statistical Modelling Based on Generalized Linear Models”, First Edition, by Ludwig Fahrmeir and Gerhard Tutz. R package version 2016.5.31.

  • MASS: Venables WN, Ripley BD (2002). Modern Applied Statistics with S, Fourth edition. Springer, New York. ISBN 0-387-95457-0, https://www.stats.ox.ac.uk/pub/MASS4/.

  • tidyr: Wickham H, Vaughan D, Girlich M (2025). tidyr: Tidy Messy Data. R package version 1.3.2, https://github.com/tidyverse/tidyr, https://tidyr.tidyverse.org.

  • VGAM:

    • Yee TW (2015). Vector Generalized Linear and Additive Models: With an Implementation in R. Springer, New York, USA.

    • Yee TW, Wild CJ (1996). “Vector Generalized Additive Models.” Journal of Royal Statistical Society, Series B, 58(3), 481-493.

    • Yee TW (2010). “The VGAM Package for Categorical Data Analysis.” Journal of Statistical Software, 32(10), 1-34. doi:10.18637/jss.v032.i10 https://doi.org/10.18637/jss.v032.i10.

    • Yee TW, Hadi AF (2014). “Row-column interaction models, with an R implementation.” Computational Statistics, 29(6), 1427-1445.

    • Yee TW (2025). VGAM: Vector Generalized Linear and Additive Models. R package version 1.1-14, https://CRAN.R-project.org/package=VGAM.

    • Yee TW (2013). “Two-parameter reduced-rank vector generalized linear models.” Computational Statistics and Data Analysis, 71, 889-902.

    • Yee TW, Stoklosa J, Huggins RM (2015). “The VGAM Package for Capture-Recapture Data Using the Conditional Likelihood.” Journal of Statistical Software, 65(5), 1-33. doi:10.18637/jss.v065.i05 https://doi.org/10.18637/jss.v065.i05.

    • Yee TW (2020). “The VGAM package for negative binomial regression.” Australian and New Zealand Journal of Statistics, 62(1), 116-131.

    • Yee TW (2022). “On the Hauck-Donner effect in Wald tests: Detection, tipping points and parameter space characterization.” Journal of the American Statistical Association, 117(540), 1763-1774.

    • Yee TW, Ma C (2024). “Generally altered, inflated, truncated and deflated regression.” Statistical Science, 39(4), 568-588.

    • Yee TW, Frigau L, Ma C (2025). “Heaping and seeping, GAITD regression and doubly constrained reduced-rank vector generalized linear models.” Annals of Applied Statistics, (in press), 1-25.

Code
sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS 26.3.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

time zone: Australia/Melbourne
tzcode source: internal

attached base packages:
[1] splines   stats4    stats     graphics  grDevices datasets  utils    
[8] methods   base     

other attached packages:
[1] VGAM_1.1-14        tidyr_1.3.2        MASS_7.3-61        Fahrmeir_2016.5.31
[5] dplyr_1.2.0        conflicted_1.2.0  

loaded via a namespace (and not attached):
 [1] vctrs_0.7.1      cli_3.6.5        knitr_1.51       rlang_1.1.7     
 [5] xfun_0.56        otel_0.2.0       purrr_1.2.1      renv_1.1.7      
 [9] generics_0.1.4   jsonlite_2.0.0   glue_1.8.0       htmltools_0.5.9 
[13] rmarkdown_2.30   evaluate_1.0.5   tibble_3.3.1     fastmap_1.2.0   
[17] yaml_2.3.12      lifecycle_1.0.5  memoise_2.0.1    compiler_4.4.2  
[21] pkgconfig_2.0.3  digest_0.6.39    R6_2.6.1         utf8_1.2.6      
[25] tidyselect_1.2.1 pillar_1.11.1    magrittr_2.0.4   withr_3.0.2     
[29] tools_4.4.2      cachem_1.1.0    

References

Fahrmeir, Ludwig, and Gerhard Tutz. 2001. Multivariate Statistical Modelling Based on Generalized Linear Models. Second Edition. Springer Series in Statistics. New York, NY: Springer. https://doi.org/10.1007/978-1-4757-3454-6.