Code
library(conflicted)
library(dplyr)
library(Fahrmeir)
library(MASS)
library(tidyr)
library(VGAM)Loading required package: stats4
Loading required package: splines
Fahrmeir and Tutz (2001) Examples 3.6 and 3.7
March 12, 2026
This script reproduces Examples 3.6 and 3.7 from Fahrmeir and Tutz (2001). We analyse the relationship between tonsil size of children and their Streptococcus pyogenes carrier status.
The response has three ordered categories:
presentenlargedgreat_enlarged.There is a good physical interpretation of the ordering, where a normal tonsil must first become enlarged before it can become greatly enlarged, and so we are motivated to try a sequential regression model.
Loading required package: stats4
Loading required package: splines
Run ?tonsil for information about the data. This dataset is very simple, since it only contains one predictor. We load it and preview the contents to see how it’s structured.
# A tibble: 6 × 3
Streptococcus.p Size n
<fct> <int> <int>
1 carriers 1 19
2 carriers 2 29
3 carriers 3 24
4 noncarriers 1 497
5 noncarriers 2 560
6 noncarriers 3 269
The tonsil data are in long format: one row per carrier-status and size category. For vglm, we want one row per carrier-status and one column per size count, referred to as wide format. We can reshape using tidyr::pivot_wider(). We also rename the size category to be more descriptive and convert it to a factor with labels.
# A tibble: 2 × 4
Streptococcus.p present enlarged great_enlarged
<fct> <int> <int> <int>
1 carriers 19 29 24
2 noncarriers 497 560 269
mutate(...): creates or updates columns; here it rewrites Size as a labelled factor.factor(...): converts numeric codes (1, 2, 3) into ordered category labels used in output.dplyr::select(Streptococcus.p, Size, n): keeps only the columns needed for reshaping.pivot_wider(names_from = Size, values_from = n): converts long data into wide data, so each size category becomes its own count column.|>: passes the result of one step to the next, making the workflow readable top-to-bottom.As in the other scripts, we want to ensure effect coding (contr.sum) is used for factors. This gives coefficients that are directly comparable to the textbook output.
We first fit a cumulative logit model.
Call:
vglm(formula = cbind(present, enlarged, great_enlarged) ~ Streptococcus.p,
family = cumulative(parallel = TRUE ~ Streptococcus.p - 1),
data = tonsil_wide)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept):1 -0.8098 0.1165 -6.952 3.61e-12 ***
(Intercept):2 1.0614 0.1182 8.983 < 2e-16 ***
Streptococcus.p1 -0.3013 0.1125 -2.677 0.00742 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2])
Residual deviance: 0.3022 on 1 degrees of freedom
Log-likelihood: -11.9142 on 1 degrees of freedom
Number of Fisher scoring iterations: 3
Exponentiated coefficients:
Streptococcus.p1
0.7398377
Here, the parallel = TRUE ~ Streptococcus.p - 1 argument specifies that the effect of Streptococcus.p is the same across all levels, and the - 1 removes the intercept from the model, which sets the intercept to be different, since the intercepts here are exactly the \(\theta\)s.
Note that the \(p\)-values do not line up with those in Table 3.6 of Fahrmeir and Tutz (2001) for a completely unknown reason. If you manage to figure out why, please let me know.
Here we don’t set parallel = TRUE and allow category-specific effects.
Warning in vglm.fitter(x = x, y = y, w = w, offset = offset, Xm2 = Xm2, :
iterations terminated because half-step sizes are very small
Warning in vglm.fitter(x = x, y = y, w = w, offset = offset, Xm2 = Xm2, : some
quantities such as z, residuals, SEs may be inaccurate due to convergence at a
half-step
Call:
vglm(formula = cbind(present, enlarged, great_enlarged) ~ Streptococcus.p,
family = cumulative(), data = tonsil_wide)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept):1 -0.7687 0.1367 -5.625 1.86e-08 ***
(Intercept):2 1.0308 0.1296 7.955 1.79e-15 ***
Streptococcus.p1:1 -0.2571 0.1367 -1.881 0.05994 .
Streptococcus.p1:2 -0.3377 0.1296 -2.606 0.00916 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2])
Residual deviance: -1.392e-13 on 0 degrees of freedom
Log-likelihood: -11.7631 on 0 degrees of freedom
Number of Fisher scoring iterations: 6
Exponentiated coefficients:
Streptococcus.p1:1 Streptococcus.p1:2
0.7732821 0.7134337
Here we fit a sequential model using VGAM::sratio().
HINT: it should have the form: \[\begin{equation*} \Pr(Y = r | Y \geq r, x) = \ldots \end{equation*}\]
Check Fahrmeir and Tutz (2001) page 94 for answer.
Call:
vglm(formula = cbind(present, enlarged, great_enlarged) ~ Streptococcus.p,
family = sratio(parallel = TRUE ~ Streptococcus.p - 1), data = tonsil_wide)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept):1 -0.77525 0.10608 -7.308 2.72e-13 ***
(Intercept):2 0.46795 0.11156 4.195 2.73e-05 ***
Streptococcus.p1 -0.26423 0.09887 -2.672 0.00753 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Names of linear predictors: logitlink(P[Y=1|Y>=1]), logitlink(P[Y=2|Y>=2])
Residual deviance: 0.0057 on 1 degrees of freedom
Log-likelihood: -11.7659 on 1 degrees of freedom
Number of Fisher scoring iterations: 3
Exponentiated coefficients:
Streptococcus.p1
0.7677964
Again, the \(p\)-values do not line up with those in Table 3.6 for an unknown reason, but the deviances do match. We can see that the deviance for the sequential model is much smaller than that of the cumulative model which suggests a much better fit to the data.
Here we fit the same cumulative model using MASS::polr() to verify our \(p\)-values. We convert our data back to long format.
# A tibble: 6 × 3
Streptococcus.p size weight
<fct> <ord> <int>
1 carriers present 19
2 carriers enlarged 29
3 carriers great_enlarged 24
4 noncarriers present 497
5 noncarriers enlarged 560
6 noncarriers great_enlarged 269
pivot_longer(...): turns the wide count columns back into long format (one row per category).cols = c(present, enlarged, great_enlarged): chooses which columns to stack into one variable.names_to = "size": stores the former column names as category labels in a new size column.values_to = "weight": stores the corresponding counts in a new weight column used as case weights.ordered(...): makes size an ordered factor so polr() respects the category order.|>: pipes each intermediate result to the next transformation step.Call:
polr(formula = size ~ Streptococcus.p, data = polr_dat, weights = weight,
Hess = TRUE)
Coefficients:
Value Std. Error t value
Streptococcus.p1 0.3013 0.1137 2.65
Intercepts:
Value Std. Error t value
present|enlarged -0.8098 0.1176 -6.8853
enlarged|great_enlarged 1.0614 0.1193 8.8993
Residual Deviance: 2955.495
AIC: 2961.495
The polr() standard errors agree with the cumulative vglm() fit.
packages <- c("conflicted", "dplyr", "Fahrmeir", "MASS", "tidyr", "VGAM")
for (pkg in packages) {
cit <- citation(pkg)
# Check if multiple citations
if (length(cit) > 1) {
cat("- **", pkg, "**:\n", sep = "")
for (i in seq_along(cit)) {
cit_text <- format(cit[i], style = "text")
cat(" - ", cit_text, "\n\n", sep = "")
}
} else {
cit_text <- format(cit, style = "text")
cat("- **", pkg, "**: ", cit_text, "\n\n", sep = "")
}
}conflicted: Wickham H (2023). conflicted: An Alternative Conflict Resolution Strategy. R package version 1.2.0, https://github.com/r-lib/conflicted, https://conflicted.r-lib.org/.
dplyr: Wickham H, François R, Henry L, Müller K, Vaughan D (2026). dplyr: A Grammar of Data Manipulation. R package version 1.2.0, https://github.com/tidyverse/dplyr, https://dplyr.tidyverse.org.
Fahrmeir: Halvorsen cbKB (2016). Fahrmeir: Data from the Book “Multivariate Statistical Modelling Based on Generalized Linear Models”, First Edition, by Ludwig Fahrmeir and Gerhard Tutz. R package version 2016.5.31.
MASS: Venables WN, Ripley BD (2002). Modern Applied Statistics with S, Fourth edition. Springer, New York. ISBN 0-387-95457-0, https://www.stats.ox.ac.uk/pub/MASS4/.
tidyr: Wickham H, Vaughan D, Girlich M (2025). tidyr: Tidy Messy Data. R package version 1.3.2, https://github.com/tidyverse/tidyr, https://tidyr.tidyverse.org.
VGAM:
Yee TW (2015). Vector Generalized Linear and Additive Models: With an Implementation in R. Springer, New York, USA.
Yee TW, Wild CJ (1996). “Vector Generalized Additive Models.” Journal of Royal Statistical Society, Series B, 58(3), 481-493.
Yee TW (2010). “The VGAM Package for Categorical Data Analysis.” Journal of Statistical Software, 32(10), 1-34. doi:10.18637/jss.v032.i10 https://doi.org/10.18637/jss.v032.i10.
Yee TW, Hadi AF (2014). “Row-column interaction models, with an R implementation.” Computational Statistics, 29(6), 1427-1445.
Yee TW (2025). VGAM: Vector Generalized Linear and Additive Models. R package version 1.1-14, https://CRAN.R-project.org/package=VGAM.
Yee TW (2013). “Two-parameter reduced-rank vector generalized linear models.” Computational Statistics and Data Analysis, 71, 889-902.
Yee TW, Stoklosa J, Huggins RM (2015). “The VGAM Package for Capture-Recapture Data Using the Conditional Likelihood.” Journal of Statistical Software, 65(5), 1-33. doi:10.18637/jss.v065.i05 https://doi.org/10.18637/jss.v065.i05.
Yee TW (2020). “The VGAM package for negative binomial regression.” Australian and New Zealand Journal of Statistics, 62(1), 116-131.
Yee TW (2022). “On the Hauck-Donner effect in Wald tests: Detection, tipping points and parameter space characterization.” Journal of the American Statistical Association, 117(540), 1763-1774.
Yee TW, Ma C (2024). “Generally altered, inflated, truncated and deflated regression.” Statistical Science, 39(4), 568-588.
Yee TW, Frigau L, Ma C (2025). “Heaping and seeping, GAITD regression and doubly constrained reduced-rank vector generalized linear models.” Annals of Applied Statistics, (in press), 1-25.
R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS 26.3.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
time zone: Australia/Melbourne
tzcode source: internal
attached base packages:
[1] splines stats4 stats graphics grDevices datasets utils
[8] methods base
other attached packages:
[1] VGAM_1.1-14 tidyr_1.3.2 MASS_7.3-61 Fahrmeir_2016.5.31
[5] dplyr_1.2.0 conflicted_1.2.0
loaded via a namespace (and not attached):
[1] vctrs_0.7.1 cli_3.6.5 knitr_1.51 rlang_1.1.7
[5] xfun_0.56 otel_0.2.0 purrr_1.2.1 renv_1.1.7
[9] generics_0.1.4 jsonlite_2.0.0 glue_1.8.0 htmltools_0.5.9
[13] rmarkdown_2.30 evaluate_1.0.5 tibble_3.3.1 fastmap_1.2.0
[17] yaml_2.3.12 lifecycle_1.0.5 memoise_2.0.1 compiler_4.4.2
[21] pkgconfig_2.0.3 digest_0.6.39 R6_2.6.1 utf8_1.2.6
[25] tidyselect_1.2.1 pillar_1.11.1 magrittr_2.0.4 withr_3.0.2
[29] tools_4.4.2 cachem_1.1.0