Code
library(conflicted)
library(gee)
library(geepack)
library(dplyr)Fahrmeir and Tutz (2001) Example 6.4
April 15, 2026
In this example, we reproduce Example 6.4 from Fahrmeir and Tutz (2001). The dataset is a study of the effect of maternal smoking status and age on respiratory infection in children. The children were examined yearly from ages 7 to 10.
Previously we have mentioned that the geepack implements its geeglm function differently to the gee function in the gee package. Here we load it simply to access the ohio dataset.
As always it is a good practice to inspect the help page for the dataset before proceeding by running ?ohio.
We convert the dataset to a tibble, and inspect it.
# A tibble: 2,148 × 4
resp id age smoke
<int> <int> <int> <int>
1 0 0 -2 0
2 0 0 -1 0
3 0 0 0 0
4 0 0 1 0
5 0 1 -2 0
6 0 1 -1 0
7 0 1 0 0
8 0 1 1 0
9 0 2 -2 0
10 0 2 -1 0
# ℹ 2,138 more rows
The age and smoke variables are currently integers, but should instead be factors. We convert them to factors before fitting models.
# A tibble: 2,148 × 4
resp id age smoke
<int> <int> <fct> <fct>
1 0 0 -2 0
2 0 0 -1 0
3 0 0 0 0
4 0 0 1 0
5 0 1 -2 0
6 0 1 -1 0
7 0 1 0 0
8 0 1 1 0
9 0 2 -2 0
10 0 2 -1 0
# ℹ 2,138 more rows
The age coding in the dataset is:
As in the textbook, we need effect coding for unordered factors.
unordered ordered
"contr.treatment" "contr.poly"
-1 0 1
-2 0 0 0
-1 1 0 0
0 0 1 0
1 0 0 1
1
0 0
1 1
It is currently set to dummy coding, so we change it to effect coding and verify:
[,1] [,2] [,3]
-2 1 0 0
-1 0 1 0
0 0 0 1
1 -1 -1 -1
[,1]
0 1
1 -1
10 years old is already the reference level for the age factor, but we need to explicitly set the reference level for smoke to match the textbook’s.
We fit a marginal logit model under three working correlation structures: independent, exchangeable, and unspecified.
Let
The marginal model is a logit model, and so \[\begin{equation} \begin{split} \log \frac{P(\text{infection})}{P(\text{no infection})} &= \beta_0 + \beta_\text{S} x_\text{S} + \beta_\text{Age7} x_\text{Age7} + \beta_\text{Age8} x_\text{Age8} + \beta_\text{Age9} x_\text{Age9}\\ &\quad + \beta_\text{S,Age7} x_\text{S} x_\text{Age7} + \beta_\text{S,Age8} x_\text{S} x_\text{Age8} + \beta_\text{S,Age9} x_\text{S} x_\text{Age9} \end{split} \end{equation}\] where \(x_\text{S} = 1\) if the mother smokes, and \(-1\) otherwise. Likewise, \(x_{\text{Age}i} = 1\) if the child is \(i\) years old, and \(0\) otherwise.
Let \(I_{jm}\) be the indicator random variable for whether child \(j\) is infected at age \(m\). For the independent working correlation, \[\mathrm{corr}(I_{jm}, I_{jn}) = 0\] for all \(m \neq n\). For the exchangeable working correlation, \[\mathrm{corr}(I_{jm}, I_{jn}) = \alpha\] for all \(m \neq n\). For the unstructured working correlation, \[\mathrm{corr}(I_{jm}, I_{jn}) = \alpha_{mn}\] for all \(m \neq n\).
We first fit the model under the independence working correlation.
Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
running glm to get initial regression estimate
(Intercept) smoke1 age1 age2 age3 smoke1:age1
-1.69656029 0.13622036 0.05951185 0.15681009 0.06641432 -0.11504072
smoke1:age2 smoke1:age3
0.06987921 0.02539315
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) -1.697 0.062 -27.188 0.090 -18.907
smoke1 0.136 0.062 2.183 0.090 1.518
age1 0.060 0.107 0.557 0.088 0.677
age2 0.157 0.104 1.509 0.081 1.939
age3 0.066 0.106 0.627 0.082 0.810
smoke1:age1 -0.115 0.107 -1.077 0.088 -1.308
smoke1:age2 0.070 0.104 0.673 0.081 0.864
smoke1:age3 0.025 0.106 0.240 0.082 0.310
id and corstr do in gee::gee()
id identifies which rows belong to the same subject/cluster. In this dataset, repeated observations from the same child share the same id.corstr specifies the working correlation pattern within each id cluster.Allowed corstr values in gee include:
"independence": assumes within-subject correlation is 0"fixed": user-supplied fixed working correlation"stat_M_dep" / "non_\text{S}tat_M_dep": not covered in this course"exchangeable": one common correlation for all within subject pairs"AR-M": autoregressive structure Dennis can delete explaination, don’t think this is used"unstructured": each pairwise correlation can differFor argument details and optional tuning parameters, run ?gee.
These values replicate the values in Table 6.8 of Fahrmeir and Tutz (2001) (up to numerical rounding).
We also try the exchangeable and unstructured working correlations.
Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
running glm to get initial regression estimate
(Intercept) smoke1 age1 age2 age3 smoke1:age1
-1.69656029 0.13622036 0.05951185 0.15681009 0.06641432 -0.11504072
smoke1:age2 smoke1:age3
0.06987921 0.02539315
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) -1.697 0.090 -18.947 0.090 -18.907
smoke1 0.136 0.090 1.521 0.090 1.518
age1 0.060 0.086 0.693 0.088 0.677
age2 0.157 0.084 1.877 0.081 1.939
age3 0.066 0.085 0.780 0.082 0.810
smoke1:age1 -0.115 0.086 -1.340 0.088 -1.308
smoke1:age2 0.070 0.084 0.836 0.081 0.864
smoke1:age3 0.025 0.085 0.298 0.082 0.310
Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
running glm to get initial regression estimate
(Intercept) smoke1 age1 age2 age3 smoke1:age1
-1.69656029 0.13622036 0.05951185 0.15681009 0.06641432 -0.11504072
smoke1:age2 smoke1:age3
0.06987921 0.02539315
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) -1.697 0.089 -18.956 0.090 -18.907
smoke1 0.136 0.089 1.522 0.090 1.518
age1 0.060 0.089 0.666 0.088 0.677
age2 0.157 0.081 1.928 0.081 1.939
age3 0.066 0.082 0.806 0.082 0.810
smoke1:age1 -0.115 0.089 -1.288 0.088 -1.308
smoke1:age2 0.070 0.081 0.859 0.081 0.864
smoke1:age3 0.025 0.082 0.308 0.082 0.310
Dennis: Without looking into this, I find this behaviour very weird
We can also extract the robust variance matrices under each working correlation structure.
(Intercept) smoke1 age1 age2 age3 smoke1:age1
(Intercept) 0.008052 0.001856 -0.000403 -0.000424 -0.000080 0.000418
smoke1 0.001856 0.008052 0.000418 -0.000282 -0.000127 -0.000403
age1 -0.000403 0.000418 0.007736 -0.002043 -0.002831 0.001660
age2 -0.000424 -0.000282 -0.002043 0.006542 -0.001200 -0.000575
age3 -0.000080 -0.000127 -0.002831 -0.001200 0.006730 -0.000633
smoke1:age1 0.000418 -0.000403 0.001660 -0.000575 -0.000633 0.007736
smoke1:age2 -0.000282 -0.000424 -0.000575 0.001162 0.000008 -0.002043
smoke1:age3 -0.000127 -0.000080 -0.000633 0.000008 0.001353 -0.002831
smoke1:age2 smoke1:age3
(Intercept) -0.000282 -0.000127
smoke1 -0.000424 -0.000080
age1 -0.000575 -0.000633
age2 0.001162 0.000008
age3 0.000008 0.001353
smoke1:age1 -0.002043 -0.002831
smoke1:age2 0.006542 -0.001200
smoke1:age3 -0.001200 0.006730
(Intercept) smoke1 age1 age2 age3 smoke1:age1
(Intercept) 0.008052 0.001856 -0.000403 -0.000424 -0.000080 0.000418
smoke1 0.001856 0.008052 0.000418 -0.000282 -0.000127 -0.000403
age1 -0.000403 0.000418 0.007736 -0.002043 -0.002831 0.001660
age2 -0.000424 -0.000282 -0.002043 0.006542 -0.001200 -0.000575
age3 -0.000080 -0.000127 -0.002831 -0.001200 0.006730 -0.000633
smoke1:age1 0.000418 -0.000403 0.001660 -0.000575 -0.000633 0.007736
smoke1:age2 -0.000282 -0.000424 -0.000575 0.001162 0.000008 -0.002043
smoke1:age3 -0.000127 -0.000080 -0.000633 0.000008 0.001353 -0.002831
smoke1:age2 smoke1:age3
(Intercept) -0.000282 -0.000127
smoke1 -0.000424 -0.000080
age1 -0.000575 -0.000633
age2 0.001162 0.000008
age3 0.000008 0.001353
smoke1:age1 -0.002043 -0.002831
smoke1:age2 0.006542 -0.001200
smoke1:age3 -0.001200 0.006730
(Intercept) smoke1 age1 age2 age3 smoke1:age1
(Intercept) 0.008052 0.001856 -0.000403 -0.000424 -0.000080 0.000418
smoke1 0.001856 0.008052 0.000418 -0.000282 -0.000127 -0.000403
age1 -0.000403 0.000418 0.007736 -0.002043 -0.002831 0.001660
age2 -0.000424 -0.000282 -0.002043 0.006542 -0.001200 -0.000575
age3 -0.000080 -0.000127 -0.002831 -0.001200 0.006730 -0.000633
smoke1:age1 0.000418 -0.000403 0.001660 -0.000575 -0.000633 0.007736
smoke1:age2 -0.000282 -0.000424 -0.000575 0.001162 0.000008 -0.002043
smoke1:age3 -0.000127 -0.000080 -0.000633 0.000008 0.001353 -0.002831
smoke1:age2 smoke1:age3
(Intercept) -0.000282 -0.000127
smoke1 -0.000424 -0.000080
age1 -0.000575 -0.000633
age2 0.001162 0.000008
age3 0.000008 0.001353
smoke1:age1 -0.002043 -0.002831
smoke1:age2 0.006542 -0.001200
smoke1:age3 -0.001200 0.006730
Eyeballing the three matrices, we can see that they are all almost identical. We can confirm this by computing the difference between the independence and exchangeable robust variance matrices:
(Intercept) smoke1 age1 age2
(Intercept) -2.758210e-16 -2.628106e-16 2.011195e-17 7.453890e-17
smoke1 -2.625938e-16 -2.775558e-16 -3.198396e-17 9.849977e-17
age1 1.734723e-17 -2.986977e-17 -1.118897e-16 1.209970e-16
age2 7.274997e-17 1.014271e-16 1.235990e-16 -2.619432e-16
age3 2.923280e-17 -4.255494e-18 -1.704366e-16 -8.651933e-17
smoke1:age1 -2.992398e-17 1.745565e-17 -3.187554e-16 1.431147e-16
smoke1:age2 1.012645e-16 7.199102e-17 1.427894e-16 -2.803747e-16
smoke1:age3 -4.038653e-18 2.905662e-17 2.873136e-17 1.344411e-17
age3 smoke1:age1 smoke1:age2 smoke1:age3
(Intercept) 1.985445e-17 -3.230922e-17 9.855398e-17 4.526544e-18
smoke1 4.553649e-18 2.049142e-17 7.443048e-17 2.016616e-17
age1 -1.687019e-16 -3.185386e-16 1.440905e-16 2.764716e-17
age2 -8.608565e-17 1.430063e-16 -2.818926e-16 1.349832e-17
age3 1.101549e-16 2.808084e-17 1.342039e-17 -1.788934e-16
smoke1:age1 2.634611e-17 -1.110223e-16 1.222980e-16 -1.695692e-16
smoke1:age2 1.309174e-17 1.214306e-16 -2.602085e-16 -8.760354e-17
smoke1:age3 -1.791102e-16 -1.687019e-16 -8.760354e-17 1.084202e-16
In theory all the robust variance matrices should differ, but here, perhaps due to the nature of the data, they are very close to each other.
We can also access the scale parameter estimate from the model:
We can see that there is very minimal overdispersion here.
We can derive the scale estimate ourselves by computing the method-of-moments dispersion estimate by hand from Pearson residuals:
[1] 2148
The number of Pearson residuals should match the number of observations used in the model fit, which is the same as the number of rows in ohio_tbl. Each row is an observation of a child, where each child is observed multiple times.
[1] 1.003738
The hand-computed value matches the model summary scale.
To verify the naive standard errors reported by gee, we refit with scale.fix = TRUE, which fixes the scale parameter at 1.
Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
running glm to get initial regression estimate
(Intercept) smoke1 age1 age2 age3 smoke1:age1
-1.69656029 0.13622036 0.05951185 0.15681009 0.06641432 -0.11504072
smoke1:age2 smoke1:age3
0.06987921 0.02539315
[1] 1
We can see that the scale is fixed at 1.
We can check our proposed relationship: \[ \text{Naive SE (scale fixed)} \times \sqrt{\hat\phi} = \text{Naive SE (reported in original fit)}. \]
# A tibble: 8 × 3
term se_rescaled se_original
<chr> <dbl> <dbl>
1 (Intercept) 0.0624 0.0624
2 smoke1 0.0624 0.0624
3 age1 0.107 0.107
4 age2 0.104 0.104
5 age3 0.106 0.106
6 smoke1:age1 0.107 0.107
7 smoke1:age2 0.104 0.104
8 smoke1:age3 0.106 0.106
Since we are using effect coding, we can extract the coefficient for for age 10: \[ \beta_\text{Age10} = - (\beta_\text{Age7} + \beta_\text{Age8} + \beta_\text{Age9}), \] since the age 10 group is the reference group for the age factor.
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) -1.697 0.062 -27.188 0.090 -18.907
smoke1 0.136 0.062 2.183 0.090 1.518
age1 0.060 0.107 0.557 0.088 0.677
age2 0.157 0.104 1.509 0.081 1.939
age3 0.066 0.106 0.627 0.082 0.810
smoke1:age1 -0.115 0.107 -1.077 0.088 -1.308
smoke1:age2 0.070 0.104 0.673 0.081 0.864
smoke1:age3 0.025 0.106 0.240 0.082 0.310
[1] -0.2827363
This value matches page 273 of Fahrmeir and Tutz (2001).
coefs_indep Notation
coefs_indep is a matrix, with model terms in rows and summary statistics in columns.
coefs_indep["age1", "Estimate"] gets one cellcoefs_indep[c("age1", "age2", "age3"), "Estimate"] gets a vector of selected rows from one columncoefs_indep[, "Naive S.E."] gets the full Naive S.E. columnIn matrix indexing, the format is always [rows, columns].
The probability of infection for a child who is age 10 is lower than the average probability across all age groups, when smoking status is fixed.
Fahrmeir and Tutz (2001) also reports a standard deviation for this coefficient, which we can also reproduce. The robust standard error for this linear combination is:
age1 age2 age3
age1 0.0077 -0.0020 -0.0028
age2 -0.0020 0.0065 -0.0012
age3 -0.0028 -0.0012 0.0067
[,1]
[1,] 0.09411951
The robust z-values (or alternatively the estimates and robust standard errors) reported in the model output tell us that the interaction terms are not statistically significant.
Therefore we fit models without the interaction terms, and try to recover the main effect coefficients in Fahrmeir and Tutz (2001) Table 6.9.
Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
running glm to get initial regression estimate
(Intercept) smoke1 age1 age2 age3
-1.69503955 0.13623130 0.08731839 0.14133045 0.05956087
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) -1.695 0.062 -27.221 0.090 -18.920
smoke1 0.136 0.062 2.203 0.089 1.529
age1 0.087 0.103 0.849 0.086 1.015
age2 0.141 0.102 1.390 0.079 1.779
age3 0.060 0.104 0.575 0.080 0.742
The parameter estimates as well as robust and naive standard errors agree with Table 6.9 of Fahrmeir and Tutz (2001).
For the second column in Table 6.9, we do the same for the exchangeable and unstructured working correlations:
Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
running glm to get initial regression estimate
(Intercept) smoke1 age1 age2 age3
-1.69503955 0.13623130 0.08731839 0.14133045 0.05956087
Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
running glm to get initial regression estimate
(Intercept) smoke1 age1 age2 age3
-1.69503955 0.13623130 0.08731839 0.14133045 0.05956087
(Intercept) smoke1 age1 age2 age3
-1.695 0.136 0.087 0.141 0.060
(Intercept) smoke1 age1 age2 age3
-1.696 0.131 0.087 0.141 0.060
The unstructured coefficients agree (up to rounding) with the second column of Table 6.9 in Fahrmeir and Tutz (2001). Dennis I’m unsure why the column in F&T is called Exchangeable/Unspecified
Here we can see that the robust variances differ more across working correlation structures than in the interaction model. This supports the suspicion that the earlier agreement was a biproduct of the data.
(Intercept) smoke1 age1 age2 age3
(Intercept) 0.008026 0.001856 -0.000653 -0.000287 -0.000005
smoke1 0.001856 0.007940 -0.000253 0.000224 0.000081
age1 -0.000653 -0.000253 0.007407 -0.001895 -0.002703
age2 -0.000287 0.000224 -0.001895 0.006308 -0.001235
age3 -0.000005 0.000081 -0.002703 -0.001235 0.006448
(Intercept) smoke1 age1 age2 age3
(Intercept) 0.008029 0.001863 -0.000658 -0.000284 -0.000001
smoke1 0.001863 0.007929 -0.000278 0.000236 0.000102
age1 -0.000658 -0.000278 0.007407 -0.001895 -0.002703
age2 -0.000284 0.000236 -0.001895 0.006308 -0.001235
age3 -0.000001 0.000102 -0.002703 -0.001235 0.006448
(Intercept) smoke1 age1 age2 age3
(Intercept) 0.008044 0.001900 -0.000665 -0.000284 0.000000
smoke1 0.001900 0.007918 -0.000300 0.000230 0.000103
age1 -0.000665 -0.000300 0.007404 -0.001894 -0.002702
age2 -0.000284 0.000230 -0.001894 0.006305 -0.001235
age3 0.000000 0.000103 -0.002702 -0.001235 0.006445
R version 4.4.2 (2024-10-31)
Platform: x86_64-apple-darwin20
Running under: macOS Sequoia 15.7.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Australia/Melbourne
tzcode source: internal
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] dplyr_1.2.0 geepack_1.3.13 gee_4.13-29 conflicted_1.2.0
loaded via a namespace (and not attached):
[1] vctrs_0.7.1 cli_3.6.5 knitr_1.51 rlang_1.1.7
[5] xfun_0.56 purrr_1.2.1 renv_1.1.7 generics_0.1.4
[9] jsonlite_2.0.0 glue_1.8.0 backports_1.5.0 htmltools_0.5.9
[13] rmarkdown_2.30 evaluate_1.0.5 tibble_3.3.1 MASS_7.3-61
[17] fastmap_1.2.0 yaml_2.3.12 lifecycle_1.0.5 memoise_2.0.1
[21] compiler_4.4.2 pkgconfig_2.0.3 tidyr_1.3.2 rstudioapi_0.18.0
[25] digest_0.6.39 R6_2.6.1 utf8_1.2.6 tidyselect_1.2.1
[29] pillar_1.11.1 magrittr_2.0.4 tools_4.4.2 broom_1.0.12
[33] cachem_1.1.0
packages <- c("conflicted", "gee", "geepack", "dplyr")
for (pkg in packages) {
cit <- citation(pkg)
# Check if multiple citations
if (length(cit) > 1) {
cat("- **", pkg, "**:\n", sep = "")
for (i in seq_along(cit)) {
cit_text <- format(cit[i], style = "text")
cat(" - ", cit_text, "\n\n", sep = "")
}
} else {
cit_text <- format(cit, style = "text")
cat("- **", pkg, "**: ", cit_text, "\n\n", sep = "")
}
}conflicted: Wickham H (2023). conflicted: An Alternative Conflict Resolution Strategy. R package version 1.2.0, https://github.com/r-lib/conflicted, https://conflicted.r-lib.org/.
gee: Carey VJ (2024). gee: Generalized Estimation Equation Solver. R package version 4.13-29.
geepack:
Halekoh U, Højsgaard S, Yan J (2006). “The R Package geepack for Generalized Estimating Equations.” Journal of Statistical Software, 15/2, 1-11.
Yan J, Fine JP (2004). “Estimating Equations for Association Structures.” Statistics in Medicine, 23, 859-880.
Yan J (2002). “geepack: Yet Another Package for Generalized Estimating Equations.” R-News, 2/3, 12-14.
Xu, Z., Fine, P. J, Song, W., Yan, J. (2025). “On GEE for mean-variance-correlation models: Variance estimation and model selection.” Statistics in Medicine, 44, 1-2.
dplyr: Wickham H, François R, Henry L, Müller K, Vaughan D (2026). dplyr: A Grammar of Data Manipulation. R package version 1.2.0, https://github.com/tidyverse/dplyr, https://dplyr.tidyverse.org.