---
title: "Introduction to the Ebrahim-Farrington Goodness-of-Fit Test"
author: "Ebrahim Khaled Ebrahim"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to the Ebrahim-Farrington Goodness-of-Fit Test}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

## Introduction

The **ebrahim.gof** package implements the Ebrahim-Farrington goodness-of-fit test for logistic regression models. This test is particularly effective for binary data and sparse datasets, providing an improved alternative to the traditional Hosmer-Lemeshow test.

## Background and Motivation

Goodness-of-fit testing is crucial in logistic regression to assess whether the fitted model adequately describes the data. The most commonly used test is the Hosmer-Lemeshow test, but it has several limitations:

1. **Limited power** for detecting certain types of model misspecification
2. **Dependency on grouping strategy** which can affect results
3. **Poor performance** with sparse data or continuous covariates

The Ebrahim-Farrington test addresses these limitations by using a modified Pearson chi-square statistic based on Farrington's (1996) theoretical framework, but simplified for practical implementation with binary data.

## Installation and Loading

```{r eval=FALSE}
# Install from GitHub
devtools::install_github("ebrahimkhaled/ebrahim.gof")

# Load the package
library(ebrahim.gof)
```

```{r message=FALSE}
library(ebrahim.gof)
```

## Basic Usage

The main function `ef.gof()` performs the goodness-of-fit test:

```{r}
# Simulate binary data
set.seed(123)
n <- 500
x <- rnorm(n)
linpred <- 0.5 + 1.2 * x
prob <- plogis(linpred) # Convert to probabilities
y <- rbinom(n, 1, prob)

# Fit logistic regression
model <- glm(y ~ x, family = binomial())
predicted_probs <- fitted(model)

# Perform Ebrahim-Farrington test
result <- ef.gof(y, predicted_probs, G = 10)
print(result)
```

## Understanding the Test Statistic

For binary data with automatic grouping, the Ebrahim-Farrington test statistic is:

$$Z_{EF} = \frac{T_{EF} - (G - 2)}{\sqrt{2(G-2)}}$$

where:

- $T_{EF}$ is the modified Pearson chi-square statistic
- $G$ is the number of groups

Under the null hypothesis $H_0$ that the model fits the data adequately, $Z_{EF}$ approximately follows a standard normal distribution.

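
The standardization above can be sketched in a few lines of base R. This is only an illustration of the formula, not the package's internal code: the helper name `ef_z_score()`, the value `t_ef = 14.2`, and the use of an upper-tail p-value are assumptions made for the example.

```r
# Hypothetical helper: standardize a modified Pearson statistic T_EF
# using the centering (G - 2) and scaling sqrt(2 * (G - 2)) from the formula.
ef_z_score <- function(t_ef, G) {
  stopifnot(is.numeric(t_ef), is.numeric(G), G > 2)
  (t_ef - (G - 2)) / sqrt(2 * (G - 2))
}

# Made-up statistic for illustration, with G = 10 groups
z <- ef_z_score(t_ef = 14.2, G = 10) # (14.2 - 8) / sqrt(16) = 1.55

# Assumption: large values of T_EF indicate lack of fit,
# so the upper tail of the standard normal is used
p_upper <- pnorm(z, lower.tail = FALSE)

c(z = z, p_value = p_upper)
```

The centering and scaling constants, $G - 2$ and $2(G - 2)$, are the mean and variance of a chi-square distribution with $G - 2$ degrees of freedom, which is why the standardized statistic can be compared against a standard normal reference.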
## Comparing with Different Group Numbers

The number of groups $G$ can affect the test's performance:

```{r}
# Test with different numbers of groups
group_sizes <- c(4, 8, 10, 15, 20)
results <- data.frame(
  Groups = group_sizes,
  P_value = sapply(group_sizes, function(g) {
    ef.gof(y, predicted_probs, G = g)$p_value
  })
)
print(results)
```

## Comparison with Hosmer-Lemeshow Test

Let's compare the Ebrahim-Farrington test with the traditional Hosmer-Lemeshow test:

```{r}
# Hosmer-Lemeshow test (requires ResourceSelection package)
if (requireNamespace("ResourceSelection", quietly = TRUE)) {
  library(ResourceSelection)

  # Perform both tests
  ef_result <- ef.gof(y, predicted_probs, G = 10)
  hl_result <- hoslem.test(y, predicted_probs, g = 10)

  # Compare results; unname() drops the "X-squared" label so that
  # data.frame() does not pick it up as a row name
  comparison <- data.frame(
    Test = c("Ebrahim-Farrington", "Hosmer-Lemeshow"),
    P_value = c(ef_result$p_value, hl_result$p.value),
    Test_Statistic = c(ef_result$Test_Statistic, unname(hl_result$statistic))
  )
  print(comparison)
} else {
  cat("ResourceSelection package not available for comparison\n")
}
```

## New in 2.0.0: Directed test, ensemble, and the full battery

Version 2.0.0 turns the package into a full goodness-of-fit toolkit. The **Directed EF (DEF)** test concentrates power on calibration-curve shape directions, `def.ensemble.gof()` combines the DEF bases via the Cauchy combination test, and `run.all.gof()` runs a whole battery of GOF tests at once.

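
As a rough illustration of the Cauchy combination rule mentioned above, the sketch below combines a set of p-values in base R. It shows the generic rule only, under the assumption of equal weights; the helper name `cauchy_combine()` and the example p-values are hypothetical, and `def.ensemble.gof()` may differ in weighting or other details.

```r
# Hypothetical helper: generic Cauchy combination of p-values.
# Each p-value is mapped to a standard Cauchy variate, the variates are
# averaged with weights summing to 1, and the result is mapped back.
cauchy_combine <- function(p, w = rep(1 / length(p), length(p))) {
  stopifnot(
    length(p) == length(w),
    all(p > 0 & p < 1),
    abs(sum(w) - 1) < 1e-8
  )
  t_stat <- sum(w * tan((0.5 - p) * pi))
  # Upper tail of the standard Cauchy distribution
  0.5 - atan(t_stat) / pi
}

# Made-up p-values standing in for three component tests
p_components <- c(0.012, 0.20, 0.047)
p_combined <- cauchy_combine(p_components)
p_combined
```

Because the combined statistic is dominated by the smallest p-values, one strongly significant component is enough to pull the combined p-value down, while identical inputs are returned unchanged. The package's own functions are used as follows: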
```{r}
# Directed Ebrahim-Farrington test (takes the fitted model)
def.gof(model)                     # default poly3 basis
def.gof(model, basis = "ensemble") # combine all three bases (Cauchy)

# Ensemble of the three DEF bases
def.ensemble.gof(model)
def.ensemble.gof(model, add_ef = TRUE) # add the omnibus EF
```

`run.all.gof()` returns one tidy data frame, one row per test:

```{r}
run.all.gof(model)
```

Add `include_slow = TRUE` to also run the opt-in slow tests (le Cessie, the GAM-based tests, Stute-Zhu, eHL, BAGofT, and the Lai & Liu standardized-power test), or pass `tests = c("EF", "DEF.poly3", "HL")` to run a chosen subset.

### Powerful, but not liberal

Most GOF tests for logistic regression are **partition-based** (they group the data and compare observed with expected counts), and that is the family `ef.gof()` and `def.gof()` belong to. A key property of the directed tests is that they gain power **without** inflating the type I error rate. In a Monte Carlo study (n = 500, 1000 replications, α = 0.05), the partition tests compare as follows:

| Test | Size (null) | Power: quadratic | Power: wrong link |
|---|:---:|:---:|:---:|
| Hosmer–Lemeshow (decile) | 0.060 | 0.588 | 0.179 |
| Hosmer–Lemeshow (equal-width) | 0.053 | 0.332 | 0.244 |
| Pigeon–Heyse | 0.035 | 0.535 | 0.133 |
| EF (omnibus) | 0.058 | 0.480 | 0.218 |
| Tsiatis | 0.056 | 0.574 | 0.162 |
| Xie | 0.042 | 0.557 | 0.147 |
| DEF (poly3) | 0.060 | 0.709 | 0.404 |
| DEF (ensemble, vote) | 0.066 | 0.767 | 0.468 |

DEF and its vote ensemble are the most powerful in the family while keeping the size near the nominal 0.05 (they are not liberal), and they roughly double the power of Hosmer–Lemeshow, Tsiatis, and Xie on the wrong-link misfit.

Partition-based tests are intuitive and work for sparse data and continuous covariates (where the Pearson and deviance chi-square tests fail), but their result depends on the grouping choice and the simpler members (HL) can have low power. DEF keeps the intuitive fitted-probability grouping but *directs* the test at calibration-curve shapes, which is why it tops the table without losing size control.

> Note: as of 2.0.0, `ef.gof()` defaults to the chi-square reference
> (`method = "chisq"`); use `method = "normal"` for the version 1.0.0 behaviour.

## Power Analysis Example

Let's examine the power of the test to detect model misspecification:

```{r}
# Simulate the power of both tests under model misspecification
simulate_power <- function(n, beta_quad = 0.1, n_sims = 100, G = 10) {
  has_hl <- requireNamespace("ResourceSelection", quietly = TRUE)
  rejections_ef <- 0
  rejections_hl <- 0

  for (i in seq_len(n_sims)) {
    # Generate data from the true model, which includes a quadratic term
    x <- runif(n, -2, 2)
    linpred_true <- 0 + x + beta_quad * x^2
    prob_true <- plogis(linpred_true)
    y <- rbinom(n, 1, prob_true)

    # Fit the misspecified linear model (quadratic term omitted)
    model_mis <- glm(y ~ x, family = binomial())
    pred_probs <- fitted(model_mis)

    # Ebrahim-Farrington test
    ef_test <- ef.gof(y, pred_probs, G = G)
    if (ef_test$p_value < 0.05) rejections_ef <- rejections_ef + 1

    # Hosmer-Lemeshow test (if available)
    if (has_hl) {
      hl_test <- ResourceSelection::hoslem.test(y, pred_probs, g = G)
      if (hl_test$p.value < 0.05) rejections_hl <- rejections_hl + 1
    }
  }

  list(
    power_ef = rejections_ef / n_sims,
    power_hl = if (has_hl) rejections_hl / n_sims else NA
  )
}

# Estimate power for different sample sizes. Each setting is simulated once,
# so both tests are evaluated on the same simulated datasets.
set.seed(789)
sample_sizes <- c(200, 500, 1000)
power_list <- lapply(sample_sizes, function(n) {
  simulate_power(n, beta_quad = 0.15, n_sims = 50)
})

power_results <- data.frame(
  n = sample_sizes,
  EbrahimFarrington_Power = sapply(power_list, `[[`, "power_ef")
)
if (requireNamespace("ResourceSelection", quietly = TRUE)) {
  power_results$HosmerLemeshow_Power <- sapply(power_list, `[[`, "power_hl")
}
print(power_results)
```

## Handling Grouped Data (Original Farrington Test)

For datasets with grouped observations (multiple trials per covariate pattern), you can use the original Farrington test:

```{r}
# Simulate grouped data
set.seed(456)
n_groups <- 30
m_trials <- sample(5:20, n_groups, replace = TRUE)
x_grouped <- rnorm(n_groups)
prob_grouped <- plogis(0.2 + 0.8 * x_grouped)
y_grouped <- rbinom(n_groups, m_trials, prob_grouped)

# Create data frame and fit model
data_grouped <- data.frame(
  successes = y_grouped,
  trials = m_trials,
  x = x_grouped
)
model_grouped <- glm(
  cbind(successes, trials - successes) ~ x,
  data = data_grouped,
  family = binomial()
)
predicted_probs_grouped <- fitted(model_grouped)

# Original Farrington test for grouped data
result_grouped <- ef.gof(
  y_grouped,
  predicted_probs_grouped,
  model = model_grouped,
  m = m_trials,
  G = NULL # No automatic grouping for original test
)
print(result_grouped)
```

## Practical Guidelines

### When to Use Each Test Mode

1. **Ebrahim-Farrington mode** (`G` specified):
   - Binary response data (0/1)
   - Want automatic grouping
   - Computationally efficient
   - Recommended for most applications

2. **Original Farrington mode** (`m` provided, `G = NULL`):
   - Grouped binomial data
   - Multiple trials per covariate pattern
   - Requires fitted model object

### Choosing the Number of Groups

- **G = 10**: Standard choice, comparable to Hosmer-Lemeshow
- **G = 4-8**: For smaller datasets (n < 200)
- **G = 20+**: For larger datasets (n > 20000)
- **Rule of thumb**: Ensure each group sample size is large enough.

### Interpreting Results

- **p-value > 0.05**: No evidence of lack of fit (fail to reject H₀)
- **p-value ≤ 0.05**: Evidence of model misspecification (reject H₀)
- **Very small p-values**: Strong evidence against model adequacy

## Advantages over Hosmer-Lemeshow Test

1. **Better Power**: More sensitive to model misspecification
2. **Theoretical Foundation**: Based on rigorous asymptotic theory
3. **Sparse Data Handling**: Specifically designed for fully sparse data
4. **Computational Efficiency**: Simplified calculations for binary data

## Limitations and Considerations

1. **Group Selection**: Results can vary with different numbers of groups
2. **Sample Size**: More reliable with larger sample sizes (n ≥ 100)
3. **Model Complexity**: Performance with highly complex models needs further study

## References

1. Farrington, C. P. (1996). On Assessing Goodness of Fit of Generalized Linear Models to Sparse Data. *Journal of the Royal Statistical Society. Series B (Methodological)*, 58(2), 349-360.

2. Ebrahim, Khaled Ebrahim (2025). Goodness-of-Fits Tests and Calibration Machine Learning Algorithms for Logistic Regression Model with Sparse Data. *Master's Thesis*, Alexandria University.

3. Hosmer, D. W., & Lemeshow, S. (2000). Applied Logistic Regression, Second Edition. New York: Wiley.

4. Hosmer, D. W., & Lemeshow, S. (1980). A goodness-of-fit test for the multiple logistic regression model. *Communications in Statistics - Theory and Methods*, 9(10), 1043–1069. https://doi.org/10.1080/03610928008827941

The Ebrahim-Farrington test provides a powerful and practical tool for assessing goodness-of-fit in logistic regression, particularly for binary data and sparse datasets. Its simplified implementation makes it accessible for routine use while maintaining strong theoretical foundations.