---
title: "Introduction to the Ebrahim-Farrington Goodness-of-Fit Test"
author: "Ebrahim Khaled Ebrahim"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to the Ebrahim-Farrington Goodness-of-Fit Test}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

## Introduction

The **ebrahim.gof** package implements the Ebrahim-Farrington goodness-of-fit test for logistic regression models. This test is particularly effective for binary data and sparse datasets, providing an improved alternative to the traditional Hosmer-Lemeshow test.

## Background and Motivation

Goodness-of-fit testing is crucial in logistic regression to assess whether the fitted model adequately describes the data. The most commonly used test is the Hosmer-Lemeshow test, but it has several limitations:

1. **Limited power** for detecting certain types of model misspecification
2. **Dependency on grouping strategy** which can affect results
3. **Poor performance** with sparse data or continuous covariates

The Ebrahim-Farrington test addresses these limitations by using a modified Pearson chi-square statistic based on Farrington's (1996) theoretical framework, but simplified for practical implementation with binary data.

## Installation and Loading

```{r eval=FALSE}
# Install from GitHub
devtools::install_github("ebrahimkhaled/ebrahim.gof")

# Load the package
library(ebrahim.gof)
```

```{r message=FALSE}
library(ebrahim.gof)
```

## Basic Usage

The main function `ef.gof()` performs the goodness-of-fit test:

```{r}
# Simulate binary data
set.seed(123)
n <- 500
x <- rnorm(n)
linpred <- 0.5 + 1.2 * x
prob <- plogis(linpred)  # Convert to probabilities
y <- rbinom(n, 1, prob)

# Fit logistic regression
model <- glm(y ~ x, family = binomial())
predicted_probs <- fitted(model)

# Perform Ebrahim-Farrington test
result <- ef.gof(y, predicted_probs, G = 10)
print(result)
```

## Understanding the Test Statistic

For binary data with automatic grouping, the Ebrahim-Farrington test statistic in its standardized (normal-reference) form is:

$$Z_{EF} = \frac{T_{EF} - (G - 2)}{\sqrt{2(G-2)}}$$

Where:

- $T_{EF}$ is the modified Pearson chi-square statistic
- $G$ is the number of groups
- $Z_{EF}$ approximately follows a standard normal distribution under $H_0$

The null hypothesis is that the model fits the data adequately.
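The arithmetic behind the formula can be checked by hand. The sketch below uses a made-up value of $T_{EF}$ and makes two assumptions that are not stated above: that large values indicate lack of fit (upper-tail p-value), and that the chi-square reference has $G - 2$ degrees of freedom, which is the distribution whose mean $G - 2$ and variance $2(G - 2)$ appear in the standardization. It is an illustration of the formula, not the internals of `ef.gof()`.

```{r}
# Hypothetical inputs, chosen only to illustrate the arithmetic
ef_T_demo <- 14.2   # modified Pearson statistic T_EF (made-up value)
ef_G_demo <- 10     # number of groups

# Standardize with the mean and variance from the formula above
ef_Z_demo <- (ef_T_demo - (ef_G_demo - 2)) / sqrt(2 * (ef_G_demo - 2))

# Assumption: large values signal lack of fit, so use the upper tail
p_normal_demo <- pnorm(ef_Z_demo, lower.tail = FALSE)

# Assumption: chi-square reference with G - 2 degrees of freedom
p_chisq_demo <- pchisq(ef_T_demo, df = ef_G_demo - 2, lower.tail = FALSE)

c(Z = ef_Z_demo, p_normal = p_normal_demo, p_chisq = p_chisq_demo)
```

The two p-values generally differ for small $G$, because a chi-square distribution with few degrees of freedom is noticeably skewed.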

## Comparing with Different Group Numbers

The number of groups $G$ can affect the test's performance:

```{r}
# Test with different numbers of groups
group_sizes <- c(4, 8, 10, 15, 20)
results <- data.frame(
  Groups = group_sizes,
  P_value = sapply(group_sizes, function(g) {
    ef.gof(y, predicted_probs, G = g)$p_value
  })
)
print(results)
```
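To see what changing $G$ does to the data, it can help to build the groups by hand. The following is a minimal sketch that assumes equal-size groups formed by ranking the fitted probabilities; the helper `group_by_fitted()` and the `demo_*` objects are hypothetical names introduced here, and the grouping used inside `ef.gof()` may differ.

```{r}
# Small simulated example, kept separate from the objects used above
set.seed(1)
demo_n <- 400
demo_x <- rnorm(demo_n)
demo_y <- rbinom(demo_n, 1, plogis(0.5 + 1.2 * demo_x))
demo_p <- fitted(glm(demo_y ~ demo_x, family = binomial()))

# Assign observations to G groups of (nearly) equal size by rank of p
group_by_fitted <- function(p, G) {
  ceiling(rank(p, ties.method = "first") * G / length(p))
}

demo_g <- group_by_fitted(demo_p, G = 10)

# Observed and expected number of events per group
demo_table <- data.frame(
  group    = sort(unique(demo_g)),
  n        = as.vector(table(demo_g)),
  observed = as.vector(tapply(demo_y, demo_g, sum)),
  expected = as.vector(tapply(demo_p, demo_g, sum))
)
demo_table
```

With fewer groups each row aggregates more observations, which stabilizes the expected counts but can average away local misfit.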

## Comparison with Hosmer-Lemeshow Test

Let's compare the Ebrahim-Farrington test with the traditional Hosmer-Lemeshow test:

```{r}
# Hosmer-Lemeshow test (requires ResourceSelection package)
if (requireNamespace("ResourceSelection", quietly = TRUE)) {
  library(ResourceSelection)
  
  # Perform both tests
  ef_result <- ef.gof(y, predicted_probs, G = 10)
  hl_result <- hoslem.test(y, predicted_probs, g = 10)
  
  # Compare results
  comparison <- data.frame(
    Test = c("Ebrahim-Farrington", "Hosmer-Lemeshow"),
    P_value = c(ef_result$p_value, hl_result$p.value),
    Test_Statistic = c(ef_result$Test_Statistic, hl_result$statistic)
  )
  print(comparison)
} else {
  cat("ResourceSelection package not available for comparison\n")
}
```

## New in 2.0.0: Directed test, ensemble, and the full battery

Version 2.0.0 turns the package into a full goodness-of-fit toolkit. The
**Directed EF (DEF)** test concentrates power on calibration-curve shape
directions, `def.ensemble.gof()` combines the DEF bases via the Cauchy
combination test, and `run.all.gof()` runs a whole battery of GOF tests at once.

```{r}
# Directed Ebrahim-Farrington test (takes the fitted model)
def.gof(model)                       # default poly3 basis
def.gof(model, basis = "ensemble")   # combine all three bases (Cauchy)

# Ensemble of the three DEF bases
def.ensemble.gof(model)
def.ensemble.gof(model, add_ef = TRUE)   # add the omnibus EF
```
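The Cauchy combination test itself is simple to write down: each p-value is mapped to a standard Cauchy variate, the variates are averaged, and the average is converted back to a p-value. The sketch below assumes equal weights and uses made-up p-values; `cauchy_combine()` is a hypothetical helper, not a function exported by the package, and `def.ensemble.gof()` may weight or handle edge cases differently.

```{r}
# Equal-weight Cauchy combination of p-values (illustrative helper)
cauchy_combine <- function(p, w = rep(1 / length(p), length(p))) {
  stopifnot(all(p > 0 & p < 1), abs(sum(w) - 1) < 1e-8)
  # Map each p-value to a Cauchy variate and take the weighted mean
  stat <- sum(w * tan((0.5 - p) * pi))
  # Upper-tail probability of the standard Cauchy distribution
  pcauchy(stat, lower.tail = FALSE)
}

# Made-up p-values standing in for three DEF bases
p_bases_demo <- c(0.012, 0.200, 0.048)
cauchy_combine(p_bases_demo)
```

The combined p-value is dominated by the smallest inputs, which is why the ensemble can stay powerful when only one basis matches the shape of the misfit.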

`run.all.gof()` returns one tidy data frame, one row per test:

```{r}
run.all.gof(model)
```

Add `include_slow = TRUE` to also run the opt-in slow tests (le Cessie, the
GAM-based tests, Stute-Zhu, eHL, BAGofT, and the Lai & Liu standardized-power
test), or pass `tests = c("EF", "DEF.poly3", "HL")` to run a chosen subset.

### Powerful, but not liberal

Most GOF tests for logistic regression are **partition-based** (they group the
data and compare observed with expected counts), and that is the family `ef.gof()`
and `def.gof()` belong to. A key property of the directed tests is that they gain
power **without** inflating the type I error rate. In a Monte Carlo study (n = 500,
1000 replications, α = 0.05), the partition tests compare as follows:

| Test | Size (null) | Power: quadratic | Power: wrong link |
|---|:---:|:---:|:---:|
| Hosmer–Lemeshow (decile) | 0.060 | 0.588 | 0.179 |
| Hosmer–Lemeshow (equal-width) | 0.053 | 0.332 | 0.244 |
| Pigeon–Heyse | 0.035 | 0.535 | 0.133 |
| EF (omnibus) | 0.058 | 0.480 | 0.218 |
| Tsiatis | 0.056 | 0.574 | 0.162 |
| Xie | 0.042 | 0.557 | 0.147 |
| DEF (poly3) | 0.060 | 0.709 | 0.404 |
| DEF (ensemble, vote) | 0.066 | 0.767 | 0.468 |

DEF and its vote ensemble are the most powerful in the family while keeping the
size near the nominal 0.05 (0.060 and 0.066), so they are not liberal. On the
wrong-link misfit they more than double the power of decile Hosmer–Lemeshow,
Tsiatis, and Xie (0.404 and 0.468 against 0.179, 0.162, and 0.147).
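The power comparison can be reproduced directly from the table above. This short sketch copies the wrong-link column and expresses each test's power relative to DEF (poly3); the object names are introduced here for illustration only.

```{r}
# Wrong-link power, copied from the table above
partition_power <- data.frame(
  test = c("HL (decile)", "HL (equal-width)", "Pigeon-Heyse", "EF (omnibus)",
           "Tsiatis", "Xie", "DEF (poly3)", "DEF (ensemble, vote)"),
  wrong_link = c(0.179, 0.244, 0.133, 0.218, 0.162, 0.147, 0.404, 0.468)
)

# Power of DEF (poly3) divided by the power of each test
def_poly3_power <- partition_power$wrong_link[partition_power$test == "DEF (poly3)"]
partition_power$def_poly3_ratio <- round(def_poly3_power / partition_power$wrong_link, 2)

partition_power
```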

Partition-based tests are intuitive and work for sparse data and continuous
covariates (where the Pearson and deviance chi-square tests fail), but their
result depends on the grouping choice and the simpler members (HL) can have low
power. DEF keeps the intuitive fitted-probability grouping but *directs* the test
at calibration-curve shapes, which is why it tops the table without losing size
control.

> Note: as of 2.0.0, `ef.gof()` defaults to the chi-square reference
> (`method = "chisq"`); use `method = "normal"` for the version 1.0.0 behaviour.

## Power Analysis Example

Let's examine the power of the test to detect model misspecification:

```{r}
# Function to simulate power under model misspecification
simulate_power <- function(n, beta_quad = 0.1, n_sims = 100, G = 10) {
  rejections_ef <- 0
  rejections_hl <- 0
  
  for (i in 1:n_sims) {
    # Generate data with quadratic term (true model)
    x <- runif(n, -2, 2)
    linpred_true <- 0 + x + beta_quad * x^2
    prob_true <- plogis(linpred_true)
    y <- rbinom(n, 1, prob_true)
    
    # Fit misspecified linear model (omitting quadratic term)
    model_mis <- glm(y ~ x, family = binomial())
    pred_probs <- fitted(model_mis)
    
    # Ebrahim-Farrington test
    ef_test <- ef.gof(y, pred_probs, G = G)
    if (ef_test$p_value < 0.05) rejections_ef <- rejections_ef + 1
    
    # Hosmer-Lemeshow test (if available)
    if (requireNamespace("ResourceSelection", quietly = TRUE)) {
      hl_test <- ResourceSelection::hoslem.test(y, pred_probs, g = G)
      if (hl_test$p.value < 0.05) rejections_hl <- rejections_hl + 1
    }
  }
  
  power_ef <- rejections_ef / n_sims
  power_hl <- if (requireNamespace("ResourceSelection", quietly = TRUE)) {
    rejections_hl / n_sims
  } else {
    NA
  }
  
  return(list(power_ef = power_ef, power_hl = power_hl))
}

# Calculate power for different sample sizes.
# Each sample size is simulated once, so both tests see the same datasets.
set.seed(2024)
sample_sizes <- c(200, 500, 1000)
power_list <- lapply(sample_sizes, function(n) {
  simulate_power(n, beta_quad = 0.15, n_sims = 50)
})

power_results <- data.frame(
  n = sample_sizes,
  EbrahimFarrington_Power = sapply(power_list, function(p) p$power_ef)
)

if (requireNamespace("ResourceSelection", quietly = TRUE)) {
  power_results$HosmerLemeshow_Power <- sapply(power_list, function(p) p$power_hl)
}

print(power_results)
```
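Because the example uses only 50 simulations per sample size to keep the vignette fast, the estimated powers are noisy. The sketch below quantifies that noise with the binomial standard error and an exact confidence interval; `mc_se()` is a hypothetical helper and the 30-out-of-50 rejection count is a made-up illustration.

```{r}
# Monte Carlo standard error of an estimated rejection rate
mc_se <- function(p_hat, n_sims) sqrt(p_hat * (1 - p_hat) / n_sims)

# Worst case (p_hat = 0.5) for a few simulation sizes
data.frame(
  n_sims = c(50, 100, 1000),
  max_se = mc_se(0.5, c(50, 100, 1000))
)

# Exact 95% interval for a hypothetical 30 rejections in 50 simulations
power_ci_demo <- binom.test(30, 50)$conf.int
power_ci_demo
```

Differences between tests that are smaller than a couple of standard errors should not be over-interpreted; increasing `n_sims` narrows the interval at the cost of run time.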

## Handling Grouped Data (Original Farrington Test)

For datasets with grouped observations (multiple trials per covariate pattern), you can use the original Farrington test:

```{r}
# Simulate grouped data
set.seed(456)
n_groups <- 30
m_trials <- sample(5:20, n_groups, replace = TRUE)
x_grouped <- rnorm(n_groups)
prob_grouped <- plogis(0.2 + 0.8 * x_grouped)
y_grouped <- rbinom(n_groups, m_trials, prob_grouped)

# Create data frame and fit model
data_grouped <- data.frame(
  successes = y_grouped,
  trials = m_trials,
  x = x_grouped
)

model_grouped <- glm(
  cbind(successes, trials - successes) ~ x,
  data = data_grouped,
  family = binomial()
)

predicted_probs_grouped <- fitted(model_grouped)

# Original Farrington test for grouped data
result_grouped <- ef.gof(
  y_grouped,
  predicted_probs_grouped,
  model = model_grouped,
  m = m_trials,
  G = NULL  # No automatic grouping for original test
)

print(result_grouped)
```
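For orientation, the statistic that Farrington's test modifies is the ordinary Pearson chi-square for grouped binomial data. The sketch below computes only that unmodified statistic on freshly simulated data (the `pg_*` names are introduced here); it does not reproduce the Farrington correction that `ef.gof()` applies.

```{r}
# Simulated grouped binomial data, separate from the example above
set.seed(789)
pg_k <- 30
pg_m <- sample(5:20, pg_k, replace = TRUE)           # trials per pattern
pg_x <- rnorm(pg_k)
pg_y <- rbinom(pg_k, pg_m, plogis(0.2 + 0.8 * pg_x))  # successes per pattern

pg_fit <- glm(cbind(pg_y, pg_m - pg_y) ~ pg_x, family = binomial())
pg_p <- fitted(pg_fit)

# Plain Pearson chi-square: squared (observed - expected) over binomial variance
pearson_X2 <- sum((pg_y - pg_m * pg_p)^2 / (pg_m * pg_p * (1 - pg_p)))

# Same quantity from the model's Pearson residuals
c(by_hand = pearson_X2,
  from_residuals = sum(residuals(pg_fit, type = "pearson")^2))
```

When the number of trials per covariate pattern is small, the usual chi-square approximation for this statistic becomes unreliable, which is the sparse-data setting the Farrington approach targets.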

## Practical Guidelines

### When to Use Each Test Mode

1. **Ebrahim-Farrington mode** (`G` specified):
   - Binary response data (0/1)
   - Want automatic grouping
   - Computationally efficient
   - Recommended for most applications

2. **Original Farrington mode** (`m` provided, `G = NULL`):
   - Grouped binomial data
   - Multiple trials per covariate pattern
   - Requires fitted model object

### Choosing the Number of Groups

- **G = 10**: Standard choice, comparable to Hosmer-Lemeshow
- **G = 4-8**: For smaller datasets (n < 200)
- **G = 20+**: For larger datasets (n > 20000)
- **Rule of thumb**: Each group holds about $n / G$ observations, so choose $G$ such that this count stays large enough for stable group-level expected counts.
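A quick way to apply these guidelines is to tabulate the group size implied by a few candidate values of $G$. The sample sizes below are arbitrary illustrations and `group_plan` is a name introduced here.

```{r}
# Approximate observations per group for several (n, G) combinations
group_plan <- expand.grid(n = c(150, 500, 5000), G = c(4, 10, 20))
group_plan$obs_per_group <- floor(group_plan$n / group_plan$G)
group_plan
```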

### Interpreting Results

- **p-value > 0.05**: No evidence of lack of fit (fail to reject H₀)
- **p-value ≤ 0.05**: Evidence of model misspecification (reject H₀)
- **Very small p-values**: Strong evidence against model adequacy

## Advantages over Hosmer-Lemeshow Test

1. **Better Power**: More sensitive to model misspecification
2. **Theoretical Foundation**: Based on rigorous asymptotic theory
3. **Sparse Data Handling**: Specifically designed for fully sparse data
4. **Computational Efficiency**: Simplified calculations for binary data

## Limitations and Considerations

1. **Group Selection**: Results can vary with different numbers of groups
2. **Sample Size**: More reliable with larger sample sizes (n ≥ 100)
3. **Model Complexity**: Performance with highly complex models needs further study

## References

1. Farrington, C. P. (1996). On Assessing Goodness of Fit of Generalized Linear Models to Sparse Data. *Journal of the Royal Statistical Society. Series B (Methodological)*, 58(2), 349–360.

2. Ebrahim, Khaled Ebrahim (2025). Goodness-of-Fits Tests and Calibration Machine Learning Algorithms for Logistic Regression Model with Sparse Data. *Master's Thesis*, Alexandria University.

3. Hosmer, D. W., & Lemeshow, S. (2000). Applied Logistic Regression, Second Edition. New York: Wiley.


4. Hosmer, D. W., & Lemeshow, S. (1980). A goodness-of-fit test for the multiple logistic regression model. *Communications in Statistics - Theory and Methods*, 9(10), 1043–1069. https://doi.org/10.1080/03610928008827941

## Conclusion

The Ebrahim-Farrington test provides a powerful and practical tool for assessing goodness-of-fit in logistic regression, particularly for binary data and sparse datasets. Its simplified implementation makes it accessible for routine use while maintaining strong theoretical foundations.