Quick Facts

Mallows Model Averaging (MMA) (Hansen 2007) combines predictions from multiple models to minimize the mean squared prediction error (MSE), balancing predictive accuracy against model complexity. Hansen demonstrates that MMA is asymptotically optimal, achieving the lowest possible squared error within a class of discrete model averaging estimators.

MMA is particularly useful when we want to account for model uncertainty: it combines estimates from multiple candidate models instead of committing to a single “best” model, which can be risky in the presence of competing plausible models. Unlike model selection methods such as Mallows’ $C$ (often denoted $C_p$), AIC or BIC, which may lead to biased or overly confident predictions, MMA produces robust, stable estimators by weighting models according to their fit and complexity.

MMA operates within the homoskedastic linear regression framework,

$$ y = m(\boldsymbol{X}) + \epsilon, \quad \textup{E}(\epsilon\vert \boldsymbol{X})=0 , \quad \textup{E}(\epsilon^2\vert \boldsymbol{X}) = \sigma^2, $$

with $m(\boldsymbol{X})$ an unknown regression function. Given $M$ candidate models, each with its own regressor matrix $\boldsymbol{X}_m$, the model averaging estimator is defined as:1

$$ \widehat{m}(\boldsymbol{w}) = \sum_{m=1}^M w_m \boldsymbol{P}_m \boldsymbol{Y} = \boldsymbol{P}(\boldsymbol{w}) \boldsymbol{Y}, $$

where $\boldsymbol{P}_m = \boldsymbol{X}_m (\boldsymbol{X}_m’ \boldsymbol{X}_m)^{-1} \boldsymbol{X}_m’$ is the projection matrix for model $m$, and $w_m$, $m=1,\dots,M$, are the weights assigned to the individual models. The residuals of the averaging estimator are given by:

$$ \widehat{e}(\boldsymbol{w}) = (\boldsymbol{I} - \boldsymbol{P}(\boldsymbol{w})) \boldsymbol{Y} = \sum_{m=1}^M w_m (\boldsymbol{I} - \boldsymbol{P}_m) \boldsymbol{Y}. $$

The MMA criterion is

$$ C(\boldsymbol{w}) = \widehat{e}(\boldsymbol{w})’ \widehat{e}(\boldsymbol{w}) + 2\sigma^2 \sum_{m=1}^M w_m K_m, $$

where $K_m$ is the number of estimated parameters (degrees of freedom) of model $m$, and $\sigma^2$ is the error variance, which in practice is replaced by a preliminary estimate (typically the residual variance from the largest candidate model). $C(\boldsymbol{w})$ thus penalizes model complexity via a weighted sum of parameter counts, ensuring a trade-off between fit and simplicity.
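
For intuition, here is a minimal numpy sketch of how $C(\boldsymbol{w})$ could be evaluated by brute force for a given weight vector. The function mma_criterion and its inputs X_list (the candidate regressor matrices) and sigma2 (a preliminary variance estimate) are illustrative names, not part of the original exposition.

import numpy as np

def mma_criterion(w, Y, X_list, sigma2):
    # C(w) = e(w)'e(w) + 2 * sigma^2 * sum_m w_m * K_m
    resid = np.zeros_like(Y, dtype=float)
    penalty = 0.0
    for w_m, X_m in zip(w, X_list):
        P_m = X_m @ np.linalg.solve(X_m.T @ X_m, X_m.T)  # projection matrix of model m
        resid += w_m * (Y - P_m @ Y)                      # w_m * (I - P_m) Y
        penalty += w_m * X_m.shape[1]                     # w_m * K_m
    return resid @ resid + 2 * sigma2 * penalty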

$\boldsymbol{w}$ is for ‘wanted’

The optimal weight vector $\widehat{\boldsymbol{w}}_{\text{MMA}}$ minimizes $C(\boldsymbol{w})$ under the constraints:

$$ \sum_{m=1}^M w_m = 1, \quad w_m \geq 0 \, \forall \, m. $$

This minimization problem falls under quadratic programming, as $C(\boldsymbol{w})$ is quadratic in $\boldsymbol{w}$. We may write this compactly as

$$ \begin{align} \widehat{\boldsymbol{w}}_{\text{MMA}} =\arg\min_{\boldsymbol{w}} \boldsymbol{w}’ \mathbf{D} \boldsymbol{w} + 2\sigma^2 \mathbf{K}’ \boldsymbol{w} \quad \text{s.t.} \quad \mathbf{1}’ \boldsymbol{w} = 1, \quad \boldsymbol{w} \geq \mathbf{0}.\label{eq:qpm} \end{align} $$

Here $\mathbf{D}:=\widehat{E}’\widehat{E}$ with $\widehat{E}:=[\widehat{e}_1, \dots, \widehat{e}_M]$ is an estimate of the covariance matrix of the error terms, computed from the residuals $\widehat{e}_1, \dots, \widehat{e}_M$ of the fitted candidate models. The vector $\mathbf{K}$ stacks the model complexities $K_m$, $m=1,\dots,M$.
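
To see why the quadratic term matches the residual sum of squares in $C(\boldsymbol{w})$, note that the averaging residuals are a linear combination of the individual models’ residuals:

$$ \widehat{e}(\boldsymbol{w}) = \sum_{m=1}^M w_m \widehat{e}_m = \widehat{E}\boldsymbol{w} \quad \Longrightarrow \quad \widehat{e}(\boldsymbol{w})’\widehat{e}(\boldsymbol{w}) = \boldsymbol{w}’\widehat{E}’\widehat{E}\boldsymbol{w} = \boldsymbol{w}’\mathbf{D}\boldsymbol{w}. $$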

The solution to \eqref{eq:qpm} often results in sparsity, with positive weights assigned to only a subset of models.2
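
As an aside, if a dedicated QP solver is not available, \eqref{eq:qpm} can also be handled by a general-purpose optimizer. Below is a rough sketch using scipy.optimize (SLSQP); the helper solve_mma_weights is an illustrative name, and D, K and sigma2 stand for the quantities defined above.

import numpy as np
from scipy.optimize import minimize

def solve_mma_weights(D, K, sigma2):
    # minimize w'Dw + 2 * sigma^2 * K'w over the probability simplex
    M = len(K)
    objective = lambda w: w @ D @ w + 2 * sigma2 * (K @ w)
    constraints = ({'type': 'eq', 'fun': lambda w: np.sum(w) - 1},)
    bounds = [(0, 1)] * M
    w0 = np.full(M, 1 / M)  # equal weights as starting value
    res = minimize(objective, w0, method='SLSQP', bounds=bounds, constraints=constraints)
    return res.x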

A Python example

As an example, we simulate a dataset based on a cubic relationship with some noise,

$$ y = .5 x_1 + 3x_2^3 + \epsilon, \quad \epsilon \sim N(0, 0.5^2). $$

import numpy as np

# seed
np.random.seed(123)

# DGP: Nonlinear relationship with mixed-degree terms
n = 100
x1 = np.random.uniform(-1, 1, n)
x2 = np.random.uniform(-1, 1, n)

y = (
      .5 * x1 
    + 3 * x2**3
    + np.random.normal(0, 0.5, n)
)

Here’s a plot of the true DGP surface along with the simulated data points: The surface shows the function $y = 0.5x_1 + 3x_2^3$ in three dimensions, while the red dots represent our noisy observations.

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# grid of points
x1_grid = np.linspace(-1, 1, 50)
x2_grid = np.linspace(-1, 1, 50)
X1, X2 = np.meshgrid(x1_grid, x2_grid)

Z = 0.5 * X1 + 3 * X2**3

# 3D plot
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
ax.view_init(elev=15, azim=-10)

# surface
srf = ax.plot_surface(X1, X2, Z, cmap='viridis', alpha=0.7)

# simulated data points
ax.scatter(x1, x2, y, c='red', marker='o', s=50, label='Simulated Data')

# labels and colorbar
ax.set_xlabel('x1')
ax.set_ylabel('x2')
ax.set_zlabel('y')
plt.colorbar(srf)
plt.legend()
plt.show()

Figure 1: True Regression Function and Sample Data

Unsurprisingly, the data suggest a nonlinear relationship between ($X_1$, $X_2$) and $y$, motivating the use of a regression model that includes nonlinear transformations of the predictors. Determining which nonlinear transformations to include is a common challenge in model selection. Even when restricting attention to polynomial terms of some fixed maximum degree, the number of possible models (including all submodels!) grows quickly with $p$, the number of predictors.3 This rapid growth of the model space, and hence of the uncertainty surrounding the model selection exercise, can be problematic, especially when the sample size is small.
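
To get a sense of how quickly the model space grows, here is a small back-of-the-envelope calculation; the helper n_poly_terms is an illustrative function that counts the monomials of total degree at most $d$ in $p$ predictors (intercept excluded).

from math import comb

def n_poly_terms(p, d):
    # number of monomials of total degree <= d in p predictors, excluding the intercept
    return comb(p + d, d) - 1

for p_ in (2, 3, 5):
    k = n_poly_terms(p_, 6)
    print(f"p = {p_}: {k} candidate terms, 2**{k} - 1 possible submodels")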

We now demonstrate MMA in Python using functions from numpy, sklearn and quadprog. For this we first generate numpy arrays of training and test splits of the simulated data.

from sklearn.model_selection import train_test_split

# split mock data into train and test sets
X = np.column_stack([x1, x2])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234
)

Next, we fit polynomial regression models with maximum degree ranging from 1 to $p=6$ on the training data and save them in a list models.

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

p = 6  # max poly degree

# fit models
models = [
    LinearRegression().fit(
        PolynomialFeatures(degree=d, include_bias=False).fit_transform(X_train),
        y_train
    )
    for d in range(1, p + 1)
]

To compute the MMA weights, we need the residuals of the fitted candidate models; here, they are computed on the full simulated sample.

# residuals of each candidate model (computed on the full simulated sample)
residuals = []
X = np.column_stack([x1, x2])  # full-sample design matrix

for i, model in enumerate(models):
    poly = PolynomialFeatures(degree=i+1, include_bias=False)
    X_poly = poly.fit_transform(X)
    pred = model.predict(X_poly)
    residuals.append(y - pred)

# n x M matrix of stacked residuals
residuals = np.column_stack(residuals)

The quantities needed to tackle the quadratic programming problem (QPP) for finding the MMA weights (see \eqref{eq:qpm}) are defined next.

# model complexities K_m (here approximated by 2 * max polynomial degree)
model_complexities = 2 * np.arange(1, p + 1)

# D = E'E, computed from the stacked model residuals
D = residuals.T @ residuals

# preliminary variance estimate (here: the sample variance of y) and linear penalty term
sigma_squared = np.var(y)
d = -sigma_squared * model_complexities

# constraints for the QPP: sum-to-one (equality) and non-negativity
M = len(models)
A = np.vstack([np.ones(M), np.eye(M)]).T
b = np.hstack([1, np.zeros(M)])

To solve the QPP with the quadprog library, note that quadprog.solve_qp minimizes an objective of the form $$ \frac{1}{2} \boldsymbol{w}’ \mathbf{D} \boldsymbol{w} - \boldsymbol{d}’ \boldsymbol{w} $$

subject to $\mathbf{A}’ \boldsymbol{w} \geq \boldsymbol{b}$, where the first meq constraints are treated as equalities. Here $\mathbf{D}$ is the estimated covariance matrix of the error terms (computed as residuals.T @ residuals above), and passing $\boldsymbol{d} = -\sigma^2 \mathbf{K}$ (the model complexities scaled by the preliminary variance estimate) makes the minimized objective equal to one half of the MMA criterion, which leaves the minimizer unchanged. The matrix $\mathbf{A}$ combines the sum-to-one constraint with the non-negativity constraints, and $\boldsymbol{b}$ is the corresponding right-hand side vector.

These inputs are passed to quadprog.solve_qp with the matrices $\mathbf{D}$, $\boldsymbol{d}$, $\mathbf{A}$, and $\boldsymbol{b}$ as defined in the code. The solver returns the optimal weight vector $\widehat{\boldsymbol{w}}_{\text{MMA}}$.4

import quadprog

# solve the QPP and get weights
weights = quadprog.solve_qp(D, d, A, b, meq=1)[0]
weights = np.round(weights, 2)

print("MMA weights:", weights)
MMA weights: [ 0.3  0.   0.7  0.  -0.  -0. ]

The optimal weights suggest that MMA relies on the models with maximum degree 1 and 3, i.e. it picks up that linear and cubic terms explain the data best, which aligns with the true underlying DGP. Note that the weighting leads to a sparse combination: the weights for the models with maximum polynomial degrees 2 and 4 are zero (and essentially zero for degrees 5 and 6)!

We next evaluate the individual models’ performance: each model’s predictions are computed on the test set, along with the corresponding mean squared errors. The MMA prediction is then constructed as a weighted average of these predictions using the optimal weights in weights.

from sklearn.metrics import mean_squared_error

# evaluate models on test set
test_predictions = []
test_mse = []

for i, model in enumerate(models):
    poly = PolynomialFeatures(degree=i+1, include_bias=False)
    X_test_poly = poly.fit_transform(X_test)
    pred = model.predict(X_test_poly)
    test_predictions.append(pred)
    mse = mean_squared_error(y_test, pred)
    test_mse.append(mse)

# calculate MMA predictions
test_predictions = np.column_stack(test_predictions)
mma_predictions = test_predictions @ weights
mma_mse = mean_squared_error(y_test, mma_predictions)

# return MSE results in console
print("\nTest Set Mean Squared Errors:")
for i, mse in enumerate(test_mse, 1):
    print(f"Max poly degree {i}: {mse:.4f}")
print(f"MMA: {mma_mse:.4f}")
Test Set Mean Squared Errors:
Max poly degree 1: 0.3556
Max poly degree 2: 0.3691
Max poly degree 3: 0.2729
Max poly degree 4: 0.2773
Max poly degree 5: 0.3313
Max poly degree 6: 0.6844
MMA: 0.2471

Discussion

The results show that MMA achieves lower prediction error than any individual model by optimally combining predictions from the available candidate models:5 Simple linear and quadratic regressions cannot capture the nonlinearity, while models with higher polynomial terms overfit the data – both imply poor generalisation to unseen data, which manifests in higher prediction error. The MSE of the MMA estimator is in fact quite close to the minimal achievable prediction MSE $\sigma^2=.5^2=.25$, which we would attain using the (unknown) true model.6 However, this model is not among our candidate models. So while MMA works well in this setting, note that it is not the framework in which Hansen derives MMA’s optimality properties!

To compare the individual models’ and the MMA MSE graphically, we use a bar plot.

# plot the results
plt.figure(figsize=(10, 6))
plt.bar(range(1, p+1), test_mse, alpha=0.5, label='Individual Models')
plt.axhline(y=mma_mse, color='r', linestyle='--', label='MMA')
plt.xlabel('Polynomial Degree')
plt.ylabel('Mean Squared Prediction Error')
plt.legend()
plt.show()

Figure 2: Test Set Performance – Individual Models vs MMA

… and that’s it for now :-).

References

Hansen, Bruce E. 2007. “Least Squares Model Averaging.” Econometrica 75 (4): 1175–89. https://doi.org/10.1111/j.1468-0262.2007.00785.x.

———. 2014. “Model Averaging, Asymptotic Risk, and Regressor Groups.” Quantitative Economics 5 (3): 495–530. https://doi.org/10.3982/QE332.

———. 2021. Econometrics. University of Wisconsin, Department of Economics. https://books.google.de/books?id=IqTJNQAACAAJ.


  1. The notation is adapted from the less technical treatment of the method in Hansen (2021). ↩︎

  2. For two nested models, the MMA criterion simplifies to a form based on residual sums of squares, easing optimization. For $M>2$, it extends the James-Stein estimator to multiple models, highlighting its link to shrinkage methods (Hansen 2014). ↩︎

  3. Note that computational complexity grows even faster with $p$ if we also account for interactions between regressors. If there are too many candidate models, we could use some pre-selection heuristic that reduces the model space. ↩︎

  4. meq=1 tells the solver that only the first constraint ($\mathbf{1}’\boldsymbol{w} = 1$) is an equality constraint. ↩︎

  5. If you play with different RNG seeds, you may well see less favourable results. This should not surprise: MMA is an estimator that works (better) on average. ↩︎

  6. Note that we face $$\textup{MSE}=\mathbb{E}[(y - \widehat{m}(x))^2] = \text{Bias}^2(\widehat{m}(x)) + \text{Variance}(\widehat{m}(x)) + \sigma^2,$$ where the bias and variance components vanish for correctly specified (or overspecified) polynomial models as $n\to\infty$. In finite samples, we want to keep bias and variance small by choosing (and estimating) a model as “close” to the “true” model as possible. ↩︎