Since the beginning of the year, I've been increasingly interested in understanding the notion of leverage in statistics. For a dataset which is a collection of datapoints, the leverage of a datapoint seeks to quantify the "leverage" (as the name suggests) that that datapoint has on the outcome of a model used to perform inference on the dataset. In other words (and with less redundancy), the leverage of a datapoint quantifies the effect of perturbing that datapoint on the outcome of the model.

Leverage is a well-studied concept for ordinary linear regression. In the standard setup of linear regression, we are given a dataset \(\{(x_{i}, y_{i})\}_{i=1}^{N}\) where \(x_{i} \in \mathbb{R}^{d}\) and \(y_{i} \in \mathbb{R}\) for all \(i\), and the goal is to identify the "line of best fit". The quality of the fit is determined by the sum of squared residuals across all datapoints, \[\mathcal{L}_{\mathrm{OLS}}(\theta) := \sum_{i} (y_{i} - \theta^{\top}x_{i})^{2}~.\] This can also be viewed as uniformly weighting all datapoints, hence the "ordinary" in ordinary least squares. Using \(\mathbf{X} \in \mathbb{R}^{N \times d}\) and \(\mathbf{y} \in \mathbb{R}^{N}\) to collectively represent the covariates and the responses respectively, we can write the ordinary least squares solution as a minimiser of \(\mathcal{L}_{\mathrm{OLS}}\) over all \(\theta \in \mathbb{R}^{d}\): \[\widehat{\theta}_{\mathrm{OLS}} \in \mathop{\mathrm{argmin}}_{\theta \in \mathbb{R}^{d}}~~ \mathcal{L}_{\mathrm{OLS}}(\theta) := \|\mathbf{y} - \mathbf{X}\theta\|^{2}_{2}~.\]

Note the careful mention of \(\widehat{\theta}_{\mathrm{OLS}}\) as "a" minimiser. This is because, by optimality, \(\widehat{\theta}_{\mathrm{OLS}}\) satisfies \((\mathbf{X}^{\top}\mathbf{X})\widehat{\theta}_{\mathrm{OLS}} = \mathbf{X}^{\top}\mathbf{y}\). When \(\mathbf{X}^{\top}\mathbf{X}\) is not full rank, this linear equation has multiple solutions, hence \(\widehat{\theta}_{\mathrm{OLS}}\) is "a" minimiser; this occurs, for instance, when the system is underspecified (fewer datapoints than dimensions, \(N < d\)).
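To spell out where this optimality condition comes from, setting the gradient of \(\mathcal{L}_{\mathrm{OLS}}\) to zero gives the usual normal equations: \[\nabla_{\theta}\mathcal{L}_{\mathrm{OLS}}(\theta) = -2\,\mathbf{X}^{\top}(\mathbf{y} - \mathbf{X}\theta) = 0 \quad\Longleftrightarrow\quad (\mathbf{X}^{\top}\mathbf{X})\,\theta = \mathbf{X}^{\top}\mathbf{y}~.\]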

Nonetheless, assuming uniqueness of the minimiser, we have that this minimiser is \(\widehat{\theta}_{\mathrm{OLS}} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}\). The prediction for the dataset is \(\widehat{\mathbf{y}} = \mathbf{X}\widehat{\theta}_{\mathrm{OLS}}\). With this, the leverage \(h_{i}\) of the \(i^{th}\) datapoint is defined as the infinitesimal change in the prediction for that datapoint caused by an infinitesimal change in the actual response \(y_{i}\). Mathematically, \[h_{i} = \frac{\partial \widehat{y}_{i}}{\partial y_{i}}~.\] Due to the closed form expression for \(\widehat{y}_{i}\) from above, we have \(h_{i} = x_{i}^{\top}(\mathbf{X}^{\top}\mathbf{X})^{-1}x_{i}\). By squinting hard enough, one can see that this happens to be the \(i^{th}\) diagonal element of the matrix \(\mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\). As a result, we have an understanding of certain characteristics of the sequence \(\{h_{i}\}\), which are:

  1. Each leverage score is bounded as \(0 \leq h_{i} \leq 1\), since \(\mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\) is an orthogonal projection matrix.

  2. The leverage scores sum to the dimension: \(\sum_{i} h_{i} = \mathrm{tr}\left(\mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\right) = d\).
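These two characteristics are easy to verify numerically. Below is a minimal NumPy sketch (with an arbitrary random dataset, so the specific numbers are only illustrative) that computes the leverage scores as the diagonal of \(\mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\) and checks both properties.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 5
X = rng.standard_normal((N, d))   # random covariates; full column rank almost surely

# Hat matrix H = X (X^T X)^{-1} X^T; its diagonal holds the leverage scores.
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

assert np.all(h >= 0) and np.all(h <= 1 + 1e-12)   # 0 <= h_i <= 1
assert np.isclose(h.sum(), d)                      # sum of leverages equals d
```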

In this note, I try to understand how these characteristics change in the presence of regularisation. In particular, define the regularised OLS solution as \[\widehat{\theta}_{\mathrm{rOLS}} = \mathop{\mathrm{argmin}}_{\theta \in \mathbb{R}^{d}} \|\mathbf{y} - \mathbf{X}\theta\|_{2}^{2} + \sigma \cdot \|\theta\|^{2}_{2}~.\] Here the solution is unique for any \(\sigma > 0\), since the objective is strongly convex. The closed form expression for this minimiser is \(\widehat{\theta}_{\mathrm{rOLS}} = (\mathbf{X}^{\top}\mathbf{X} + \sigma \cdot \mathbb{I}_{d})^{-1}\mathbf{X}^{\top}\mathbf{y}\). The leverage for the \(i^{th}\) datapoint for this model is given by \(h_{i} = x_{i}^{\top}(\mathbf{X}^{\top}\mathbf{X} + \sigma \cdot \mathbb{I}_{d})^{-1}x_{i}\), which is the \(i^{th}\) diagonal element of \(\mathbf{X}(\mathbf{X}^{\top}\mathbf{X} + \sigma \cdot \mathbb{I}_{d})^{-1}\mathbf{X}^{\top}\).
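As a small sketch of how these regularised leverage scores might be computed in practice (the random data and the value of \(\sigma\) below are arbitrary choices, not anything prescribed above), one can either form the full \(N \times N\) matrix or compute \(x_{i}^{\top}(\mathbf{X}^{\top}\mathbf{X} + \sigma \cdot \mathbb{I}_{d})^{-1}x_{i}\) row by row; both give the same numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma = 50, 5, 2.0
X = rng.standard_normal((N, d))

# Regularised hat matrix M = X (X^T X + sigma I)^{-1} X^T.
A_inv = np.linalg.inv(X.T @ X + sigma * np.eye(d))
h_from_diag = np.diag(X @ A_inv @ X.T)

# Equivalent row-wise computation: h_i = x_i^T (X^T X + sigma I)^{-1} x_i.
h_rowwise = np.einsum('ij,jk,ik->i', X, A_inv, X)

assert np.allclose(h_from_diag, h_rowwise)
```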

We deduce the following properties about \(\{h_{i}\}\) for the regularised OLS model. Let the eigendecomposition of \(\mathbf{X}^{\top}\mathbf{X}\) be \(\overline{U}~\overline{\Lambda}~\overline{U}^{\top}\).

Lemma 1: The first \(d\) eigenvalues of \(\mathbf{X}(\mathbf{X}^{\top}\mathbf{X} + \sigma \cdot \mathbb{I}_{d})^{-1}\mathbf{X}^{\top}\) are \(\left\{\frac{\overline{\Lambda}_{ii}}{\overline{\Lambda}_{ii} + \sigma}\right\}_{i=1}^{d}\). The remaining \(N - d\) eigenvalues are \(0\).

Proof: With the eigendecomposition of \(\mathbf{X}^{\top}\mathbf{X}\) (assume for this argument that \(\mathbf{X}\) has full column rank, so that \(\overline{\Lambda}\) is invertible), the matrix we are interested in can be expressed as \begin{align*} M &:= \mathbf{X}(\mathbf{X}^{\top}\mathbf{X} + \sigma \cdot \mathbb{I}_{d})^{-1}\mathbf{X}^{\top} \\ &= \mathbf{X}(\overline{U}~\overline{\Lambda}~\overline{U}^{\top} + \sigma \cdot \mathbb{I}_{d})^{-1}\mathbf{X}^{\top} \\ &= \mathbf{X}\overline{U}(\overline{\Lambda} + \sigma \cdot \mathbb{I}_{d})^{-1}\overline{U}^{\top}\mathbf{X}^{\top}~. \end{align*} Note that the columns of \(\mathbf{X}\overline{U} = \{\mathbf{X}\overline{u}_{i}\}_{i=1}^{d}\) are orthogonal. This can be shown by direct calculation: \((\mathbf{X}\overline{u}_{i})^{\top}(\mathbf{X}\overline{u}_{j}) = \overline{u}_{i}^{\top}\mathbf{X}^{\top}\mathbf{X}\overline{u}_{j} = \overline{u}_{i}^{\top}\overline{U}~\overline{\Lambda}~\overline{U}^{\top}\overline{u}_{j} = e_{i}^{\top}\overline{\Lambda}e_{j}\). When \(i \neq j\), this is \(0\) as \(\overline{\Lambda}\) is diagonal. When \(i = j\), this is \(\overline{\Lambda}_{ii}\), so the columns have norm \(\sqrt{\overline{\Lambda}_{ii}}\) rather than \(1\). This rescaling can be incorporated by multiplying by \(\overline{\Lambda}^{-\frac{1}{2}}\) on both sides like so: \(M = \mathbf{X}\overline{U}\sqrt{\overline{\Lambda}^{-1}}~\sqrt{\overline{\Lambda}}~(\overline{\Lambda} + \sigma \cdot \mathbb{I}_{d})^{-1}\sqrt{\overline{\Lambda}}~\sqrt{\overline{\Lambda}^{-1}}\overline{U}^{\top}\mathbf{X}^{\top}\). The matrix \(\mathbf{X}\overline{U}\sqrt{\overline{\Lambda}^{-1}}\) has orthonormal columns, which form an orthonormal basis for the column space of \(\mathbf{X}\); we denote these \(d\) columns by \(\widetilde{u}_{i}\). This decomposition can be expressed as \[M = \sum_{i=1}^{d}\frac{\overline{\Lambda}_{ii}}{\overline{\Lambda}_{ii} + \sigma} \widetilde{u}_{i}\widetilde{u}_{i}^{\top} = \sum_{i=1}^{d}\frac{\overline{\Lambda}_{ii}}{\overline{\Lambda}_{ii} + \sigma} \widetilde{u}_{i}\widetilde{u}_{i}^{\top} + \sum_{j=1}^{N - d} 0 \cdot \widetilde{v}_{j}\widetilde{v}_{j}^{\top}~,\] where \(\{\widetilde{v}_{j}\}_{j=1}^{N-d}\) is an orthonormal basis of the orthogonal complement of the column space of \(\mathbf{X}\) (that is, the null space of \(\mathbf{X}^{\top}\)). Hence, this forms an eigendecomposition, and the eigenvalues can be read off directly from the sum above.
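Lemma 1 is also easy to sanity-check numerically. The sketch below uses an arbitrary random \(\mathbf{X}\) and an arbitrary \(\sigma\); `eigvalsh` is used since the matrices involved are symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma = 50, 5, 2.0
X = rng.standard_normal((N, d))

M = X @ np.linalg.inv(X.T @ X + sigma * np.eye(d)) @ X.T
eig_M = np.sort(np.linalg.eigvalsh(M))[::-1]       # eigenvalues of M, descending

lam = np.linalg.eigvalsh(X.T @ X)                  # eigenvalues Lambda_ii of X^T X
expected = np.sort(lam / (lam + sigma))[::-1]      # Lambda_ii / (Lambda_ii + sigma)

assert np.allclose(eig_M[:d], expected)            # first d eigenvalues match Lemma 1
assert np.allclose(eig_M[d:], 0.0)                 # remaining N - d eigenvalues are 0
```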

Corollary 1: The leverage scores \(h_{i}\) for regularised OLS are bounded as \(0 \leq h_{i} \leq \max\limits_{1 \leq j \leq d} \frac{\overline{\Lambda}_{jj}}{\overline{\Lambda}_{jj} + \sigma}\).

Proof: Note that the leverage score \(h_{i}\) is the \(i^{th}\) diagonal element of \(M\) above, i.e., \(h_{i} = e_{i}^{\top}Me_{i}\). This is non-negative since \(M\) is positive semi-definite (its eigenvalues, from Lemma 1, are non-negative), and it is bounded above by the largest eigenvalue of \(M\): \[e_{i}^{\top}Me_{i} \leq \sup\limits_{\|u\|_{2} \leq 1} u^{\top}Mu = \max\limits_{1 \leq j \leq d} \frac{\overline{\Lambda}_{jj}}{\overline{\Lambda}_{jj} + \sigma}~.\]

Corollary 2: The sum of leverage scores \(h_{i}\) for regularised OLS is exactly \(\sum\limits_{i=1}^{d}\frac{\overline{\Lambda}_{ii}}{\overline{\Lambda}_{ii} + \sigma}\).

Proof: The sum of leverage scores is the sum of diagonal elements of \(M\), which is its trace. The trace is in turn the sum of the eigenvalues of \(M\), which by Lemma 1 is \(\sum_{i=1}^{d}\frac{\overline{\Lambda}_{ii}}{\overline{\Lambda}_{ii} + \sigma}\).
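Both corollaries can be checked with the same kind of numerical sketch as before (again with arbitrary random data and an arbitrary \(\sigma\)):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma = 50, 5, 2.0
X = rng.standard_normal((N, d))

M = X @ np.linalg.inv(X.T @ X + sigma * np.eye(d)) @ X.T
h = np.diag(M)
lam = np.linalg.eigvalsh(X.T @ X)                  # eigenvalues Lambda_ii of X^T X

# Corollary 1: 0 <= h_i <= max_j Lambda_jj / (Lambda_jj + sigma).
assert np.all(h >= 0)
assert np.all(h <= (lam / (lam + sigma)).max() + 1e-12)

# Corollary 2: sum_i h_i = sum_j Lambda_jj / (Lambda_jj + sigma).
assert np.isclose(h.sum(), (lam / (lam + sigma)).sum())
```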


Another formulation of linear regression is weighted least squares, wherein the contribution of each datapoint to the residual is weighted. For a sequence of \(N\) non-negative weights \(\{w_{i}\}_{i=1}^{N}\), regularised weighted least squares is defined by the minimisation problem \[\widehat{\theta}_{\mathrm{rWLS}} \in \mathop{\mathrm{argmin}}_{\theta \in \mathbb{R}^{d}} \sum_{i=1}^{N}w_{i} \cdot (y_{i} - x_{i}^{\top}\theta)^{2} + \sigma \cdot \|\theta\|^{2}_{2} = \|\mathbf{y} - \mathbf{X}\theta\|_{\mathbf{W}}^{2} + \sigma \cdot \|\theta\|^{2}_{2}~.\] Above, \(\mathbf{W} \in \mathbb{R}^{N \times N}\) is a diagonal matrix whose diagonal entries are \(\{w_{i}\}_{i=1}^{N}\), and \(\|v\|_{A}^{2} := v^{\top}Av\). When \(\sigma > 0\), the solution \(\widehat{\theta}_{\mathrm{rWLS}}\) is unique for the same reason that \(\widehat{\theta}_{\mathrm{rOLS}}\) is unique. This solution has the closed form expression \(\widehat{\theta}_{\mathrm{rWLS}} = (\mathbf{X}^{\top}\mathbf{W}\mathbf{X} + \sigma \cdot \mathbb{I}_{d})^{-1}\mathbf{X}^{\top}\mathbf{W}\mathbf{y}\).

The leverage for the \(i^{th}\) datapoint in this model is given by \(h_{i} = x_{i}^{\top}(\mathbf{X}^{\top}\mathbf{W}\mathbf{X} + \sigma \cdot \mathbb{I}_{d})^{-1}w_{i}x_{i}\). With \(\widetilde{x}_{i} = \sqrt{w_{i}} \cdot x_{i}\) and \(\widetilde{\mathbf{X}} = \mathbf{W}^{\frac{1}{2}}\mathbf{X}\) (the matrix whose \(i^{th}\) row is \(\widetilde{x}_{i}^{\top}\)), this can be written as \(h_{i} = \widetilde{x}_{i}^{\top}(\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}} + \sigma \cdot \mathbb{I}_{d})^{-1}\widetilde{x}_{i}\). Let \(\{\widetilde{\Lambda}_{jj}\}_{j=1}^{d}\) be the eigenvalues of \(\mathbf{X}^{\top}\mathbf{W}\mathbf{X} = \widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}}\). We have the following corollaries for the leverage scores of this model; the proofs are identical to those above with \(\mathbf{X}\) replaced by \(\widetilde{\mathbf{X}}\).

Corollary 3: The leverage scores \(h_{i}\) for regularised weighted least squares are bounded as \(0 \leq h_{i} \leq \max\limits_{1 \leq j \leq d} \frac{\widetilde{\Lambda}_{jj}}{\widetilde{\Lambda}_{jj} + \sigma}\).

Corollary 4: The sum of leverage scores \(h_{i}\) for regularised weighted least squares is exactly \(\sum\limits_{i = 1}^{d} \frac{\widetilde{\Lambda}_{ii}}{\widetilde{\Lambda}_{ii} + \sigma}\).
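The reduction to the unweighted case via the rescaling \(\widetilde{x}_{i} = \sqrt{w_{i}} \cdot x_{i}\) is also straightforward to exercise numerically; the sketch below (arbitrary random data, arbitrary positive weights and \(\sigma\)) checks Corollaries 3 and 4.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma = 50, 5, 2.0
X = rng.standard_normal((N, d))
w = rng.uniform(0.5, 2.0, size=N)          # arbitrary positive weights

X_tilde = np.sqrt(w)[:, None] * X          # rows are sqrt(w_i) * x_i
A_inv = np.linalg.inv(X_tilde.T @ X_tilde + sigma * np.eye(d))
h = np.einsum('ij,jk,ik->i', X_tilde, A_inv, X_tilde)

lam = np.linalg.eigvalsh(X.T @ (w[:, None] * X))   # eigenvalues of X^T W X

# Corollary 3: bounds on the leverage scores.
assert np.all(h >= 0)
assert np.all(h <= (lam / (lam + sigma)).max() + 1e-12)

# Corollary 4: sum of the leverage scores.
assert np.isclose(h.sum(), (lam / (lam + sigma)).sum())
```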


There are some questions that one could ask about this definition of leverage:

  1. Does the definition of \(h_{i}\) seem too "selfish"?

For 1., philosophically, it would be more appropriate to understand the maximal effect of an infinitesimal change in \(y_{i}\) on the prediction across all datapoints. That is to say, \[\widetilde{h}_{i} = \max_{j} \left|\frac{\partial \widehat{y}_{j}}{\partial y_{i}}\right|\] would seem like a more appropriate notion of the leverage of datapoint \(i\) (the absolute value is to ensure that it is non-negative). For the linear regression models above, interestingly, this quantity cannot be arbitrarily larger than \(h_{i}\). This is due to the form of \(\frac{\partial \widehat{y}_{j}}{\partial y_{i}}\), which for regularised weighted least squares (the most general model above) is given by \[\frac{\partial\widehat{y}_{j}}{\partial y_{i}} = w_{i} \cdot x_{j}^{\top} (\mathbf{X}^{\top}\mathbf{W}\mathbf{X} + \sigma \cdot \mathbb{I}_{d})^{-1}x_{i} = \sqrt{\tfrac{w_{i}}{w_{j}}} \cdot \widetilde{x}_{j}^{\top} (\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}} + \sigma \cdot \mathbb{I}_{d})^{-1}\widetilde{x}_{i}~,\] where the weight ratio disappears for regularised OLS (\(\mathbf{W} = \mathbb{I}_{N}\)). By the Cauchy-Schwarz inequality \(|a^{\top}b| \leq \|a\|_{2} \cdot \|b\|_{2}\) applied to \(a = (\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}} + \sigma \cdot \mathbb{I}_{d})^{-\frac{1}{2}}\widetilde{x}_{i}\) and \(b = (\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}} + \sigma \cdot \mathbb{I}_{d})^{-\frac{1}{2}}\widetilde{x}_{j}\), \[\left|\widetilde{x}_{j}^{\top} (\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}} + \sigma \cdot \mathbb{I}_{d})^{-1}\widetilde{x}_{i}\right| \leq \sqrt{\widetilde{x}_{i}^{\top} (\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}} + \sigma \cdot \mathbb{I}_{d})^{-1}\widetilde{x}_{i}} \cdot \sqrt{\widetilde{x}_{j}^{\top} (\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}} + \sigma \cdot \mathbb{I}_{d})^{-1}\widetilde{x}_{j}} = \sqrt{h_{i} \cdot h_{j}}~.\] In particular, for regularised OLS, \(\widetilde{h}_{i} \leq \max_{j}\sqrt{h_{i} \cdot h_{j}} \leq \sqrt{h_{i} \cdot \max_{j} h_{j}}\), so a datapoint with a small leverage score \(h_{i}\) also has a small \(\widetilde{h}_{i}\); in the weighted case, the same bound holds up to the factor \(\sqrt{w_{i}/w_{j}}\).
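To close, here is a quick numerical check of the entrywise bound \(\left|\frac{\partial \widehat{y}_{j}}{\partial y_{i}}\right| \leq \sqrt{h_{i} \cdot h_{j}}\) in the regularised OLS case (a sketch with arbitrary random data and an arbitrary \(\sigma\)):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma = 50, 5, 2.0
X = rng.standard_normal((N, d))
A_inv = np.linalg.inv(X.T @ X + sigma * np.eye(d))

# Cross-sensitivities for regularised OLS: S[j, i] = d yhat_j / d y_i = x_j^T A^{-1} x_i.
S = X @ A_inv @ X.T
h = np.diag(S)                              # the usual leverage scores

# Entrywise Cauchy-Schwarz bound: |S[j, i]| <= sqrt(h_j * h_i).
assert np.all(np.abs(S) <= np.sqrt(np.outer(h, h)) + 1e-12)

# Hence the "less selfish" leverage is controlled by the usual one.
h_tilde = np.abs(S).max(axis=0)             # h_tilde_i = max_j |d yhat_j / d y_i|
assert np.all(h_tilde <= np.sqrt(h * h.max()) + 1e-12)
```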