The problem that Monte Carlo integration tackles can be generally stated as: given a function \(f : \mathcal{X} \to \bbR\) over a domain \(\mathcal{X} \subseteq \bbR^{d}\), and a distribution \(\pi\) over \(\mathcal{X}\), compute the expectation \(\bbE_{\sfx \sim \pi}[f(\sfx)] =: \mathcal{Q}(f; \pi)\).
Monte Carlo integration performs this by drawing \(N\) independent samples \(\{x^{(i)}\}_{i=1}^{N}\) from \(\pi\), and computing the estimate \begin{equation*} \widehat{\mathcal{Q}}_{N}(f; \pi) = \frac{1}{N}\sum_{i=1}^{N} f(x^{(i)})~. \end{equation*} By the law of large numbers, under suitable regularity conditions on \(f\) and \(\pi\), \(\widehat{\mathcal{Q}}_{N}(f; \pi) \to \mathcal{Q}(f; \pi)\) as \(N \to \infty\). From a more practical standpoint, we need non-asymptotic rates, i.e., how quickly the error between \(\widehat{\mathcal{Q}}_{N}(f; \pi)\) and \(\mathcal{Q}(f; \pi)\) decays as a function of \(N\). Through a Chebyshev inequality argument, one can show that the absolute error between \(\mathcal{Q}(f; \pi)\) and \(\widehat{\mathcal{Q}}_{N}(f; \pi)\) (an unbiased estimator of \(\mathcal{Q}(f; \pi)\)) decays as \(\frac{1}{\sqrt{N}}\) (note that this is a probabilistic statement).
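To make the \(\frac{1}{\sqrt{N}}\) behaviour concrete, here is a minimal sketch of plain Monte Carlo in Python. The setup (\(\pi = \mathcal{N}(0,1)\), \(f(x) = x^{2}\), so that \(\mathcal{Q}(f; \pi) = 1\)) is a toy example of my own choosing, not one from the note.

```python
import random

def mc_estimate(f, sampler, n, seed=0):
    """Plain Monte Carlo: average f over n i.i.d. draws from the sampler."""
    rng = random.Random(seed)
    return sum(f(sampler(rng)) for _ in range(n)) / n

# Toy setup: pi = N(0, 1) and f(x) = x^2, so the true value Q(f; pi) = E[x^2] = 1.
f = lambda x: x * x
sampler = lambda rng: rng.gauss(0.0, 1.0)

err_small = abs(mc_estimate(f, sampler, 100) - 1.0)
err_large = abs(mc_estimate(f, sampler, 10_000) - 1.0)
# Typically err_large is about 10x smaller than err_small, consistent with
# the 1/sqrt(N) rate -- a probabilistic statement, not a per-run guarantee.
```

Since the rate is probabilistic, any single run may deviate from the 10x ratio, but the error at \(N = 10{,}000\) is reliably much smaller than at \(N = 100\).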
In this note, we investigate how different \(\widehat{\mathcal{Q}}_{N}(f; \mu)\) can be from \(\mathcal{Q}(f; \pi)\) -- there is no typographical error here, and we are indeed curious to see how samples from another distribution \(\mu\) (also supported over \(\mathcal{X}\)) can be used to estimate \(\mathcal{Q}(f; \pi)\). More specifically, separating out the statistical error (the "\(\frac{1}{\sqrt{N}}\)" term), the goal is to understand how properties of \(f\) interplay with notions of closeness between \(\mu\) and \(\pi\) to obtain non-asymptotic rates. This is indeed progress in the right direction since by the triangle inequality \begin{equation*} \left|\mathcal{Q}(f; \pi) - \widehat{\mathcal{Q}}_{N}(f; \mu)\right| \leq \underbrace{\left|\mathcal{Q}(f; \mu) - \widehat{\mathcal{Q}}_{N}(f; \mu)\right|}_{\text{Statistical Error}} + \underbrace{\left|\mathcal{Q}(f; \pi) - \mathcal{Q}(f; \mu)\right|}_{\text{Misspecification Error}}~.\end{equation*}
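The decomposition above can be seen numerically. In the following sketch (a toy example of my own: \(\pi = \mathcal{N}(0,1)\), \(\mu = \mathcal{N}(0.5, 1)\), \(f(x) = x\)), the statistical error shrinks with \(N\), while the misspecification error is a fixed constant no amount of sampling can remove.

```python
import random

def sample_mean(f, sampler, n, seed=1):
    """Monte Carlo average of f over n i.i.d. draws."""
    rng = random.Random(seed)
    return sum(f(sampler(rng)) for _ in range(n)) / n

# pi = N(0,1) and mu = N(0.5,1) with f(x) = x, so Q(f; pi) = 0 and
# Q(f; mu) = 0.5. Sampling from mu, the estimate converges to 0.5, not 0.
f = lambda x: x
est = sample_mean(f, lambda rng: rng.gauss(0.5, 1.0), 100_000)

stat_err = abs(est - 0.5)     # |Q(f; mu) - Q_hat_N(f; mu)|, decays like 1/sqrt(N)
misspec_err = abs(0.0 - 0.5)  # |Q(f; pi) - Q(f; mu)|, fixed at 0.5 for every N
```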
We first present the two notions of closeness that we highlight in this note. These have been widely studied and have a rich history; readers are encouraged to visit the pages hyperlinked with the definitions.
Def. Total Variation: The total variation (abbrev. TV) distance between \(\mu\) and \(\pi\) is defined as \begin{equation*} \mathrm{TV}(\mu, \pi) := \sup_{A \subseteq \mathcal{X}} \mu(A) - \pi(A)~, \end{equation*} where the supremum ranges over measurable subsets \(A\) of \(\mathcal{X}\).
Def. Total Variation (dual): Equivalently, \(\mathrm{TV}(\mu, \pi)\) can be defined as \begin{equation*} \mathrm{TV}(\mu, \pi) := \frac{1}{2}\sup_{g : \|g\|_{\infty} \leq 1} \bbE_{\sfx \sim \mu}[g(\sfx)] - \bbE_{\sfx \sim \pi}[g(\sfx)] \end{equation*} where \(\{g : \|g\|_{\infty} \leq 1\}\) is the class of all functions that are \(1\)-bounded i.e., for all \(x \in \mathcal{X}\), \(|g(x)| \leq 1\); the factor \(\frac{1}{2}\) makes this agree with the set-based definition above.
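For distributions on a finite support, the supremum over sets is attained at \(A = \{x : \mu(x) > \pi(x)\}\), which yields the familiar half-\(\ell_{1}\) formula. A small sketch (dict-based representation is my own choice):

```python
def tv_distance(mu, pi):
    """TV between two distributions on a common finite support (dicts)."""
    support = set(mu) | set(pi)
    # The supremum over sets is attained at A = {x : mu(x) > pi(x)},
    # giving TV = (1/2) * sum_x |mu(x) - pi(x)|.
    return 0.5 * sum(abs(mu.get(x, 0.0) - pi.get(x, 0.0)) for x in support)

mu = {"a": 0.5, "b": 0.5}
pi = {"a": 0.8, "b": 0.2}
# tv_distance(mu, pi) = 0.5 * (0.3 + 0.3) = 0.3, matching mu(A) - pi(A)
# for the maximising set A = {"b"}.
```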
Def. 1-Wasserstein: The 1-Wasserstein (abbrev. \(W_{1}\)) distance between \(\mu\) and \(\pi\) is defined as \begin{equation*} \mathrm{W}_{1}(\mu, \pi) := \inf_{\gamma \in \Gamma(\mu, \pi)} \bbE_{(\sfx, \sfy) \sim \gamma}[\|\sfx - \sfy\|] \end{equation*} where \(\Gamma(\mu, \pi)\) is the collection of all joint distributions whose marginals are \(\mu\) and \(\pi\) respectively.
Def. 1-Wasserstein (dual): Equivalently, \(\mathrm{W}_{1}(\mu, \pi)\) can be defined as \begin{equation*} \mathrm{W}_{1}(\mu, \pi) := \sup_{g : \|g\|_{\mathrm{Lip}} \leq 1} \bbE_{\sfx \sim \mu}[g(\sfx)] - \bbE_{\sfx \sim \pi}[g(\sfx)] \end{equation*} where \(\{g : \|g\|_{\mathrm{Lip}} \leq 1\}\) is the class of all functions that are \(1\)-Lipschitz continuous i.e., for any \(x, y \in \mathcal{X}\), \(|g(x) - g(y)| \leq \|x - y\|\).
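In one dimension, the infimum over couplings is attained by matching quantiles, so the empirical \(W_{1}\) between two equal-size samples reduces to a mean absolute gap between sorted values. A quick sketch of this special case (the function name is my own):

```python
def w1_from_samples(xs, ys):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples.

    In 1-D the optimal coupling matches sorted values (the quantile
    coupling), so W1 reduces to a mean absolute gap.
    """
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

# Shifting a sample by a constant c moves it by exactly W1 = |c|.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [x + 0.25 for x in xs]
```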
From the dual definitions of the \(\mathrm{TV}\) and \(\mathrm{W}_{1}\) distances, one might surmise rather straightforwardly that the error between \(\mathcal{Q}(f; \pi)\) and \(\widehat{\mathcal{Q}}_{N}(f; \mu)\) is related to the function classes in these definitions. More precisely, focusing on the misspecification error: if \(f\) is \(B\)-bounded, then \(|\mathcal{Q}(f; \pi) - \mathcal{Q}(f; \mu)| \leq 2B\,\mathrm{TV}(\mu, \pi)\), and if \(f\) is \(L\)-Lipschitz, then \(|\mathcal{Q}(f; \pi) - \mathcal{Q}(f; \mu)| \leq L\,\mathrm{W}_{1}(\mu, \pi)\), simply by rescaling \(f\) to lie in the corresponding function class.
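As a quick numerical sanity check of the bounded-\(f\) route: for any \(f\) with \(|f| \leq B\), the misspecification error is at most \(2B\,\mathrm{TV}(\mu, \pi)\). The discrete example below is a toy of my own construction.

```python
def expectation(f, p):
    """E_{x ~ p}[f(x)] for a finite-support distribution given as a dict."""
    return sum(prob * f(x) for x, prob in p.items())

def tv(mu, pi):
    """Half-L1 formula for TV on a finite support."""
    keys = set(mu) | set(pi)
    return 0.5 * sum(abs(mu.get(k, 0.0) - pi.get(k, 0.0)) for k in keys)

# Toy example: f is bounded by B = 2.
f = {0: 2.0, 1: -1.0, 2: 0.5}.__getitem__
mu = {0: 0.2, 1: 0.5, 2: 0.3}
pi = {0: 0.4, 1: 0.4, 2: 0.2}

gap = abs(expectation(f, mu) - expectation(f, pi))  # = 0.45 here
bound = 2 * 2.0 * tv(mu, pi)                        # 2 * B * TV = 0.8 here
# gap <= bound holds for every bounded f; by the dual form of TV, the
# bound is attained by some function with |g| <= B.
```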
Def. \(\alpha\)-Rényi divergence: The \(\alpha\)-Rényi divergence (abbrev. \(\mathrm{D}_{\alpha}\)) of \(\mu\) w.r.t. \(\pi\) is defined, for \(\alpha \in (0, 1) \cup (1, \infty)\), as \begin{equation*} \mathrm{D}_{\alpha}(\mu \| \pi) := \frac{1}{\alpha - 1}\log \bbE_{\sfx \sim \pi}\left[r(\sfx)^{\alpha}\right] \end{equation*} where \(r(x)\) is the Radon-Nikodym derivative of \(\mu\) w.r.t. \(\pi\).
A limiting case of the \(\alpha\)-Rényi divergence as \(\alpha \to 1\) is the more popular Kullback-Leibler divergence.
Def. Kullback-Leibler divergence: The Kullback-Leibler (abbrev. KL) divergence of \(\mu\) w.r.t. \(\pi\) is defined as \begin{equation*} \mathrm{KL}(\mu \| \pi) := \bbE_{\sfx \sim \pi}[r(\sfx) \log r(\sfx)] \end{equation*} where \(r(x)\) is the Radon-Nikodym derivative of \(\mu\) w.r.t. \(\pi\).
The quantity \(\mathrm{KL}(\mu \| \pi)\) is the relative entropy of \(\mu\) w.r.t. \(\pi\), and provides a sense of how much the ratio \(r(\sfx)\) fluctuates when \(\sfx \sim \pi\). Another proxy for this fluctuation is the variance of \(r(\sfx)\) under \(\pi\) (note that \(\bbE_{\sfx \sim \pi}[r(\sfx)] = 1\)), which is called the \(\chi^{2}\) divergence.
Def. \(\chi^{2}\) divergence: The \(\chi^{2}\) divergence of \(\mu\) w.r.t. \(\pi\) is defined as \begin{equation*} \chi^{2}(\mu \| \pi) := \bbE_{\sfx \sim \pi}[(r(\sfx) - 1)^{2}] \end{equation*} where \(r(x)\) is the Radon-Nikodym derivative of \(\mu\) w.r.t. \(\pi\).
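For finite supports, all three divergences are direct sums over the ratio \(r\). The sketch below (my own helper, with a toy pair \(\mu, \pi\)) also checks the standard identity \(\exp(\mathrm{D}_{2}(\mu \| \pi)) - 1 = \chi^{2}(\mu \| \pi)\), which follows since \(\bbE_{\pi}[r] = 1\).

```python
import math

def divergences(mu, pi, alpha=2.0):
    """KL, chi^2, and alpha-Renyi divergences of mu w.r.t. pi on a finite
    support, via the ratio r(x) = mu(x)/pi(x) (requires pi(x) > 0)."""
    kl = chi2 = r_alpha = 0.0
    for x, p in pi.items():
        r = mu.get(x, 0.0) / p
        if r > 0:
            kl += p * r * math.log(r)      # E_pi[r log r]
        chi2 += p * (r - 1.0) ** 2         # E_pi[(r - 1)^2]
        r_alpha += p * r ** alpha          # E_pi[r^alpha]
    renyi = math.log(r_alpha) / (alpha - 1.0)
    return kl, chi2, renyi

mu = {0: 0.3, 1: 0.7}
pi = {0: 0.5, 1: 0.5}
kl, chi2, renyi = divergences(mu, pi, alpha=2.0)
# For alpha = 2: exp((alpha - 1) * D_alpha) - 1 equals chi^2 exactly.
```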
Interestingly, \(\mathrm{KL}\) satisfies an elegant dual form -- this is referred to as the Donsker-Varadhan dual characterisation, and can be proven using the more general version that appears as Eq. 5.1.2 in the textbook by Bakry, Gentil, and Ledoux: for any suitable \(g\), \begin{equation*} \bbE_{\sfx \sim \mu}[g(\sfx)] \leq \mathrm{KL}(\mu \| \pi) + \log \bbE_{\sfx \sim \pi}\left[e^{g(\sfx)}\right]~. \end{equation*} While directly applying this with \(g = \lambda f\) seems enticing, the resulting bound would not vanish when \(\mu = \pi\), i.e., it carries a bias. We can get around this when \((f, \pi)\) is suitably nice.
Lemma 1: Let \((f, \pi)\) be a pair for which there exists \(C > 0\) such that \(\bbE_{\sfx \sim \pi}[e^{\lambda (f(\sfx) - \bbE_{\sfx \sim \pi}[f(\sfx)])}] \leq e^{C\lambda^{2}}\) for all \(\lambda \in \bbR\) (i.e., \(f(\sfx)\) is sub-Gaussian under \(\pi\)). Then \begin{equation*} |\mathcal{Q}(f; \mu) - \mathcal{Q}(f; \pi)| \leq 2\sqrt{C \mathrm{KL}(\mu \| \pi)}~. \end{equation*}
Proof: Let \(\lambda > 0\) be arbitrary. Applying the Donsker-Varadhan inequality with \(g = \lambda f\), we have \begin{align*} \lambda \bbE_{\sfx \sim \mu}[f(\sfx)] &= \lambda \bbE_{\sfx \sim \pi}[r(\sfx) \cdot f(\sfx)] \\ &\leq \mathrm{KL}(\mu \| \pi) + \log \bbE_{\sfx \sim \pi}\left[e^{\lambda f(\sfx)}\right]~. \end{align*} Now subtracting \(\lambda \bbE_{\sfx \sim \pi}[f(\sfx)]\) from both sides, we get \begin{equation*} \lambda (\bbE_{\sfx \sim \mu}[f(\sfx)] - \bbE_{\sfx \sim \pi}[f(\sfx)]) \leq \mathrm{KL}(\mu \| \pi) + \log \bbE_{\sfx \sim \pi}\left[e^{\lambda (f(\sfx) - \bbE_{\sfx \sim \pi}[f(\sfx)])}\right]~. \end{equation*} By the assumed property of \((f, \pi)\), the log-moment term is at most \(C\lambda^{2}\); dividing through by \(\lambda\) gives \begin{equation*} \mathcal{Q}(f; \mu) - \mathcal{Q}(f; \pi) \leq \frac{1}{\lambda} \mathrm{KL}(\mu \| \pi) + C\lambda~. \end{equation*} Since \(\lambda > 0\) is arbitrary, one can minimise the right-hand side over the positive reals (the minimum is at \(\lambda = \sqrt{\mathrm{KL}(\mu \| \pi)/C}\)) to obtain \begin{equation*} \mathcal{Q}(f; \mu) - \mathcal{Q}(f; \pi) \leq 2\sqrt{C \mathrm{KL}(\mu \| \pi)}~. \end{equation*} Finally, to obtain a matching lower bound, substitute \(f \leftarrow -f\) above to see that \( \mathcal{Q}(f; \mu) - \mathcal{Q}(f; \pi) \geq -2\sqrt{C \mathrm{KL}(\mu \| \pi)}\). Q.E.D.
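A worked check of Lemma 1 (a standard Gaussian example of my own choosing, not from the note): take \(\pi = \mathcal{N}(0,1)\) and \(f(x) = x\), for which \(\bbE_{\pi}[e^{\lambda \sfx}] = e^{\lambda^{2}/2}\), so the assumption holds with \(C = \frac{1}{2}\). With \(\mu = \mathcal{N}(m, 1)\), the closed form \(\mathrm{KL}(\mu \| \pi) = \frac{m^{2}}{2}\) makes the bound computable exactly.

```python
import math

# pi = N(0,1), f(x) = x, so C = 1/2 (sub-Gaussian constant of a standard
# normal), and mu = N(m, 1) with KL(mu || pi) = m^2 / 2 in closed form.
m = 0.7
C = 0.5
kl = m ** 2 / 2.0

gap = abs(m - 0.0)               # |Q(f; mu) - Q(f; pi)| computed exactly
bound = 2.0 * math.sqrt(C * kl)  # Lemma 1's bound: 2 sqrt(C * KL) = |m|
# Here bound == gap: a Gaussian mean shift makes Lemma 1 tight.
```

That the bound is attained here suggests Lemma 1 cannot be improved in general without further assumptions.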
Since the KL divergence is a limiting case of the \(\alpha\)-Rényi divergence, a natural question is whether it is possible to obtain a characterisation of a "nice" pair for which the misspecification error is controlled by \(\mathrm{D}_{\alpha}(\mu \| \pi)\). We first work through the \(\chi^{2}\) divergence case, which corresponds to \(\alpha = 2\).
Lemma 2: Let \((f, \pi)\) be a pair such that \(\bbE_{\sfx \sim \pi}[f(\sfx)^{2}] < \infty\). Then \begin{equation*} |\mathcal{Q}(f; \mu) - \mathcal{Q}(f; \pi)| \leq \sqrt{\chi^{2}(\mu \| \pi) \bbE_{\sfx \sim \pi}[f(\sfx)^{2}]}~. \end{equation*}
Proof: Starting from the difference and applying the Cauchy-Schwarz inequality, \begin{equation*} |\mathcal{Q}(f; \mu) - \mathcal{Q}(f; \pi)| = |\bbE_{\sfx \sim \pi}[(r(\sfx) - 1) f(\sfx)]| \leq \sqrt{\bbE_{\sfx \sim \pi}[(r(\sfx) - 1)^{2}]} \cdot \sqrt{\bbE_{\sfx \sim \pi}[f(\sfx)^{2}]}~. \end{equation*} By the definition of \(\chi^{2}(\mu \| \pi)\), we obtain the statement of the lemma. Q.E.D.
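A numerical sanity check of Lemma 2 on a finite support (the distributions and \(f\) below are toy values of my own):

```python
import math

def expect(p, g):
    """E_{x ~ p}[g(x)] for a finite-support distribution given as a dict."""
    return sum(prob * g(x) for x, prob in p.items())

mu = {0: 0.2, 1: 0.5, 2: 0.3}
pi = {0: 0.4, 1: 0.4, 2: 0.2}
f = {0: 1.0, 1: -2.0, 2: 3.0}.__getitem__

r = {x: mu[x] / pi[x] for x in pi}                    # Radon-Nikodym ratio
chi2 = expect(pi, lambda x: (r[x] - 1.0) ** 2)        # chi^2(mu || pi)
gap = abs(expect(mu, f) - expect(pi, f))              # misspecification error
bound = math.sqrt(chi2 * expect(pi, lambda x: f(x) ** 2))
# Cauchy-Schwarz guarantees gap <= bound for any square-integrable f.
```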
Lemma 3: Let \((f, \pi)\) be a pair such that \(\bbE_{\sfx \sim \pi}[|f(\sfx)|^{\alpha^{\star}}] < \infty\), where \(\alpha^{\star} := \frac{\alpha}{\alpha - 1}\) is the Hölder conjugate of \(\alpha\), and \(\alpha \geq 2\). Then \begin{equation*} |\mathcal{Q}(f; \mu) - \mathcal{Q}(f; \pi)| \leq \left(\exp((\alpha - 1) \mathrm{D}_{\alpha}(\mu \| \pi)) - 1\right)^{\frac{1}{\alpha}} \bbE_{\sfx \sim \pi}[|f(\sfx)|^{\alpha^{\star}}]^{\frac{1}{\alpha^{\star}}}~. \end{equation*}
Proof: Starting with the difference \(\mathcal{Q}(f; \mu) - \mathcal{Q}(f; \pi)\), \begin{align*} |\mathcal{Q}(f; \mu) - \mathcal{Q}(f; \pi)| &= |\bbE_{\sfx \sim \pi}[r(\sfx) f(\sfx)] - \bbE_{\sfx \sim \pi}[f(\sfx)]| \\ &=|\bbE_{\sfx \sim \pi}[f(\sfx) (r(\sfx) - 1)]| \\ &\leq \bbE_{\sfx \sim \pi}[|f(\sfx)| \cdot |r(\sfx) - 1|] \\ &\leq \bbE_{\sfx \sim \pi}[|f(\sfx)|^{\alpha^{\star}}]^{\frac{1}{\alpha^{\star}}} \cdot \bbE_{\sfx \sim \pi}[|r(\sfx) - 1|^{\alpha}]^{\frac{1}{\alpha}}~, \end{align*} where the last step is Hölder's inequality with the conjugate exponents \(\alpha^{\star}\) and \(\alpha\). Now, we use an elegant inequality (which was pointed out to me by Gemini): for any \(t > 0\), \(|t - 1|^{\alpha} \leq t^{\alpha} - 1 - \alpha (t - 1)\) when \(\alpha \geq 2\) (proven in Proposition 1 below). With this, and since \(\bbE_{\sfx \sim \pi}[r(\sfx)] = 1\) kills the linear term, \begin{equation*} \bbE_{\sfx \sim \pi}[|r(\sfx) - 1|^{\alpha}] \leq \bbE_{\sfx \sim \pi}[r(\sfx)^{\alpha}] - 1 = \exp((\alpha - 1) \mathrm{D}_{\alpha}(\mu \| \pi)) - 1~. \end{equation*} This results in the final bound \begin{equation*} |\mathcal{Q}(f; \mu) - \mathcal{Q}(f; \pi)| \leq \left(\exp((\alpha - 1) \mathrm{D}_{\alpha}(\mu \| \pi)) - 1\right)^{\frac{1}{\alpha}} \cdot \bbE_{\sfx \sim \pi}[|f(\sfx)|^{\alpha^{\star}}]^{\frac{1}{\alpha^{\star}}}~. \end{equation*} Q.E.D.
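The pointwise inequality \(|t - 1|^{q} \leq t^{q} - 1 - q(t - 1)\) for \(t > 0\), \(q \geq 2\) can be spot-checked numerically on a grid (a sanity check of my own, not a proof):

```python
# Check |t - 1|^q <= t^q - 1 - q*(t - 1) for t > 0 and q >= 2 over a grid.
for q in (2.0, 2.5, 3.0, 4.0):
    for i in range(1, 500):
        t = i / 100.0  # grid over (0, 5)
        lhs = abs(t - 1.0) ** q
        rhs = t ** q - 1.0 - q * (t - 1.0)
        assert lhs <= rhs + 1e-9, (q, t)
# At q = 2 the two sides agree identically (expand (t-1)^2), which is why
# Lemma 3 with alpha = 2 recovers exactly the chi^2 bound of Lemma 2.
```

Note the \(\alpha = 2\) observation in the final comment: \(\alpha^{\star} = 2\) as well, and \(\exp(\mathrm{D}_{2}) - 1 = \chi^{2}\), so Lemma 3 specialises to Lemma 2.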
This gives us a broader understanding of how Rényi / KL similarities interplay with the misspecification error and the properties of \(f\). In particular, we have a sense of this misspecification error when we know a sub-Gaussian constant of \(f\) under \(\pi\) (Lemma 1), its second moment under \(\pi\) (Lemma 2), or its \(\alpha^{\star}\)-th absolute moment under \(\pi\) (Lemma 3).
It remains to build broader classes of \((f, \pi)\) for which we have bounds as stated above. Some interesting examples are listed below.
Note that here we have only focused on the misspecification error, which is minimised (indeed, made zero) by simply setting \(\mu = \pi\). Therefore, the goal of finding a \(\mu\) for the task of estimating \(\mathcal{Q}(f; \pi)\) is dictated by two objectives for a given \(f\): (a) \(\mu\) should be close to \(\pi\), and (b) \(f(\sfx)\) for \(\sfx \sim \mu\) should concentrate well around its mean.
Proposition 1: Let \(t > 0\). Then \(|t - 1|^{q} \leq t^{q} - 1 - q(t - 1)\) for \(q \geq 2\).
Proof: Note that when \(q = 2\), the inequality holds with equality, since \(t^{2} - 1 - 2(t - 1) = (t - 1)^{2}\). There are two cases: when \(t \geq 1\) and when \(t \in [0, 1)\). A key fact we will use is the superadditivity, over the non-negative reals, of convex functions that take the value \(0\) at \(0\): for such \(\varphi\), \(\varphi(a) + \varphi(b) \leq \varphi(a + b)\) whenever \(a, b \geq 0\).