### Deep learning: a statistical viewpoint (with P. Bartlett and A. Montanari)

preprint, 2021

The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting, that is, accurate predictions despite overfitting training data. In this article, we survey recent progress in statistical learning theory that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behavior of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favorable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.

The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting, that is, accurate predictions despite overfitting training data. In this article, we survey recent progress in statistical learning theory that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behavior of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favorable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.

### Majorizing Measures, Sequential Complexities, and Online Learning (with A. Block and Y. Dagan)

preprint, 2021

We introduce the technique of generic chaining and majorizing measures for controlling sequential Rademacher complexity. We relate majorizing measures to the notion of fractional covering numbers, which we show to be dominated in terms of sequential scale-sensitive dimensions in a horizon-independent way, and, under additional complexity assumptions establish a tight control on worst-case sequential Rademacher complexity in terms of the integral of sequential scale-sensitive dimension. Finally, we establish a tight contraction inequality for worst-case sequential Rademacher complexity. The above constitutes the resolution of a number of outstanding open problems in extending the classical theory of empirical processes to the sequential case, and, in turn, establishes sharp results for online learning.

We introduce the technique of generic chaining and majorizing measures for controlling sequential Rademacher complexity. We relate majorizing measures to the notion of fractional covering numbers, which we show to be dominated in terms of sequential scale-sensitive dimensions in a horizon-independent way, and, under additional complexity assumptions establish a tight control on worst-case sequential Rademacher complexity in terms of the integral of sequential scale-sensitive dimension. Finally, we establish a tight contraction inequality for worst-case sequential Rademacher complexity. The above constitutes the resolution of a number of outstanding open problems in extending the classical theory of empirical processes to the sequential case, and, in turn, establishes sharp results for online learning.

### On the Minimal Error of Empirical Risk Minimization (with G. Kur)

preprint, 2021

We study the minimal squared error of the Empirical Risk Minimization (ERM) procedure in the task of regression, both in random and fixed design settings. Our sharp lower bounds shed light on the possibility (or impossibility) of adapting to simplicity of the model generating the data. In the fixed design setting, we show that the error is governed by the global complexity of the entire class. In contrast, in random design, ERM may only adapt to simpler models if the local neighborhoods around the regression function are nearly as complex as the class itself, a somewhat counter-intuitive conclusion. We provide sharp lower bounds for performance of ERM in Donsker and non-Donsker classes. We also discuss our results through the lens of recent studies on interpolation in overparameterized models.

We study the minimal squared error of the Empirical Risk Minimization (ERM) procedure in the task of regression, both in random and fixed design settings. Our sharp lower bounds shed light on the possibility (or impossibility) of adapting to simplicity of the model generating the data. In the fixed design setting, we show that the error is governed by the global complexity of the entire class. In contrast, in random design, ERM may only adapt to simpler models if the local neighborhoods around the regression function are nearly as complex as the class itself, a somewhat counter-intuitive conclusion. We provide sharp lower bounds for performance of ERM in Donsker and non-Donsker classes. We also discuss our results through the lens of recent studies on interpolation in overparameterized models.

### Top-k eXtreme Contextual Bandits with Arm Hierarchy (with R. Sen, L. Ying, R. Kidambi, D. Foster, D. Hill, I. Dhillon)

preprint, 2021

Motivated by modern applications, such as online advertisement and recommender systems, we study the top-k extreme contextual bandits problem, where the total number of arms can be enormous, and the learner is allowed to select k arms and observe all or some of the rewards for the chosen arms. We first propose an algorithm for the non-extreme realizable setting, utilizing the Inverse Gap Weighting strategy for selecting multiple arms. We show that our algorithm has a regret guarantee of O(k\sqrt{(A-k+1)T \log (|F| T)}), where A is the total number of arms and F is the class containing the regression function, while only requiring \tilde{O}(A) computation per time step. In the extreme setting, where the total number of arms can be in the millions, we propose a practically-motivated arm hierarchy model that induces a certain structure in mean rewards to ensure statistical and computational efficiency. The hierarchical structure allows for an exponential reduction in the number of relevant arms for each context, thus resulting in a regret guarantee of O(k\sqrt{(\log A-k+1)T \log (|F|T)}). Finally, we implement our algorithm using a hierarchical linear function class and show superior performance with respect to well-known benchmarks on simulated bandit feedback experiments using extreme multi-label classification datasets. On a dataset with three million arms, our reduction scheme has an average inference time of only 7.9 milliseconds, which is a 100x improvement.

Motivated by modern applications, such as online advertisement and recommender systems, we study the top-k extreme contextual bandits problem, where the total number of arms can be enormous, and the learner is allowed to select k arms and observe all or some of the rewards for the chosen arms. We first propose an algorithm for the non-extreme realizable setting, utilizing the Inverse Gap Weighting strategy for selecting multiple arms. We show that our algorithm has a regret guarantee of O(k\sqrt{(A-k+1)T \log (|F| T)}), where A is the total number of arms and F is the class containing the regression function, while only requiring \tilde{O}(A) computation per time step. In the extreme setting, where the total number of arms can be in the millions, we propose a practically-motivated arm hierarchy model that induces a certain structure in mean rewards to ensure statistical and computational efficiency. The hierarchical structure allows for an exponential reduction in the number of relevant arms for each context, thus resulting in a regret guarantee of O(k\sqrt{(\log A-k+1)T \log (|F|T)}). Finally, we implement our algorithm using a hierarchical linear function class and show superior performance with respect to well-known benchmarks on simulated bandit feedback experiments using extreme multi-label classification datasets. On a dataset with three million arms, our reduction scheme has an average inference time of only 7.9 milliseconds, which is a 100x improvement.

### Instance-Dependent Complexity of Contextual Bandits and Reinforcement Learning: A Disagreement-Based Perspective (with D. Foster, D. Simchi-Levi, Y. Xu)

preprint, 2020

In the classical multi-armed bandit problem, instance-dependent algorithms attain improved performance on "easy" problems with a gap between the best and second-best arm. Are similar guarantees possible for contextual bandits? While positive results are known for certain special cases, there is no general theory characterizing when and how instance-dependent regret bounds for contextual bandits can be achieved for rich, general classes of policies. We introduce a family of complexity measures that are both sufficient and necessary to obtain instance-dependent regret bounds. We then introduce new oracle-efficient algorithms which adapt to the gap whenever possible, while also attaining the minimax rate in the worst case. Finally, we provide structural results that tie together a number of complexity measures previously proposed throughout contextual bandits, reinforcement learning, and active learning and elucidate their role in determining the optimal instance-dependent regret. In a large-scale empirical evaluation, we find that our approach often gives superior results for challenging exploration problems. Turning our focus to reinforcement learning with function approximation, we develop new oracle-efficient algorithms for reinforcement learning with rich observations that obtain optimal gap-dependent sample complexity.

In the classical multi-armed bandit problem, instance-dependent algorithms attain improved performance on "easy" problems with a gap between the best and second-best arm. Are similar guarantees possible for contextual bandits? While positive results are known for certain special cases, there is no general theory characterizing when and how instance-dependent regret bounds for contextual bandits can be achieved for rich, general classes of policies. We introduce a family of complexity measures that are both sufficient and necessary to obtain instance-dependent regret bounds. We then introduce new oracle-efficient algorithms which adapt to the gap whenever possible, while also attaining the minimax rate in the worst case. Finally, we provide structural results that tie together a number of complexity measures previously proposed throughout contextual bandits, reinforcement learning, and active learning and elucidate their role in determining the optimal instance-dependent regret. In a large-scale empirical evaluation, we find that our approach often gives superior results for challenging exploration problems. Turning our focus to reinforcement learning with function approximation, we develop new oracle-efficient algorithms for reinforcement learning with rich observations that obtain optimal gap-dependent sample complexity.

### Beyond UCB: Optimal and efficient contextual bandits with regression oracles (with D. Foster)

ICML, 2020

A fundamental challenge in contextual bandits is to develop flexible, general-purpose algorithms with computational requirements no worse than classical supervised learning tasks such as classification and regression. Algorithms based on regression have shown promising empirical success, but theoretical guarantees have remained elusive except in special cases. We provide the first universal and optimal reduction from contextual bandits to online regression. We show how to transform any oracle for online regression with a given value function class into an algorithm for contextual bandits with the induced policy class, with no overhead in runtime or memory requirements. We characterize the minimax rates for contextual bandits with general, potentially nonparametric function classes, and show that our algorithm is minimax optimal whenever the oracle obtains the optimal rate for regression. Compared to previous results, our algorithm requires no distributional assumptions beyond realizability, and works even when contexts are chosen adversarially.

A fundamental challenge in contextual bandits is to develop flexible, general-purpose algorithms with computational requirements no worse than classical supervised learning tasks such as classification and regression. Algorithms based on regression have shown promising empirical success, but theoretical guarantees have remained elusive except in special cases. We provide the first universal and optimal reduction from contextual bandits to online regression. We show how to transform any oracle for online regression with a given value function class into an algorithm for contextual bandits with the induced policy class, with no overhead in runtime or memory requirements. We characterize the minimax rates for contextual bandits with general, potentially nonparametric function classes, and show that our algorithm is minimax optimal whenever the oracle obtains the optimal rate for regression. Compared to previous results, our algorithm requires no distributional assumptions beyond realizability, and works even when contexts are chosen adversarially.

### On Suboptimality of Empirical Risk Minimization with Squared Loss with Application to Estimation of Convex Bodies (with G. Kur and A. Guntuboyina)

COLT, 2020

We develop a technique for establishing a lower bound on the sample complexity of Empirical Risk Minimization (ERM) with squared loss for large classes of functions. As an application, we settle an open problem regarding optimality of Least Squares Estimator (LSE) in estimating a convex set from noisy support function measurements in dimension $d\geq 6$. Specifically, we establish that the LSE is mimimax \textit{sub-optimal}, and achieves a rate of $\Tilde{\Theta}_d(n^{-2/(d-1)})$ whereas the minimax rate is $\Theta_d(n^{-4/(d+3)})$

We develop a technique for establishing a lower bound on the sample complexity of Empirical Risk Minimization (ERM) with squared loss for large classes of functions. As an application, we settle an open problem regarding optimality of Least Squares Estimator (LSE) in estimating a convex set from noisy support function measurements in dimension $d\geq 6$. Specifically, we establish that the LSE is mimimax \textit{sub-optimal}, and achieves a rate of $\Tilde{\Theta}_d(n^{-2/(d-1)})$ whereas the minimax rate is $\Theta_d(n^{-4/(d+3)})$

### Learning the Linear Quadratic Regulator from Nonlinear Observations (with Z. Mhammedi, D. Foster, M. Simchowitz, W. Sun, D. Mistra, A. Krishnamurthy, and J. Langford)

NeurIPS, 2020

We introduce a new problem setting for continuous control called the LQR with Rich Observations, or RichLQR. In our setting, the environment is summarized by a low-dimensional continuous latent state with linear dynamics and quadratic costs, but the agent operates on high-dimensional, nonlinear observations such as images from a camera. To enable sample-efficient learning, we assume that the learner has access to a class of decoder functions (e.g., neural networks) that is flexible enough to capture the mapping from observations to latent states. We introduce a new algorithm, RichID, which learns a near-optimal policy for the RichLQR with sample complexity scaling only with the dimension of the latent state space and the capacity of the decoder function class. RichID is oracle-efficient and accesses the decoder class only through calls to a least-squares regression oracle. Our results constitute the first provable sample complexity guarantee for continuous control with an unknown nonlinearity in the system model and general function approximation.

We introduce a new problem setting for continuous control called the LQR with Rich Observations, or RichLQR. In our setting, the environment is summarized by a low-dimensional continuous latent state with linear dynamics and quadratic costs, but the agent operates on high-dimensional, nonlinear observations such as images from a camera. To enable sample-efficient learning, we assume that the learner has access to a class of decoder functions (e.g., neural networks) that is flexible enough to capture the mapping from observations to latent states. We introduce a new algorithm, RichID, which learns a near-optimal policy for the RichLQR with sample complexity scaling only with the dimension of the latent state space and the capacity of the decoder function class. RichID is oracle-efficient and accesses the decoder class only through calls to a least-squares regression oracle. Our results constitute the first provable sample complexity guarantee for continuous control with an unknown nonlinearity in the system model and general function approximation.

### Fast Mixing of Multi-Scale Langevin Dynamics under the Manifold Hypothesis (with A. Block, Y. Mroueh, and J. Ross)

preprint, 2020

Recently, the task of image generation has attracted much attention. In particular, the recent empirical successes of the Markov Chain Monte Carlo (MCMC) technique of Langevin Dynamics have prompted a number of theoretical advances; despite this, several outstanding problems remain. First, the Langevin Dynamics is run in very high dimension on a nonconvex landscape; in the worst case, due to the NP-hardness of nonconvex optimization, it is thought that Langevin Dynamics mixes only in time exponential in the dimension. In this work, we demonstrate how the manifold hypothesis allows for the considerable reduction of mixing time, from exponential in the ambient dimension to depending only on the (much smaller) intrinsic dimension of the data. Second, the high dimension of the sampling space significantly hurts the performance of Langevin Dynamics; we leverage a multi-scale approach to help ameliorate this issue and observe that this multi-resolution algorithm allows for a trade-off between image quality and computational expense in generation.

Recently, the task of image generation has attracted much attention. In particular, the recent empirical successes of the Markov Chain Monte Carlo (MCMC) technique of Langevin Dynamics have prompted a number of theoretical advances; despite this, several outstanding problems remain. First, the Langevin Dynamics is run in very high dimension on a nonconvex landscape; in the worst case, due to the NP-hardness of nonconvex optimization, it is thought that Langevin Dynamics mixes only in time exponential in the dimension. In this work, we demonstrate how the manifold hypothesis allows for the considerable reduction of mixing time, from exponential in the ambient dimension to depending only on the (much smaller) intrinsic dimension of the data. Second, the high dimension of the sampling space significantly hurts the performance of Langevin Dynamics; we leverage a multi-scale approach to help ameliorate this issue and observe that this multi-resolution algorithm allows for a trade-off between image quality and computational expense in generation.

### Generative Modeling with Denoising Auto-Encoders and Langevin Sampling (with A. Block and Y. Mroueh)

preprint, 2020

We study convergence of a generative modeling method that first estimates the score function of the distribution using Denoising Auto-Encoders (DAE) or Denoising Score Matching (DSM) and then employs Langevin diffusion for sampling. We show that both DAE and DSM provide estimates of the score of the Gaussian smoothed population density, allowing us to apply the machinery of Empirical Processes. We overcome the challenge of relying only on L2 bounds on the score estimation error and provide finite-sample bounds in the Wasserstein distance between the law of the population distribution and the law of this sampling scheme. We then apply our results to the homotopy method of [SE19] and provide theoretical justification for its empirical success.

We study convergence of a generative modeling method that first estimates the score function of the distribution using Denoising Auto-Encoders (DAE) or Denoising Score Matching (DSM) and then employs Langevin diffusion for sampling. We show that both DAE and DSM provide estimates of the score of the Gaussian smoothed population density, allowing us to apply the machinery of Empirical Processes. We overcome the challenge of relying only on L2 bounds on the score estimation error and provide finite-sample bounds in the Wasserstein distance between the law of the population distribution and the law of this sampling scheme. We then apply our results to the homotopy method of [SE19] and provide theoretical justification for its empirical success.

### On the Multiple Descent of Minimum-Norm Interpolants and Restricted Lower Isometry of Kernels (with T. Liang and X. Zhai)

COLT, 2020

We study the risk of minimum-norm interpolants of data in Reproducing Kernel Hilbert Spaces. Our upper bounds on the risk are of a multiple-descent shape for the various scalings of d = n^\alpha, \alpha\in(0,1), for the input dimension d and sample size n. Empirical evidence supports our finding that minimum-norm interpolants in RKHS can exhibit this unusual non-monotonicity in sample size; furthermore, locations of the peaks in our experiments match our theoretical predictions. Since gradient flow on appropriately initialized wide neural networks converges to a minimum-norm interpolant with respect to a certain kernel, our analysis also yields novel estimation and generalization guarantees for these over-parametrized models. At the heart of our analysis is a study of spectral properties of the random kernel matrix restricted to a filtration of eigen-spaces of the population covariance operator, and may be of independent interest.

We study the risk of minimum-norm interpolants of data in Reproducing Kernel Hilbert Spaces. Our upper bounds on the risk are of a multiple-descent shape for the various scalings of d = n^\alpha, \alpha\in(0,1), for the input dimension d and sample size n. Empirical evidence supports our finding that minimum-norm interpolants in RKHS can exhibit this unusual non-monotonicity in sample size; furthermore, locations of the peaks in our experiments match our theoretical predictions. Since gradient flow on appropriately initialized wide neural networks converges to a minimum-norm interpolant with respect to a certain kernel, our analysis also yields novel estimation and generalization guarantees for these over-parametrized models. At the heart of our analysis is a study of spectral properties of the random kernel matrix restricted to a filtration of eigen-spaces of the population covariance operator, and may be of independent interest.

### Optimality of Maximum Likelihood for Log-Concave Density Estimation and Bounded Convex Regression (with G. Kur and Y. Dagan)

preprint, 2019

In this paper, we study two fundamentals problems: estimation of a d-dimensional log-concave distribution and bounded multivariate convex regression with random design. First, we show that for all d≥4 the maximum likelihood estimators of both problems achieve an optimal risk (up to a logarithmic factor) of Θ_d(n^{−2/(d+1)}) in terms of squared Hellinger distance and L2 squared distance, respectively. Previously, the optimality of both these estimators was known only for d≤3. We also prove that the ϵ-entropy numbers of the two aforementioned families are equal up to logarithmic factors. We complement these results by proving a sharp bound Θ_d(n^{−2/(d+4)}) on the minimax rate (up to logarithmic factors) with respect to the total variation distance. Finally, we prove that estimating a log-concave density---even a uniform distribution on a convex set---up to a fixed accuracy requires \emph{at least} a number of samples which is exponential in the dimension. We do that by improving the dimensional constant in the best known lower bound for the minimax rate from 2^{−d}⋅n^{−2/(d+1)} to c⋅n^{−2/(d+1)}.

In this paper, we study two fundamentals problems: estimation of a d-dimensional log-concave distribution and bounded multivariate convex regression with random design. First, we show that for all d≥4 the maximum likelihood estimators of both problems achieve an optimal risk (up to a logarithmic factor) of Θ_d(n^{−2/(d+1)}) in terms of squared Hellinger distance and L2 squared distance, respectively. Previously, the optimality of both these estimators was known only for d≤3. We also prove that the ϵ-entropy numbers of the two aforementioned families are equal up to logarithmic factors. We complement these results by proving a sharp bound Θ_d(n^{−2/(d+4)}) on the minimax rate (up to logarithmic factors) with respect to the total variation distance. Finally, we prove that estimating a log-concave density---even a uniform distribution on a convex set---up to a fixed accuracy requires \emph{at least} a number of samples which is exponential in the dimension. We do that by improving the dimensional constant in the best known lower bound for the minimax rate from 2^{−d}⋅n^{−2/(d+1)} to c⋅n^{−2/(d+1)}.

### Learning nonlinear dynamical systems from a single trajectory (with D. Foster and T. Sarkar)

Learning for Decision and Control conference, 2020

We introduce algorithms for learning nonlinear dynamical systems of the form $x_{t+1}=\sigma(\Theta^{\star}x_t)+\varepsilon_t$, where $\Theta^{\star}$ is a weight matrix, $\sigma$ is a nonlinear link function, and $\varepsilon_t$ is a mean-zero noise process. We give an algorithm that recovers the weight matrix $\Theta^{\star}$ from a single trajectory with optimal sample complexity and linear running time. The algorithm succeeds under weaker statistical assumptions than in previous work, and in particular i) does not require a bound on the spectral norm of the weight matrix $\Theta^{\star}$ (rather, it depends on a generalization of the spectral radius) and ii) enjoys guarantees for non-strictly-increasing link functions such as the ReLU. Our analysis has two key components: i) we give a general recipe whereby global stability for nonlinear dynamical systems can be used to certify that the state-vector covariance is well-conditioned, and ii) using these tools, we extend well-known algorithms for efficiently learning generalized linear models to the dependent setting.

We introduce algorithms for learning nonlinear dynamical systems of the form $x_{t+1}=\sigma(\Theta^{\star}x_t)+\varepsilon_t$, where $\Theta^{\star}$ is a weight matrix, $\sigma$ is a nonlinear link function, and $\varepsilon_t$ is a mean-zero noise process. We give an algorithm that recovers the weight matrix $\Theta^{\star}$ from a single trajectory with optimal sample complexity and linear running time. The algorithm succeeds under weaker statistical assumptions than in previous work, and in particular i) does not require a bound on the spectral norm of the weight matrix $\Theta^{\star}$ (rather, it depends on a generalization of the spectral radius) and ii) enjoys guarantees for non-strictly-increasing link functions such as the ReLU. Our analysis has two key components: i) we give a general recipe whereby global stability for nonlinear dynamical systems can be used to certify that the state-vector covariance is well-conditioned, and ii) using these tools, we extend well-known algorithms for efficiently learning generalized linear models to the dependent setting.

### Learning switched linear dynamical systems of unknown order (with T. Sarkar and M. Dahleh)

preprint, 2020

Many natural systems, such as neurons firing in the brain, a country experiencing recession or central bank intervention in response to a pandemic, give rise to complex, nonlinear dynamics. We can gain insight into these dynamics by effectively modeling them by a system that switches among a set of conditionally linear dynamical modes. In this work, we study a special case of such models: switched linear dynamical systems (SLDS) with unknown latent space dimension, or \textit{order}. In particular, we provide the first statistical guarantees for learning a SLDS from finite noisy input--output data. We demonstrate that existing algorithms for finite time LDS identification, in general, do not apply to our case. Then using tools from control and systems theory, we develop a regression-based algorithm that recovers the parameters of a SLDS. The key idea in our analysis comprises of an order selection step, based on data-dependent quantities, that enables us to circumvent issues arising from unknown order. We construct low order approximations of the underlying SLDS from noisy data; these approximations approach the true SLDS as data length increases.

Many natural systems, such as neurons firing in the brain, a country experiencing recession or central bank intervention in response to a pandemic, give rise to complex, nonlinear dynamics. We can gain insight into these dynamics by effectively modeling them by a system that switches among a set of conditionally linear dynamical modes. In this work, we study a special case of such models: switched linear dynamical systems (SLDS) with unknown latent space dimension, or \textit{order}. In particular, we provide the first statistical guarantees for learning a SLDS from finite noisy input--output data. We demonstrate that existing algorithms for finite time LDS identification, in general, do not apply to our case. Then using tools from control and systems theory, we develop a regression-based algorithm that recovers the parameters of a SLDS. The key idea in our analysis comprises of an order selection step, based on data-dependent quantities, that enables us to circumvent issues arising from unknown order. We construct low order approximations of the underlying SLDS from noisy data; these approximations approach the true SLDS as data length increases.

### Consistency of Interpolation with Laplace Kernels is a High-Dimensional Phenomenon (with X. Zhai)

COLT, 2019

We show that minimum-norm interpolation in the Reproducing Kernel Hilbert Space corresponding to the Laplace kernel is not consistent if input dimension is constant. The lower bound holds for any choice of kernel bandwidth, even if selected based on data. The result supports the empirical observation that minimum-norm interpolation (that is, exact fit to training data) in RKHS generalizes well for some high-dimensional datasets, but not for low-dimensional ones.

We show that minimum-norm interpolation in the Reproducing Kernel Hilbert Space corresponding to the Laplace kernel is not consistent if input dimension is constant. The lower bound holds for any choice of kernel bandwidth, even if selected based on data. The result supports the empirical observation that minimum-norm interpolation (that is, exact fit to training data) in RKHS generalizes well for some high-dimensional datasets, but not for low-dimensional ones.

### Just Interpolate: Kernel "Ridgeless" Regression Can Generalize (with T. Liang)

The Annals of Statistics, 2020

In the absence of explicit regularization, Kernel ``Ridgeless'' Regression with nonlinear kernels has the potential to fit the training data perfectly. It has been observed empirically, however, that such interpolated solutions can still generalize well on test data. We isolate a phenomenon of implicit regularization for minimum-norm interpolated solutions which is due to a combination of high dimensionality of the input data, curvature of the kernel function, and favorable geometric properties of the data such as an eigenvalue decay of the empirical covariance and kernel matrices. In addition to deriving a data-dependent upper bound on the out-of-sample error, we present experimental evidence suggesting that the phenomenon occurs in the MNIST dataset.

In the absence of explicit regularization, Kernel ``Ridgeless'' Regression with nonlinear kernels has the potential to fit the training data perfectly. It has been observed empirically, however, that such interpolated solutions can still generalize well on test data. We isolate a phenomenon of implicit regularization for minimum-norm interpolated solutions which is due to a combination of high dimensionality of the input data, curvature of the kernel function, and favorable geometric properties of the data such as an eigenvalue decay of the empirical covariance and kernel matrices. In addition to deriving a data-dependent upper bound on the out-of-sample error, we present experimental evidence suggesting that the phenomenon occurs in the MNIST dataset.

### Finite-Time System Identification for Partially Observed LTI Systems of Unknown Order (with T. Sarkar and M. Dahleh)

JMLR, 2019

We address the problem of learning the parameters of a stable linear time invariant (LTI) system with unknown latent space dimension, or order, from its noisy input-output data. In this work, we focus on learning the parameters of the best lower order approximation allowed by the finite data. This is achieved by constructing a Hankel-like representation of the underlying system using ordinary least squares. Such a representation circumvents the non-convexities that typically arise in system identification, and it allows accurate estimation of the underlying LTI system. Our results rely on a careful analysis of a self-normalized martingale difference term that helps bound identification error up to logarithmic factors of the lower bound. We provide a data-dependent scheme for order selection and find a realization of system parameters, corresponding to that order, by an approach that is closely related to the celebrated Kalman-Ho subspace algorithm. We show that this realization is a good approximation of the underlying LTI system with high probability. Finally, we demonstrate that the proposed model order selection procedure is minimax optimal, i.e. for the given data length it is not always possible to estimate higher order models or find higher order approximations with reasonable accuracy.

We address the problem of learning the parameters of a stable linear time invariant (LTI) system with unknown latent space dimension, or order, from its noisy input-output data. In this work, we focus on learning the parameters of the best lower order approximation allowed by the finite data. This is achieved by constructing a Hankel-like representation of the underlying system using ordinary least squares. Such a representation circumvents the non-convexities that typically arise in system identification, and it allows accurate estimation of the underlying LTI system. Our results rely on a careful analysis of a self-normalized martingale difference term that helps bound identification error up to logarithmic factors of the lower bound. We provide a data-dependent scheme for order selection and find a realization of system parameters, corresponding to that order, by an approach that is closely related to the celebrated Kalman-Ho subspace algorithm. We show that this realization is a good approximation of the underlying LTI system with high probability. Finally, we demonstrate that the proposed model order selection procedure is minimax optimal, i.e. for the given data length it is not always possible to estimate higher order models or find higher order approximations with reasonable accuracy.

### Near optimal finite time guarantees for arbitrary linear dynamical systems (with T. Sarkar)

ICML, 2019

We derive finite time error bounds for estimating general linear time-invariant (LTI) systems from a single observed trajectory using the method of least squares. We provide the first analysis of the general case when eigenvalues of the LTI system are arbitrarily distributed in three regimes: stable, marginally stable, and explosive. Our analysis yields sharp upper bounds for each of these cases separately. We observe that although the underlying process behaves quite differently in each of these three regimes, the systematic analysis of a self--normalized martingale difference term helps bound identification error up to logarithmic factors of the lower bound. On the other hand, we demonstrate that the least squares solution may be statistically inconsistent under certain conditions even when the signal-to-noise ratio is high.

We derive finite time error bounds for estimating general linear time-invariant (LTI) systems from a single observed trajectory using the method of least squares. We provide the first analysis of the general case when eigenvalues of the LTI system are arbitrarily distributed in three regimes: stable, marginally stable, and explosive. Our analysis yields sharp upper bounds for each of these cases separately. We observe that although the underlying process behaves quite differently in each of these three regimes, the systematic analysis of a self--normalized martingale difference term helps bound identification error up to logarithmic factors of the lower bound. On the other hand, we demonstrate that the least squares solution may be statistically inconsistent under certain conditions even when the signal-to-noise ratio is high.

### Does data interpolation contradict statistical optimality? (with M. Belkin and A. Tsybakov)

AISTATS, 2019

We show that learning methods interpolating the training data can achieve optimal rates for the problems of nonparametric regression and prediction with square loss.

We show that learning methods interpolating the training data can achieve optimal rates for the problems of nonparametric regression and prediction with square loss.

### Online Learning: Sufficient Statistics and the Burkholder Method (with D. Foster and K. Sridharan)

COLT, 2018

We uncover a fairly general principle in online learning: If regret can be (approximately) expressed as a function of certain "sufficient statistics" for the data sequence, then there exists a special Burkholder function that 1) can be used algorithmically to achieve the regret bound and 2) only depends on these sufficient statistics, not the entire data sequence, so that the online strategy is only required to keep the sufficient statistics in memory. This characterization is achieved by bringing the full power of the Burkholder Method --- originally developed for certifying probabilistic martingale inequalities --- to bear on the online learning setting.

To demonstrate the scope and effectiveness of the Burkholder method, we develop a novel online strategy for matrix prediction that attains a regret bound corresponding to the variance term in matrix concentration inequalities. We also present a linear-time/space prediction strategy for parameter free supervised learning with linear classes and general smooth norms.

We uncover a fairly general principle in online learning: If regret can be (approximately) expressed as a function of certain "sufficient statistics" for the data sequence, then there exists a special Burkholder function that 1) can be used algorithmically to achieve the regret bound and 2) only depends on these sufficient statistics, not the entire data sequence, so that the online strategy is only required to keep the sufficient statistics in memory. This characterization is achieved by bringing the full power of the Burkholder Method --- originally developed for certifying probabilistic martingale inequalities --- to bear on the online learning setting.

To demonstrate the scope and effectiveness of the Burkholder method, we develop a novel online strategy for matrix prediction that attains a regret bound corresponding to the variance term in matrix concentration inequalities. We also present a linear-time/space prediction strategy for parameter free supervised learning with linear classes and general smooth norms.

### Size-Independent Sample Complexity of Neural Networks (with N. Golowich and O. Shamir)

COLT, 2018

We study the sample complexity of learning neural networks, by providing new bounds on their Rademacher complexity assuming norm constraints on the parameter matrix of each layer. Compared to previous work, these complexity bounds have improved dependence on the network depth, and under some additional assumptions, are fully independent of the network size (both depth and width). These results are derived using some novel techniques, which may be of independent interest.

We study the sample complexity of learning neural networks, by providing new bounds on their Rademacher complexity assuming norm constraints on the parameter matrix of each layer. Compared to previous work, these complexity bounds have improved dependence on the network depth, and under some additional assumptions, are fully independent of the network size (both depth and width). These results are derived using some novel techniques, which may be of independent interest.

### Fisher-Rao Metric, Geometry, and Complexity of Neural Networks (with T. Liang, T. Poggio, and J. Stokes)

AISTATS, 2019

We study the relationship between geometry and capacity measures for deep neural networks from an invariance viewpoint. We introduce a new notion of capacity --- the Fisher-Rao norm --- that possesses desirable invariance properties and is motivated by Information Geometry. We discover an analytical characterization of the new capacity measure, through which we establish norm-comparison inequalities and further show that the new measure serves as an umbrella for several existing norm-based complexity measures. We discuss upper bounds on the generalization error induced by the proposed measure. Extensive numerical experiments on CIFAR-10 support our theoretical findings. Our theoretical analysis rests on a key structural lemma about partial derivatives of multi-layer rectifier networks.

We study the relationship between geometry and capacity measures for deep neural networks from an invariance viewpoint. We introduce a new notion of capacity --- the Fisher-Rao norm --- that possesses desirable invariance properties and is motivated by Information Geometry. We discover an analytical characterization of the new capacity measure, through which we establish norm-comparison inequalities and further show that the new measure serves as an umbrella for several existing norm-based complexity measures. We discuss upper bounds on the generalization error induced by the proposed measure. Extensive numerical experiments on CIFAR-10 support our theoretical findings. Our theoretical analysis rests on a key structural lemma about partial derivatives of multi-layer rectifier networks.

### Weighted Message Passing and Minimum Energy Flow for Heterogeneous Stochastic Block Models with Side Information (with T. Cai and T. Liang)

Journal of Machine Learning Research, 2018

We study the misclassification error for community detection in general heterogeneous stochastic block models (SBM) with noisy or partial label information. We establish a connection between the misclassification rate and the notion of minimum energy on the local neighborhood of the SBM. We develop an optimally weighted message passing algorithm to reconstruct labels for SBM based on the minimum energy flow and the eigenvectors of a certain Markov transition matrix. The general SBM considered in this paper allows for unequal-size communities, degree heterogeneity, and different connection probabilities among blocks. We focus on how to optimally weigh the message passing to improve misclassification.

We study the misclassification error for community detection in general heterogeneous stochastic block models (SBM) with noisy or partial label information. We establish a connection between the misclassification rate and the notion of minimum energy on the local neighborhood of the SBM. We develop an optimally weighted message passing algorithm to reconstruct labels for SBM based on the minimum energy flow and the eigenvectors of a certain Markov transition matrix. The general SBM considered in this paper allows for unequal-size communities, degree heterogeneity, and different connection probabilities among blocks. We focus on how to optimally weigh the message passing to improve misclassification.

### Non-Convex Learning via Stochastic Gradient Langevin Dynamics: A Nonasymptotic Analysis (with M. Raginsky and M. Telgarsky)

COLT, 2017

Stochastic Gradient Langevin Dynamics (SGLD) is a popular variant of Stochastic Gradient Descent, where properly scaled isotropic Gaussian noise is added to an unbiased estimate of the gradient at each iteration. This modest change allows SGLD to escape local minima and suffices to guarantee asymptotic convergence to global minimizers for sufficiently regular non-convex objectives (Gelfand and Mitter, 91).

The present work provides a nonasymptotic analysis in the context of non-convex learning problems, giving finite-time guarantees for SGLD to find approximate minimizers of both empirical and population risks.

As in the asymptotic setting, our analysis relates the discrete-time SGLD Markov chain to a continuous-time diffusion process. A new tool that drives the results is the use of weighted transportation cost inequalities to quantify the rate of convergence of SGLD to a stationary distribution in the Euclidean $2$-Wasserstein distance.

Stochastic Gradient Langevin Dynamics (SGLD) is a popular variant of Stochastic Gradient Descent, where properly scaled isotropic Gaussian noise is added to an unbiased estimate of the gradient at each iteration. This modest change allows SGLD to escape local minima and suffices to guarantee asymptotic convergence to global minimizers for sufficiently regular non-convex objectives (Gelfand and Mitter, 91).

The present work provides a nonasymptotic analysis in the context of non-convex learning problems, giving finite-time guarantees for SGLD to find approximate minimizers of both empirical and population risks.

As in the asymptotic setting, our analysis relates the discrete-time SGLD Markov chain to a continuous-time diffusion process. A new tool that drives the results is the use of weighted transportation cost inequalities to quantify the rate of convergence of SGLD to a stationary distribution in the Euclidean $2$-Wasserstein distance.

### ZigZag: A new approach to adaptive online learning (with D. Foster and K. Sridharan)

COLT, 2017

We develop a new family of algorithms for the online learning setting with regret against any data sequence bounded by the

We realize our general theory by giving new efficient and adaptive algorithms for classes including l_p norms, group norms, and reproducing kernel Hilbert spaces. The empirical Rademacher complexity regret bound implies --- when used in the i.i.d. setting --- a

In addition to obtaining tight data-dependent regret bounds, our algorithms enjoy improved efficiency over previous techniques based on Rademacher complexity, automatically work in the infinite horizon setting, and adapt to scale. To obtain such adaptive methods, we introduce novel machinery, and the resulting algorithms are not based on the standard tools of online convex optimization. We conclude with a number of open problems and new directions, both algorithmic and information-theoretic.

We develop a new family of algorithms for the online learning setting with regret against any data sequence bounded by the

*empirical Rademacher complexity*of that sequence. To develop a general theory of when this type of adaptive regret bound is achievable we establish a connection to the theory of

*decoupling inequalities*for martingales in Banach spaces. When the hypothesis class is a set of linear functions bounded in some norm, such a regret bound is achievable if and only if the norm satisfies certain decoupling inequalities for martingales. Donald Burkholder's celebrated

*geometric characterization*of decoupling inequalities (Burkholder, 84) states that such an inequality holds if and only if there exists a special function called a

*Burkholder function*satisfying certain restricted concavity properties. Our online learning algorithms are efficient in terms of queries to this function.

We realize our general theory by giving new efficient and adaptive algorithms for classes including l_p norms, group norms, and reproducing kernel Hilbert spaces. The empirical Rademacher complexity regret bound implies --- when used in the i.i.d. setting --- a

*data-dependent*complexity bound for excess risk after online-to-batch conversion. To showcase the power of the empirical Rademacher complexity regret bound, we derive improved rates for a supervised learning generalization of the

*online learning with low rank experts*task and for the

*online matrix prediction*task.

In addition to obtaining tight data-dependent regret bounds, our algorithms enjoy improved efficiency over previous techniques based on Rademacher complexity, automatically work in the infinite horizon setting, and adapt to scale. To obtain such adaptive methods, we introduce novel machinery, and the resulting algorithms are not based on the standard tools of online convex optimization. We conclude with a number of open problems and new directions, both algorithmic and information-theoretic.

### On Equivalence of Martingale Tail Bounds and Deterministic Regret Inequalities (with K. Sridharan)

COLT, 2017

We study an equivalence of (i) deterministic pathwise statements appearing in the online learning literature (termed

We study an equivalence of (i) deterministic pathwise statements appearing in the online learning literature (termed

*regret bounds*), (ii) high-probability tail bounds for the supremum of a collection of martingales (of a specific form arising from uniform laws of large numbers for martingales), and (iii) in-expectation bounds for the supremum. By virtue of the equivalence, we prove exponential tail bounds for norms of Banach space valued martingales via deterministic regret bounds for the online mirror descent algorithm with an adaptive step size. We extend these results beyond the linear structure of the Banach space: we define a notion of

*martingale type*for general classes of real-valued functions and show its equivalence (up to a logarithmic factor) to various sequential complexities of the class (in particular, the sequential Rademacher complexity and its offset version). For classes with the general martingale type 2, we exhibit a finer notion of variation that allows partial adaptation to the function indexing the martingale. Our proof technique rests on sequential symmetrization and on certifying the

*existence*of regret minimization strategies for certain online prediction problems.

### Efficient Multiclass Prediction on Graphs via Surrogate Losses (with K. Sridharan)

AISTATS, 2017

We develop computationally efficient algorithms for online multi-class prediction. Our construction is based on carefully-chosen data-dependent surrogate loss functions, and the new methods enjoy strong mistake bound guarantees.

To illustrate the technique, we study the combinatorial problem of node classification and develop a prediction strategy that is linear-time per round. In contrast, the offline benchmark is NP-hard to compute in general. We demonstrate the empirical performance of the method on several datasets.

We develop computationally efficient algorithms for online multi-class prediction. Our construction is based on carefully-chosen data-dependent surrogate loss functions, and the new methods enjoy strong mistake bound guarantees.

To illustrate the technique, we study the combinatorial problem of node classification and develop a prediction strategy that is linear-time per round. In contrast, the offline benchmark is NP-hard to compute in general. We demonstrate the empirical performance of the method on several datasets.

### On Detection and Structural Reconstruction of Small-World Random Networks (with T. Cai and T. Liang)

IEEE Transactions on Network Science and Engineering, 2017

In this paper, we study detection and fast reconstruction of the celebrated Watts-Strogatz (WS) small-world random graph model (Watts and Strogatz '98) which aims to describe real-world complex networks that exhibit both high clustering and short average length properties. The WS model with neighborhood size k and rewiring probability probability β can be viewed as a continuous interpolation between a deterministic ring lattice graph and the Erdős-Rényi random graph. We study both the computational and statistical aspects of detecting the deterministic ring lattice structure (or local geographical links, strong ties) in the presence of random connections (or long range links, weak ties), and for its recovery. The phase diagram in terms of (k,β) is partitioned into several regions according to the difficulty of the problem. We propose distinct methods for the various regions.

In this paper, we study detection and fast reconstruction of the celebrated Watts-Strogatz (WS) small-world random graph model (Watts and Strogatz '98) which aims to describe real-world complex networks that exhibit both high clustering and short average length properties. The WS model with neighborhood size k and rewiring probability probability β can be viewed as a continuous interpolation between a deterministic ring lattice graph and the Erdős-Rényi random graph. We study both the computational and statistical aspects of detecting the deterministic ring lattice structure (or local geographical links, strong ties) in the presence of random connections (or long range links, weak ties), and for its recovery. The phase diagram in terms of (k,β) is partitioned into several regions according to the difficulty of the problem. We propose distinct methods for the various regions.

### A Tutorial on Online Supervised Learning with Applications to Node Classification in Social Networks (with K. Sridharan)

preprint, 2016

We revisit the elegant observation of T. Cover '65 which, perhaps, is not as well-known to the broader community as it should be. The first goal of the tutorial is to explain---through the prism of this elementary result---how to solve certain sequence prediction problems by modeling sets of solutions rather than the unknown data-generating mechanism. We extend Cover's observation in several directions and focus on computational aspects of the proposed algorithms. The applicability of the methods is illustrated on several examples, including node classification in a network.

The second aim of this tutorial is to demonstrate the following phenomenon: it is possible to predict as well as a combinatorial ``benchmark'' for which we have a certain multiplicative approximation algorithm, even if the exact computation of the benchmark given all the data is NP-hard. The proposed prediction methods, therefore, circumvent some of the computational difficulties associated with finding the best model given the data. These difficulties arise rather quickly when one attempts to develop a probabilistic model for graph-based or other problems with a combinatorial structure.

We revisit the elegant observation of T. Cover '65 which, perhaps, is not as well-known to the broader community as it should be. The first goal of the tutorial is to explain---through the prism of this elementary result---how to solve certain sequence prediction problems by modeling sets of solutions rather than the unknown data-generating mechanism. We extend Cover's observation in several directions and focus on computational aspects of the proposed algorithms. The applicability of the methods is illustrated on several examples, including node classification in a network.

The second aim of this tutorial is to demonstrate the following phenomenon: it is possible to predict as well as a combinatorial ``benchmark'' for which we have a certain multiplicative approximation algorithm, even if the exact computation of the benchmark given all the data is NP-hard. The proposed prediction methods, therefore, circumvent some of the computational difficulties associated with finding the best model given the data. These difficulties arise rather quickly when one attempts to develop a probabilistic model for graph-based or other problems with a combinatorial structure.

### Information-Theoretic Analysis of Stability and Bias of Learning Algorithms (with M. Raginsky, M. Tsao, Y. Wu, and A. Xu)

IEEE Information Theory Workshop, 2016

Machine learning algorithms can be viewed as stochastic transformations that map training data to hypotheses. Following Bousquet and Elisseeff, we say that such an algorithm is stable if its output does not depend too much on any individual training example. Since stability is closely connected to generalization capabilities of learning algorithms, it is of theoretical and practical interest to obtain sharp quantitative estimates on the generalization bias of machine learning algorithms in terms of their stability properties. We propose several information-theoretic measures of algorithmic stability and use them to upper-bound the generalization bias of learning algorithms. Our framework is complementary to the information-theoretic methodology developed recently by Russo and Zou.

Machine learning algorithms can be viewed as stochastic transformations that map training data to hypotheses. Following Bousquet and Elisseeff, we say that such an algorithm is stable if its output does not depend too much on any individual training example. Since stability is closely connected to generalization capabilities of learning algorithms, it is of theoretical and practical interest to obtain sharp quantitative estimates on the generalization bias of machine learning algorithms in terms of their stability properties. We propose several information-theoretic measures of algorithmic stability and use them to upper-bound the generalization bias of learning algorithms. Our framework is complementary to the information-theoretic methodology developed recently by Russo and Zou.

### Computational and Statistical Boundaries for Submatrix Localization in a Large Noisy Matrix (with T. Cai and T. Liang)

The Annals of Statistics, 2016

The interplay between computational efficiency and statistical accuracy in high-dimensional inference has drawn increasing attention in the literature. In this paper, we study computational and statistical boundaries for submatrix localization. Given one observation of a signal submatrix (of magnitude λ and size k_m x k_n) contaminated with a noise matrix (of size m x n), we establish two transition thresholds for the signal to noise λ/σ ratio in terms of m, n, k_m, and k_n. The first threshold, SNR_c, corresponds to the computational boundary. Below this threshold, it is shown that no polynomial time algorithm can succeed in identifying the submatrix, under the hidden clique hypothesis. We introduce an adaptive linear time algorithm that identifies the submatrix with high probability when the signal strength is above the threshold SNR_c. The second threshold, SNR_s, captures the statistical boundary, below which no method can succeed with probability going to one in the minimax sense. The exhaustive search method successfully finds the submatrix above this threshold. The results show an interesting phenomenon that SNR_c is always significantly larger than SNR_s, which implies an essential gap between statistical optimality and computational efficiency for submatrix localization.

The interplay between computational efficiency and statistical accuracy in high-dimensional inference has drawn increasing attention in the literature. In this paper, we study computational and statistical boundaries for submatrix localization. Given one observation of a signal submatrix (of magnitude λ and size k_m x k_n) contaminated with a noise matrix (of size m x n), we establish two transition thresholds for the signal to noise λ/σ ratio in terms of m, n, k_m, and k_n. The first threshold, SNR_c, corresponds to the computational boundary. Below this threshold, it is shown that no polynomial time algorithm can succeed in identifying the submatrix, under the hidden clique hypothesis. We introduce an adaptive linear time algorithm that identifies the submatrix with high probability when the signal strength is above the threshold SNR_c. The second threshold, SNR_s, captures the statistical boundary, below which no method can succeed with probability going to one in the minimax sense. The exhaustive search method successfully finds the submatrix above this threshold. The results show an interesting phenomenon that SNR_c is always significantly larger than SNR_s, which implies an essential gap between statistical optimality and computational efficiency for submatrix localization.

### BISTRO: An Efficient Relaxation-Based Method for Contextual Bandits (with K. Sridharan)

ICML, 2016

We present efficient algorithms for the problem of contextual bandits with i.i.d. covariates, an arbitrary sequence of rewards, and an arbitrary class of policies. Our algorithm BISTRO requires d calls to the empirical risk minimization (ERM) oracle per round, where d is the number of actions. The method uses unlabeled data to make the problem computationally simple. When the ERM problem itself is computationally hard, we extend the approach by employing multiplicative approximation algorithms for the ERM. The integrality gap of the relaxation only enters in the regret bound rather than the benchmark. Finally, we show that the adversarial version of the contextual bandit problem is learnable (and efficient) whenever the full-information supervised online learning problem has a non-trivial regret guarantee (and efficient).

We present efficient algorithms for the problem of contextual bandits with i.i.d. covariates, an arbitrary sequence of rewards, and an arbitrary class of policies. Our algorithm BISTRO requires d calls to the empirical risk minimization (ERM) oracle per round, where d is the number of actions. The method uses unlabeled data to make the problem computationally simple. When the ERM problem itself is computationally hard, we extend the approach by employing multiplicative approximation algorithms for the ERM. The integrality gap of the relaxation only enters in the regret bound rather than the benchmark. Finally, we show that the adversarial version of the contextual bandit problem is learnable (and efficient) whenever the full-information supervised online learning problem has a non-trivial regret guarantee (and efficient).

### Geometric Inference for General High-Dimensional Linear Inverse Problems (with T. Cai and T. Liang)

The Annals of Statistics, 2016

This paper presents a unified geometric framework for the statistical analysis of a general ill-posed linear inverse model which includes as special cases noisy compressed sensing, sign vector recovery, trace regression, orthogonal matrix estimation, and noisy matrix completion. We propose computationally feasible convex programs for statistical inference including estimation, confidence intervals and hypothesis testing. A theoretical framework is developed to characterize the local estimation rate of convergence and to provide statistical inference guarantees. Our results are built based on the local conic geometry and duality. The difficulty of statistical inference is captured by the geometric characterization of the local tangent cone through the Gaussian width and Sudakov estimate.

This paper presents a unified geometric framework for the statistical analysis of a general ill-posed linear inverse model which includes as special cases noisy compressed sensing, sign vector recovery, trace regression, orthogonal matrix estimation, and noisy matrix completion. We propose computationally feasible convex programs for statistical inference including estimation, confidence intervals and hypothesis testing. A theoretical framework is developed to characterize the local estimation rate of convergence and to provide statistical inference guarantees. Our results are built based on the local conic geometry and duality. The difficulty of statistical inference is captured by the geometric characterization of the local tangent cone through the Gaussian width and Sudakov estimate.

### Inference via Message Passing on Partially Labeled Stochastic Block Models (with T. Cai and T. Liang)

preprint, 2016

We study the community detection and recovery problem in partially-labeled stochastic block models (SBM). We develop a fast linearized message-passing algorithm to reconstruct labels for SBM (with n nodes, k blocks, p,q intra and inter block connectivity) when δ proportion of node labels are revealed. The signal-to-noise ratio SNR(n,k,p,q,δ) is shown to characterize the fundamental limitations of inference via local algorithms. On the one hand, when SNR>1, the linearized message-passing algorithm provides the statistical inference guarantee with mis-classification rate at most exp(-(SNR-1)/2), thus interpolating smoothly between strong and weak consistency. This exponential dependence improves upon the known error rate (SNR-1)^{-1} in the literature on weak recovery. On the other hand, when SNR<1 (for k=2) and SNR<¼ (for general growing k, we prove that local algorithms suffer an error rate at least ½ - √ δ SNR , which is only slightly better than random guess for small δ.

We study the community detection and recovery problem in partially-labeled stochastic block models (SBM). We develop a fast linearized message-passing algorithm to reconstruct labels for SBM (with n nodes, k blocks, p,q intra and inter block connectivity) when δ proportion of node labels are revealed. The signal-to-noise ratio SNR(n,k,p,q,δ) is shown to characterize the fundamental limitations of inference via local algorithms. On the one hand, when SNR>1, the linearized message-passing algorithm provides the statistical inference guarantee with mis-classification rate at most exp(-(SNR-1)/2), thus interpolating smoothly between strong and weak consistency. This exponential dependence improves upon the known error rate (SNR-1)^{-1} in the literature on weak recovery. On the other hand, when SNR<1 (for k=2) and SNR<¼ (for general growing k, we prove that local algorithms suffer an error rate at least ½ - √ δ SNR , which is only slightly better than random guess for small δ.

### Adaptive Online Learning (with D. Foster and K. Sridharan)

NIPS, 2015

We propose a general framework for studying adaptive regret bounds in the online learning framework, including model selection bounds and data-dependent bounds. Given a data- or model-dependent bound we ask, "Does there exist some algorithm achieving this bound?" We show that modifications to recently introduced sequential complexity measures can be used to answer this question by providing sufficient conditions under which adaptive rates can be achieved. In particular each adaptive rate induces a set of so-called offset complexity measures, and obtaining small upper bounds on these quantities is sufficient to demonstrate achievability. A cornerstone of our analysis technique is the use of one-sided tail inequalities to bound suprema of offset random processes.

Our framework recovers and improves a wide variety of adaptive bounds including quantile bounds, second-order data-dependent bounds, and small loss bounds. In addition we derive a new type of adaptive bound for online linear optimization based on the spectral norm, as well as a new online PAC-Bayes theorem that holds for countably infinite sets.

We propose a general framework for studying adaptive regret bounds in the online learning framework, including model selection bounds and data-dependent bounds. Given a data- or model-dependent bound we ask, "Does there exist some algorithm achieving this bound?" We show that modifications to recently introduced sequential complexity measures can be used to answer this question by providing sufficient conditions under which adaptive rates can be achieved. In particular each adaptive rate induces a set of so-called offset complexity measures, and obtaining small upper bounds on these quantities is sufficient to demonstrate achievability. A cornerstone of our analysis technique is the use of one-sided tail inequalities to bound suprema of offset random processes.

Our framework recovers and improves a wide variety of adaptive bounds including quantile bounds, second-order data-dependent bounds, and small loss bounds. In addition we derive a new type of adaptive bound for online linear optimization based on the spectral norm, as well as a new online PAC-Bayes theorem that holds for countably infinite sets.

### Escaping the Local Minima via Simulated Annealing: Optimization of Approximately Convex Functions (with A. Belloni, H. Narayanan, and T. Liang)

COLT, 2015

We consider the problem of optimizing an approximately convex function over a bounded convex set in R^n using only function evaluations. The problem is reduced to sampling from an

We consider the problem of optimizing an approximately convex function over a bounded convex set in R^n using only function evaluations. The problem is reduced to sampling from an

*approximately*log-concave distribution using the Hit-and-Run method, with query complexity of O^*(n^4.5). In the context of zeroth order stochastic convex optimization, the proposed method produces an ε-minimizer after O^*(n^7.5 ε^{-2}) noisy function evaluations by inducing a O(ε/n)-approximately log concave distribution. We also consider the case when the ``amount of non-convexity'' decays towards the optimum of the function. Other applications of the random walk method include private computation of empirical risk minimizers, two-stage stochastic programming, and approximate dynamic programming for online learning.

### Hierarchies of Relaxations for Online Prediction Problems with Evolving Constraints (with K. Sridharan)

COLT, 2015

We study online prediction where regret of the algorithm is measured against a benchmark defined via evolving constraints. This framework captures online prediction on graphs, as well as other prediction problems with combinatorial structure. A key aspect here is that finding the optimal benchmark predictor (even in hindsight, given all the data) might be computationally hard due to the combinatorial nature of the constraints. Despite this, we provide polynomial-time

We study online prediction where regret of the algorithm is measured against a benchmark defined via evolving constraints. This framework captures online prediction on graphs, as well as other prediction problems with combinatorial structure. A key aspect here is that finding the optimal benchmark predictor (even in hindsight, given all the data) might be computationally hard due to the combinatorial nature of the constraints. Despite this, we provide polynomial-time

*prediction*algorithms that achieve low regret against combinatorial benchmark sets. We do so by building improper learning algorithms based on two ideas that work together. The first is to alleviate part of the computational burden through random playout, and the second is to employ Lasserre semidefinite hierarchies to approximate the resulting integer program. Interestingly, for our prediction algorithms, we only need to compute the values of the semidefinite programs and not the rounded solutions. However, the integrality gap for Lasserre hierarchy

*does*enter the generic regret bound in terms of Rademacher complexity of the benchmark set. This establishes a trade-off between the computation time and the regret bound of the algorithm.

### Learning with Square Loss: Localization through Offset Rademacher Complexity (with T. Liang and K. Sridharan)

COLT, 2015

We consider regression with square loss and general classes of functions without the boundedness assumption. We introduce a notion of offset Rademacher complexity that provides a transparent way to study localization both in expectation and in high probability. For any (possibly non-convex) class, the excess loss of a two-step estimator is shown to be upper bounded by this offset complexity through a novel geometric inequality. In the convex case, the estimator reduces to an empirical risk minimizer. The method recovers the results of (Rakhlin, Sridharan, Tsybakov 2015) for the bounded case while also providing guarantees without the boundedness assumption.

We consider regression with square loss and general classes of functions without the boundedness assumption. We introduce a notion of offset Rademacher complexity that provides a transparent way to study localization both in expectation and in high probability. For any (possibly non-convex) class, the excess loss of a two-step estimator is shown to be upper bounded by this offset complexity through a novel geometric inequality. In the convex case, the estimator reduces to an empirical risk minimizer. The method recovers the results of (Rakhlin, Sridharan, Tsybakov 2015) for the bounded case while also providing guarantees without the boundedness assumption.

### Efficient Sampling from Time-Varying Distributions (with H. Narayanan)

Journal of Machine Learning Research, 2015

We propose a computationally efficient random walk on a convex body which rapidly mixes and closely tracks a time-varying log-concave distribution. We develop general theoretical guarantees on the required number of steps; this number can be calculated on the fly according to the distance from and the shape of the next distribution. We then illustrate the technique on several examples. Within the context of exponential families, the proposed method produces samples from a posterior distribution which is updated as data arrive in a streaming fashion. The sampling technique can be used to track time-varying truncated distributions, as well as to obtain samples from a changing mixture model, fitted in a streaming fashion to data. In the setting of linear optimization, the proposed method has oracle complexity with best known dependence on the dimension for certain geometries. In the context of online learning and repeated games, the algorithm is an efficient method for implementing no-regret mixture forecasting strategies. Remarkably, in some of these examples, only one step of the random walk is needed to track the next distribution.

We propose a computationally efficient random walk on a convex body which rapidly mixes and closely tracks a time-varying log-concave distribution. We develop general theoretical guarantees on the required number of steps; this number can be calculated on the fly according to the distance from and the shape of the next distribution. We then illustrate the technique on several examples. Within the context of exponential families, the proposed method produces samples from a posterior distribution which is updated as data arrive in a streaming fashion. The sampling technique can be used to track time-varying truncated distributions, as well as to obtain samples from a changing mixture model, fitted in a streaming fashion to data. In the setting of linear optimization, the proposed method has oracle complexity with best known dependence on the dimension for certain geometries. In the context of online learning and repeated games, the algorithm is an efficient method for implementing no-regret mixture forecasting strategies. Remarkably, in some of these examples, only one step of the random walk is needed to track the next distribution.

### Distributed Detection : Finite-time Analysis and Impact of Network Topology (with S. Shahrampour and A. Jadbabaie)

IEEE Transactions on Automatic Control, 2015

This paper addresses the problem of distributed detection in multi-agent networks. Agents receive private signals about an unknown state of the world. The underlying state is globally identifiable; however, informative signals may be dispersed throughout the network. Using an optimization-based framework, we develop an iterative local strategy for updating individual beliefs. In contrast to the existing literature which focuses on asymptotic learning, we provide a finite-time analysis. Furthermore, we introduce a Kullback-Leibler cost to compare the efficiency of the algorithm to its centralized counterpart. Our bounds on the cost are expressed in terms of network size, spectral gap, centrality of each agent and relative entropy of agents’ signal structures. A key observation is that distributing more informative signals to central agents undertakes a faster learning rate. Furthermore, modulating the communication weights among agents, we prove that spectral gap can be enhanced to speed up learning. We also study link failure in symmetric networks where we establish that disconnecting agents has detrimental effect on learning quality. We finally provide numerical simulations for our method which verifies our theoretical results.

This paper addresses the problem of distributed detection in multi-agent networks. Agents receive private signals about an unknown state of the world. The underlying state is globally identifiable; however, informative signals may be dispersed throughout the network. Using an optimization-based framework, we develop an iterative local strategy for updating individual beliefs. In contrast to the existing literature which focuses on asymptotic learning, we provide a finite-time analysis. Furthermore, we introduce a Kullback-Leibler cost to compare the efficiency of the algorithm to its centralized counterpart. Our bounds on the cost are expressed in terms of network size, spectral gap, centrality of each agent and relative entropy of agents’ signal structures. A key observation is that distributing more informative signals to central agents undertakes a faster learning rate. Furthermore, modulating the communication weights among agents, we prove that spectral gap can be enhanced to speed up learning. We also study link failure in symmetric networks where we establish that disconnecting agents has detrimental effect on learning quality. We finally provide numerical simulations for our method which verifies our theoretical results.

### On Online Optimization : Competing with Dynamic Comparators (with A. Jadbabaie, S. Shahrampour, and K. Sridharan)

AISTATS, 2015

Recent literature on online learning has focused on developing adaptive algorithms that take advantage of regularity of the sequence of observations, yet retain worst-case performance guarantees. A complementary direction is to develop prediction methods that perform well against complex benchmarks. In this paper, we address these two directions together. We present a fully adaptive method that competes with dynamic benchmarks in which regret guarantee scales with regularity of the sequence of cost functions and comparators. Notably, the regret bound adapts to the smaller complexity measure in the problem environment. Finally, we apply our results to drifting zero-sum, two-player games where both players achieve no regret guarantees against best sequences of actions in hindsight.

Recent literature on online learning has focused on developing adaptive algorithms that take advantage of regularity of the sequence of observations, yet retain worst-case performance guarantees. A complementary direction is to develop prediction methods that perform well against complex benchmarks. In this paper, we address these two directions together. We present a fully adaptive method that competes with dynamic benchmarks in which regret guarantee scales with regularity of the sequence of cost functions and comparators. Notably, the regret bound adapts to the smaller complexity measure in the problem environment. Finally, we apply our results to drifting zero-sum, two-player games where both players achieve no regret guarantees against best sequences of actions in hindsight.

### Online Nonparametric Regression with General Loss Functions (with K. Sridharan)

preprint, 2015

This paper establishes minimax rates for online regression with arbitrary classes of functions and general losses. We show that below a certain threshold for the complexity of the function class, the minimax rates depend on both the curvature of the loss function and the sequential complexities of the class. Above this threshold, the curvature of the loss does not affect the rates. Furthermore, for the case of square loss, our results point to the interesting phenomenon: whenever sequential and i.i.d. empirical entropies match, the rates for statistical and online learning are the same.

In addition to the study of minimax regret, we derive a generic forecaster that enjoys the established optimal rates. We also provide a recipe for designing online prediction algorithms that can be computationally efficient for certain problems. We illustrate the techniques by deriving existing and new forecasters for the case of finite experts and for online linear regression.

This paper establishes minimax rates for online regression with arbitrary classes of functions and general losses. We show that below a certain threshold for the complexity of the function class, the minimax rates depend on both the curvature of the loss function and the sequential complexities of the class. Above this threshold, the curvature of the loss does not affect the rates. Furthermore, for the case of square loss, our results point to the interesting phenomenon: whenever sequential and i.i.d. empirical entropies match, the rates for statistical and online learning are the same.

In addition to the study of minimax regret, we derive a generic forecaster that enjoys the established optimal rates. We also provide a recipe for designing online prediction algorithms that can be computationally efficient for certain problems. We illustrate the techniques by deriving existing and new forecasters for the case of finite experts and for online linear regression.

### Sequential Probability Assignment with Binary Alphabets and Large Classes of Experts (with K. Sridharan)

preprint, 2015

We analyze the problem of sequential probability assignment for binary outcomes with side information and logarithmic loss, where regret---or, redundancy---is measured with respect to a (possibly infinite) class of experts. We provide upper and lower bounds for minimax regret in terms of sequential complexities of the class. These complexities were recently shown to give matching (up to logarithmic factors) upper and lower bounds for sequential prediction with general convex Lipschitz loss functions (Rakhlin and Sridharan, 2015). To deal with unbounded gradients of the logarithmic loss, we present a new analysis that employs a sequential chaining technique with a Bernstein-type bound. The introduced complexities are intrinsic to the problem of sequential probability assignment, as illustrated by our lower bound.

We also consider an example of a large class of experts parametrized by vectors in a high-dimensional Euclidean ball (or a Hilbert ball). The typical discretization approach fails, while our techniques give a non-trivial bound. For this problem we also present an algorithm based on regularization with a self-concordant barrier. This algorithm is of an independent interest, as it requires a bound on the function values rather than gradients.

We analyze the problem of sequential probability assignment for binary outcomes with side information and logarithmic loss, where regret---or, redundancy---is measured with respect to a (possibly infinite) class of experts. We provide upper and lower bounds for minimax regret in terms of sequential complexities of the class. These complexities were recently shown to give matching (up to logarithmic factors) upper and lower bounds for sequential prediction with general convex Lipschitz loss functions (Rakhlin and Sridharan, 2015). To deal with unbounded gradients of the logarithmic loss, we present a new analysis that employs a sequential chaining technique with a Bernstein-type bound. The introduced complexities are intrinsic to the problem of sequential probability assignment, as illustrated by our lower bound.

We also consider an example of a large class of experts parametrized by vectors in a high-dimensional Euclidean ball (or a Hilbert ball). The typical discretization approach fails, while our techniques give a non-trivial bound. For this problem we also present an algorithm based on regularization with a self-concordant barrier. This algorithm is of an independent interest, as it requires a bound on the function values rather than gradients.

### On Martingale Extensions of Vapnik-Chervonenkis Theory with Applications to Online Learning (with K. Sridharan)

Festschrift in honor of A. Chervonenkis, 2014

We review recent advances on uniform martingale laws of large numbers and the associated sequential complexity measures. These results may be considered as forming a non-i.i.d. generalization of Vapnik-Chervonenkis theory. We discuss applications to online learning, provide a recipe for designing online learning algorithms, and illustrate the techniques on the problem of online node classification. We outline connections to statistical learning theory and discuss inductive principles of stochastic approximation and empirical risk minimization.

We review recent advances on uniform martingale laws of large numbers and the associated sequential complexity measures. These results may be considered as forming a non-i.i.d. generalization of Vapnik-Chervonenkis theory. We discuss applications to online learning, provide a recipe for designing online learning algorithms, and illustrate the techniques on the problem of online node classification. We outline connections to statistical learning theory and discuss inductive principles of stochastic approximation and empirical risk minimization.

### Online Nonparametric Regression (with K. Sridharan)

COLT, 2014

We establish optimal rates for online regression for arbitrary classes of regression functions in terms of the sequential entropy introduced in (Rakhlin, Sridharan, Tewari, 2010). The optimal rates are shown to exhibit a phase transition analogous to the i.i.d./statistical learning case, studied in (Rakhlin, Sridharan, Tsybakov 2013). In the frequently encountered situation when sequential entropy and i.i.d. empirical entropy match, our results point to the interesting phenomenon that the rates for statistical learning with squared loss and online nonparametric regression are the same.

In addition to a non-algorithmic study of minimax regret, we exhibit a generic forecaster that enjoys the established optimal rates. We also provide a recipe for designing online regression algorithms that can be computationally efficient. We illustrate the techniques by deriving existing and new forecasters for the case of finite experts and for online linear regression.

We establish optimal rates for online regression for arbitrary classes of regression functions in terms of the sequential entropy introduced in (Rakhlin, Sridharan, Tewari, 2010). The optimal rates are shown to exhibit a phase transition analogous to the i.i.d./statistical learning case, studied in (Rakhlin, Sridharan, Tsybakov 2013). In the frequently encountered situation when sequential entropy and i.i.d. empirical entropy match, our results point to the interesting phenomenon that the rates for statistical learning with squared loss and online nonparametric regression are the same.

In addition to a non-algorithmic study of minimax regret, we exhibit a generic forecaster that enjoys the established optimal rates. We also provide a recipe for designing online regression algorithms that can be computationally efficient. We illustrate the techniques by deriving existing and new forecasters for the case of finite experts and for online linear regression.

### Entropy, Minimax Regret and Minimax Risk (with K. Sridharan and A. Tsybakov)

Bernoulli Journal, 2014

We consider the random design regression model with square loss. We propose a method that aggregates empirical minimizers (ERM) over appropriately chosen random subsets and reduces to ERM in the extreme case, and we establish sharp oracle inequalities for its risk. We show that, under the $\epsilon^{-p}$ growth of the empirical $\epsilon$-entropy, the excess risk of the proposed method attains the rate $n^{-\frac{2}{2+p}}$ for $p\in(0,2]$ and $n^{-1/p}$ for $p> 2$ where $n$ is the sample size. Furthermore, for $p\in(0,2]$, the excess risk rate matches the behavior of the minimax risk of function estimation in regression problems under the well-specified model. This yields a conclusion that the rates of statistical estimation in well-specified models (minimax risk) and in misspecified models (minimax regret) are equivalent in the regime $p\in(0,2]$. In other words, for $p\in(0,2]$ the problem of statistical learning enjoys the same minimax rate as the problem of statistical estimation. On the contrary, for $p>2$ we show that the rates of the minimax regret are, in general, slower than for the minimax risk. Our oracle inequalities also imply the $v\log(n/v)/n$ rates for Vapnik-Chervonenkis type classes of dimension $v$ without the usual convexity assumption on the class; we show that these rates are optimal. Finally, for a slightly modified method, we derive a bound on the excess risk of $s$-sparse convex aggregation improving that of (Lounici, 07) and providing the optimal rate.

We consider the random design regression model with square loss. We propose a method that aggregates empirical minimizers (ERM) over appropriately chosen random subsets and reduces to ERM in the extreme case, and we establish sharp oracle inequalities for its risk. We show that, under the $\epsilon^{-p}$ growth of the empirical $\epsilon$-entropy, the excess risk of the proposed method attains the rate $n^{-\frac{2}{2+p}}$ for $p\in(0,2]$ and $n^{-1/p}$ for $p> 2$ where $n$ is the sample size. Furthermore, for $p\in(0,2]$, the excess risk rate matches the behavior of the minimax risk of function estimation in regression problems under the well-specified model. This yields a conclusion that the rates of statistical estimation in well-specified models (minimax risk) and in misspecified models (minimax regret) are equivalent in the regime $p\in(0,2]$. In other words, for $p\in(0,2]$ the problem of statistical learning enjoys the same minimax rate as the problem of statistical estimation. On the contrary, for $p>2$ we show that the rates of the minimax regret are, in general, slower than for the minimax risk. Our oracle inequalities also imply the $v\log(n/v)/n$ rates for Vapnik-Chervonenkis type classes of dimension $v$ without the usual convexity assumption on the class; we show that these rates are optimal. Finally, for a slightly modified method, we derive a bound on the excess risk of $s$-sparse convex aggregation improving that of (Lounici, 07) and providing the optimal rate.

### Sequential Complexities and Uniform Martingale Laws of Large Numbers (with K. Sridharan and A. Tewari)

Probability Theory and Related Fields, 2014

We establish necessary and sufficient conditions for a uniform martingale Law of Large Numbers. We extend the technique of symmetrization to the case of dependent random variables and provide ``sequential'' (non-i.i.d.) analogues of various classical measures of complexity, such as covering numbers and combinatorial dimensions from empirical process theory. We establish relationships between these various sequential complexity measures and show that they provide a tight control on the uniform convergence rates for empirical processes with dependent data. As a direct application of our results, we provide exponential inequalities for sums of martingale differences in Banach spaces.

We demonstrate the utility of these results in two domains. First, we consider the problem of sequential prediction. Analogous to the role of classical empirical process theory in statistical learning (with i.i.d. data), the developed theory is shown to yield precise learning guarantees for the problem of sequential prediction. In particular, the minimax learning rate is shown to be tightly controlled by the universal uniform convergence rates for empirical processes. As a second (direct) application of our results, we provide exponential inequalities for sums of martingale difference sequences in Banach spaces.

We establish necessary and sufficient conditions for a uniform martingale Law of Large Numbers. We extend the technique of symmetrization to the case of dependent random variables and provide ``sequential'' (non-i.i.d.) analogues of various classical measures of complexity, such as covering numbers and combinatorial dimensions from empirical process theory. We establish relationships between these various sequential complexity measures and show that they provide a tight control on the uniform convergence rates for empirical processes with dependent data. As a direct application of our results, we provide exponential inequalities for sums of martingale differences in Banach spaces.

We demonstrate the utility of these results in two domains. First, we consider the problem of sequential prediction. Analogous to the role of classical empirical process theory in statistical learning (with i.i.d. data), the developed theory is shown to yield precise learning guarantees for the problem of sequential prediction. In particular, the minimax learning rate is shown to be tightly controlled by the universal uniform convergence rates for empirical processes. As a second (direct) application of our results, we provide exponential inequalities for sums of martingale difference sequences in Banach spaces.

### Online Learning via Sequential Complexities (with K. Sridharan and A. Tewari)

Journal of Machine Learning Research, 2014

We consider the problem of sequential prediction and provide tools to study the minimax value of the associated game. Classical statistical learning theory provides several useful complexity measures to study learning with i.i.d. data. Our proposed sequential complexities can be seen as extensions of these measures to the sequential setting. The developed theory is shown to yield precise learning guarantees for the problem of sequential prediction. In particular, we show necessary and sufficient conditions for online learnability in the setting of supervised learning. Several examples show the utility of our framework: we can establish learnability without having to exhibit an explicit online learning algorithm.

We consider the problem of sequential prediction and provide tools to study the minimax value of the associated game. Classical statistical learning theory provides several useful complexity measures to study learning with i.i.d. data. Our proposed sequential complexities can be seen as extensions of these measures to the sequential setting. The developed theory is shown to yield precise learning guarantees for the problem of sequential prediction. In particular, we show necessary and sufficient conditions for online learnability in the setting of supervised learning. Several examples show the utility of our framework: we can establish learnability without having to exhibit an explicit online learning algorithm.

### Partial monitoring -- classification, regret bounds, and algorithms (with G. Bartók, D. Foster, D. Pál, and C. Szepesvári)

Mathematics of Operations Research, 2014

In a partial monitoring game, the learner repeatedly chooses an action, the environment responds with an outcome, and then the learner suffers a loss and receives a feedback signal, both of which are fixed functions of the action and the outcome. The goal of the learner is to minimize his regret, which is the difference between his total cumulative loss and the total loss of the best fixed action in hindsight. In this paper we characterize the minimax regret of any partial monitoring game with finitely many actions and outcomes. It turns out that the minimax regret of any such game is either zero, Θ(√ T ), Θ(T^{2/3}), or Θ(T). We provide computationally efficient learning algorithms that achieve the minimax regret within logarithmic factor for any game. In addition to the bounds on the minimax regret, if we assume that the outcomes are generated in an i.i.d. fashion, we prove individual upper bounds on the expected regret.

In a partial monitoring game, the learner repeatedly chooses an action, the environment responds with an outcome, and then the learner suffers a loss and receives a feedback signal, both of which are fixed functions of the action and the outcome. The goal of the learner is to minimize his regret, which is the difference between his total cumulative loss and the total loss of the best fixed action in hindsight. In this paper we characterize the minimax regret of any partial monitoring game with finitely many actions and outcomes. It turns out that the minimax regret of any such game is either zero, Θ(√ T ), Θ(T^{2/3}), or Θ(T). We provide computationally efficient learning algorithms that achieve the minimax regret within logarithmic factor for any game. In addition to the bounds on the minimax regret, if we assume that the outcomes are generated in an i.i.d. fashion, we prove individual upper bounds on the expected regret.

### On Zeroth-Order Stochastic Convex Optimization via Random Walks (with T. Liang and H. Narayanan)

preprint, 2014

We propose a method for zeroth order stochastic convex optimization that attains the suboptimality rate of n^{7}T^{-1/2} after T queries for a convex bounded function f:R^n \to R. The method is based on a random walk (the

We propose a method for zeroth order stochastic convex optimization that attains the suboptimality rate of n^{7}T^{-1/2} after T queries for a convex bounded function f:R^n \to R. The method is based on a random walk (the

*Ball Walk*) on the epigraph of the function. The randomized approach circumvents the problem of gradient estimation, and appears to be less sensitive to noisy function evaluations compared to noiseless zeroth order methods.

### On Semi-Probabilistic Universal Prediction (with K. Sridharan)

IEEE Information Theory Workshop, 2013

We discuss two scenarios of universal prediction, as well as some recent advances in the study of minimax regret and algorithmic development. We then propose an intermediate scenario, the Semi-Probabilistic Setting, and make progress towards understanding the associated minimax regret.

We discuss two scenarios of universal prediction, as well as some recent advances in the study of minimax regret and algorithmic development. We then propose an intermediate scenario, the Semi-Probabilistic Setting, and make progress towards understanding the associated minimax regret.

### Optimization, Learning, and Games with Predictable Sequences (with K. Sridharan)

NIPS, 2013

We provide several applications of Optimistic Mirror Descent, an online learning algorithm based on the idea of predictable sequences. First, we recover the Mirror Prox algorithm for offline optimization, prove an extension to Holder-smooth functions, and apply the results to saddle-point type problems. Next, we prove that a version of Optimistic Mirror Descent (which has a close relation to the Exponential Weights algorithm) can be used by two strongly-uncoupled players in a finite zero-sum matrix game to converge to the minimax equilibrium at the rate of O((log T)/T). This addresses a question of Daskalakis et al 2011. Further, we consider a partial information version of the problem. We then apply the results to convex programming and exhibit a simple algorithm for the approximate Max Flow problem.

We provide several applications of Optimistic Mirror Descent, an online learning algorithm based on the idea of predictable sequences. First, we recover the Mirror Prox algorithm for offline optimization, prove an extension to Holder-smooth functions, and apply the results to saddle-point type problems. Next, we prove that a version of Optimistic Mirror Descent (which has a close relation to the Exponential Weights algorithm) can be used by two strongly-uncoupled players in a finite zero-sum matrix game to converge to the minimax equilibrium at the rate of O((log T)/T). This addresses a question of Daskalakis et al 2011. Further, we consider a partial information version of the problem. We then apply the results to convex programming and exhibit a simple algorithm for the approximate Max Flow problem.

### Online Learning of Dynamic Parameters in Social Networks (with S. Shahrampour and A. Jadbabaie)

NIPS, 2013

This paper addresses the problem of online learning in a dynamic setting. We consider a social network in which each individual observes a private signal about the underlying state of the world and communicates with her neighbors at each time period. Unlike many existing approaches, the underlying state is dynamic, and evolves according to a geometric random walk. We view the scenario as an optimization problem where agents aim to learn the true state while suffering the smallest possible loss. Based on the decomposition of the global loss function, we introduce two update mechanisms, each of which generates an estimate of the true state. We establish a tight bound on the rate of change of the underlying state, under which individuals can track the parameter with a bounded variance. Then, we characterize explicit expressions for the steady state mean-square deviation(MSD) of the estimates from the truth, per individual. We observe that only one of the estimators recovers the optimal MSD, which underscores the impact of the objective function decomposition on the learning quality. Finally, we provide an upper bound on the regret of the proposed methods, measured as an average of errors in estimating the parameter in a finite time.

This paper addresses the problem of online learning in a dynamic setting. We consider a social network in which each individual observes a private signal about the underlying state of the world and communicates with her neighbors at each time period. Unlike many existing approaches, the underlying state is dynamic, and evolves according to a geometric random walk. We view the scenario as an optimization problem where agents aim to learn the true state while suffering the smallest possible loss. Based on the decomposition of the global loss function, we introduce two update mechanisms, each of which generates an estimate of the true state. We establish a tight bound on the rate of change of the underlying state, under which individuals can track the parameter with a bounded variance. Then, we characterize explicit expressions for the steady state mean-square deviation(MSD) of the estimates from the truth, per individual. We observe that only one of the estimators recovers the optimal MSD, which underscores the impact of the objective function decomposition on the learning quality. Finally, we provide an upper bound on the regret of the proposed methods, measured as an average of errors in estimating the parameter in a finite time.

### Competing with Strategies (with W. Han and K. Sridharan)

COLT, 2013

We study the problem of online learning with a notion of regret defined with respect to a set of strategies. We develop tools for analyzing the minimax rates and for deriving regret-minimization algorithms in this scenario. While the standard methods for minimizing the usual notion of regret fail, through our analysis we demonstrate existence of regret-minimization methods that compete with such sets of strategies as: autoregressive algorithms, strategies based on statistical models, regularized least squares, and follow the regularized leader strategies. In several cases we also derive efficient learning algorithms.

We study the problem of online learning with a notion of regret defined with respect to a set of strategies. We develop tools for analyzing the minimax rates and for deriving regret-minimization algorithms in this scenario. While the standard methods for minimizing the usual notion of regret fail, through our analysis we demonstrate existence of regret-minimization methods that compete with such sets of strategies as: autoregressive algorithms, strategies based on statistical models, regularized least squares, and follow the regularized leader strategies. In several cases we also derive efficient learning algorithms.

### Online Learning with Predictable Sequences (with K. Sridharan)

COLT, 2013

We present methods for online linear optimization that take advantage of benign (as opposed to worst-case) sequences. Specifically if the sequence encountered by the learner is described well by a known ``predictable process'', the algorithms presented enjoy tighter bounds as compared to the typical worst case bounds. Additionally, the methods achieve the usual worst-case regret bounds if the sequence is not benign. Our approach can be seen as a way of adding prior knowledge about the sequence within the paradigm of online learning. The setting is shown to encompass partial and side information. Variance and path-length bounds (Hazan and Kale, 2010, Chiang et al 2012) can be seen as particular examples of online learning with simple predictable sequences.

We further extend our methods and results to include competing with a set of possible predictable processes (models), that is ``learning'' the predictable process itself concurrently with using it to obtain better regret guarantees. We show that such model selection is possible under various assumptions on the available feedback. Our results suggest a promising direction of further research with potential applications to stock market and time series prediction.

We present methods for online linear optimization that take advantage of benign (as opposed to worst-case) sequences. Specifically if the sequence encountered by the learner is described well by a known ``predictable process'', the algorithms presented enjoy tighter bounds as compared to the typical worst case bounds. Additionally, the methods achieve the usual worst-case regret bounds if the sequence is not benign. Our approach can be seen as a way of adding prior knowledge about the sequence within the paradigm of online learning. The setting is shown to encompass partial and side information. Variance and path-length bounds (Hazan and Kale, 2010, Chiang et al 2012) can be seen as particular examples of online learning with simple predictable sequences.

We further extend our methods and results to include competing with a set of possible predictable processes (models), that is ``learning'' the predictable process itself concurrently with using it to obtain better regret guarantees. We show that such model selection is possible under various assumptions on the available feedback. Our results suggest a promising direction of further research with potential applications to stock market and time series prediction.

### Localization and Adaptation in Online Learning (with O. Shamir and K. Sridharan)

AISTATS, 2013

We introduce a formalism of localization for online learning problems. Similarly to statistical learning theory, localization can be used to obtain fast rates. We introduce local sequential Rademacher complexities and other local measures. Based on the idea of relaxations for deriving algorithms, we provide a template method that takes advantage of localization. Furthermore, we build a general adaptive method that can take advantage of the suboptimality of the observed sequence. We illustrate the utility of the introduced concepts on several problems. Among them is an upper bound on regret in terms of classical Rademacher complexity when the data are i.i.d.

We introduce a formalism of localization for online learning problems. Similarly to statistical learning theory, localization can be used to obtain fast rates. We introduce local sequential Rademacher complexities and other local measures. Based on the idea of relaxations for deriving algorithms, we provide a template method that takes advantage of localization. Furthermore, we build a general adaptive method that can take advantage of the suboptimality of the observed sequence. We illustrate the utility of the introduced concepts on several problems. Among them is an upper bound on regret in terms of classical Rademacher complexity when the data are i.i.d.

### Stochastic Convex Optimization with Bandit Feedback (with A. Agarwal, D. Foster, D. Hsu, and S. Kakade)

SIAM Journal on Optimization, 2013

This paper addresses the problem of minimizing a convex, Lipschitz function f over a convex, compact set X under a stochastic bandit feedback model. In this model, the algorithm is allowed to observed noisy realizations of the function value f(x) at any query point x in X. The quantity of interest is regret of the algorithm, which is the sum of the function values at algorithm's query points minus the optimal function value. We demonstrate a generalization of the ellipsoid algorithm that incurs poly(d)√ T regret. Since any algorithm has regret at least Ω(√ T ) on this problem, our algorithm is optimal in terms of the scaling with T.

This paper addresses the problem of minimizing a convex, Lipschitz function f over a convex, compact set X under a stochastic bandit feedback model. In this model, the algorithm is allowed to observed noisy realizations of the function value f(x) at any query point x in X. The quantity of interest is regret of the algorithm, which is the sum of the function values at algorithm's query points minus the optimal function value. We demonstrate a generalization of the ellipsoid algorithm that incurs poly(d)√ T regret. Since any algorithm has regret at least Ω(√ T ) on this problem, our algorithm is optimal in terms of the scaling with T.

### Relax and Randomize: From Value to Algorithms (with O. Shamir and K. Sridharan)

NIPS, 2012

We show a principled way of deriving online learning algorithms from a minimax analysis. Various upper bounds on the minimax value, previously thought to be non-constructive, are shown to yield algorithms. This allows us to seamlessly recover known methods and to derive new ones. Our framework also captures such ``unorthodox'' methods as Follow the Perturbed Leader and the R^2 forecaster. We emphasize that understanding the inherent complexity of the learning problem leads to the development of algorithms.

We define

We present a number of new algorithms, including a family of randomized methods that use the idea of a ``random playout''. Several new versions of the Follow-the-Perturbed-Leader algorithms are presented, as well as methods based on the Littlestone's dimension, efficient methods for matrix completion with trace norm, and algorithms for the problems of transductive learning and prediction with static experts.

We show a principled way of deriving online learning algorithms from a minimax analysis. Various upper bounds on the minimax value, previously thought to be non-constructive, are shown to yield algorithms. This allows us to seamlessly recover known methods and to derive new ones. Our framework also captures such ``unorthodox'' methods as Follow the Perturbed Leader and the R^2 forecaster. We emphasize that understanding the inherent complexity of the learning problem leads to the development of algorithms.

We define

*local*sequential Rademacher complexities and associated algorithms that allow us to obtain faster rates in online learning, similarly to statistical learning theory. Based on these localized complexities we build a general adaptive method that can take advantage of the suboptimality of the observed sequence.

We present a number of new algorithms, including a family of randomized methods that use the idea of a ``random playout''. Several new versions of the Follow-the-Perturbed-Leader algorithms are presented, as well as methods based on the Littlestone's dimension, efficient methods for matrix completion with trace norm, and algorithms for the problems of transductive learning and prediction with static experts.

### Making Stochastic Gradient Descent Optimal for Strongly Convex Problems (with O. Shamir and K. Sridharan)

ICML, 2012

Stochastic gradient descent (SGD) is a simple and popular method to solve stochastic optimization problems which arise in machine learning. For strongly convex problems, its convergence rate was known to be O(log(T)/T), by running SGD for T iterations and returning the average point. However, recent results showed that using a different algorithm, one can get an optimal O(1/T) rate. This might lead one to believe that standard SGD is suboptimal, and maybe should even be replaced as a method of choice. In this paper, we investigate the optimality of SGD in a stochastic setting. We show that for smooth problems, the algorithm attains the optimal O(1/T) rate. However, for non-smooth problems, the convergence rate with averaging might really be Ω(log(T)/T), and this is not just an artifact of the analysis. On the flip side, we show that a simple modification of the averaging step suffices to recover the O(1/T) rate, and no other change of the algorithm is necessary. We also present experimental results which support our findings, and point out open problems.

Stochastic gradient descent (SGD) is a simple and popular method to solve stochastic optimization problems which arise in machine learning. For strongly convex problems, its convergence rate was known to be O(log(T)/T), by running SGD for T iterations and returning the average point. However, recent results showed that using a different algorithm, one can get an optimal O(1/T) rate. This might lead one to believe that standard SGD is suboptimal, and maybe should even be replaced as a method of choice. In this paper, we investigate the optimality of SGD in a stochastic setting. We show that for smooth problems, the algorithm attains the optimal O(1/T) rate. However, for non-smooth problems, the convergence rate with averaging might really be Ω(log(T)/T), and this is not just an artifact of the analysis. On the flip side, we show that a simple modification of the averaging step suffices to recover the O(1/T) rate, and no other change of the algorithm is necessary. We also present experimental results which support our findings, and point out open problems.

### Interior-Point Methods for Full-Information and Bandit Online Learning (with J. Abernethy and E. Hazan)

IEEE Transactions on Information Theory, 2012

We study the problem of predicting individual sequences with linear loss with full and partial (or bandit) feed- back. Our main contribution is the first efficient algorithm for the problem of online linear optimization in the bandit setting which achieves the optimal Õ(√(T)) regret. In addition, for the full-information setting, we give a novel regret minimization algorithm. These results are made possible by the introduction of interior-point methods for convex optimization to online learning.

We study the problem of predicting individual sequences with linear loss with full and partial (or bandit) feed- back. Our main contribution is the first efficient algorithm for the problem of online linear optimization in the bandit setting which achieves the optimal Õ(√(T)) regret. In addition, for the full-information setting, we give a novel regret minimization algorithm. These results are made possible by the introduction of interior-point methods for convex optimization to online learning.

### No Internal Regret via Neighborhood Watch (with D. Foster)

AISTATS, 2012

We present an algorithm which attains O(√ T ) internal (and thus external) regret for finite games with partial monitoring under the

We present an algorithm which attains O(√ T ) internal (and thus external) regret for finite games with partial monitoring under the

*local observability condition*. Recently, this condition has been shown by (Bartók, Pál, and Szepesvári 2011) to imply the O(√ T ) rate for partial monitoring games against an i.i.d. opponent, and the authors conjectured that the same holds for non-stochastic adversaries. Our result is in the affirmative, and it completes the characterization of possible rates for finite partial-monitoring games, an open question stated by (Cesa-Bianchi, Lugosi, and Stoltz 2006). Our regret guarantees also hold for the more general model of partial monitoring with random signals.

### Lower Bounds for Passive and Active Learning (with M. Raginsky)

NIPS, 2011

We develop unified information-theoretic machinery for deriving lower bounds for passive and active learning schemes. Our bounds involve the so-called Alexander's capacity function. The supremum of this function has been recently rediscovered by Hanneke in the context of active learning under the name of ``disagreement coefficient." For passive learning, our lower bounds match the upper bounds of Gine and Koltchinskii up to constants and generalize analogous results of Massart and Nédélec. For active learning, we provide first known lower bounds based on the capacity function rather than the disagreement coefficient.

We develop unified information-theoretic machinery for deriving lower bounds for passive and active learning schemes. Our bounds involve the so-called Alexander's capacity function. The supremum of this function has been recently rediscovered by Hanneke in the context of active learning under the name of ``disagreement coefficient." For passive learning, our lower bounds match the upper bounds of Gine and Koltchinskii up to constants and generalize analogous results of Massart and Nédélec. For active learning, we provide first known lower bounds based on the capacity function rather than the disagreement coefficient.

### Complexity-Based Approach to Calibration with Checking Rules (with D. Foster, K. Sridharan and A. Tewari)

COLT, 2011

We consider the problem of forecasting a sequence of outcomes from an unknown source. The quality of the forecaster is measured by a family of checking rules. We prove upper bounds on the value of the associated game, thus certifying the existence of a calibrated strategy for the forecaster. We show that complexity of the family of checking rules can be captured by the notion of a sequential cover introduced in (Rakhlin, Sridharan, Tewari, 2010). Various natural assumptions on the class of checking rules are considered, including finiteness of Vapnik-Chervonenkis and Littlestone's dimensions.

We consider the problem of forecasting a sequence of outcomes from an unknown source. The quality of the forecaster is measured by a family of checking rules. We prove upper bounds on the value of the associated game, thus certifying the existence of a calibrated strategy for the forecaster. We show that complexity of the family of checking rules can be captured by the notion of a sequential cover introduced in (Rakhlin, Sridharan, Tewari, 2010). Various natural assumptions on the class of checking rules are considered, including finiteness of Vapnik-Chervonenkis and Littlestone's dimensions.

### Online Learning: Stochastic and Constrained Adversaries (with K. Sridharan and A. Tewari)

NIPS, 2011

Learning theory has largely focused on two main learning scenarios. The first is the classical statistical setting where instances are drawn i.i.d. from a fixed distribution and the second scenario is the online learning, completely adversarial scenario where adversary at every time step picks the worst instance to provide the learner with. It can be argued that in the real world neither of these assumptions are reasonable. It is therefore important to study problems with a range of assumptions on data. Unfortunately, theoretical results in this area are scarce, possibly due to absence of general tools for analysis. Focusing on the regret formulation, we define the minimax value of a game where the adversary is restricted in his moves. The framework captures stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We then consider the i.i.d. adversary and show equivalence of online and batch learnability. In the supervised setting, we consider various hybrid assumptions on the way that x and y variables are chosen. Finally, we consider smoothed learning problems and show that half-spaces are online learnable in the smoothed model. In fact, exponentially small noise added to adversary's decisions turns this problem with infinite Littlestone's dimension into a learnable problem.

Learning theory has largely focused on two main learning scenarios. The first is the classical statistical setting where instances are drawn i.i.d. from a fixed distribution and the second scenario is the online learning, completely adversarial scenario where adversary at every time step picks the worst instance to provide the learner with. It can be argued that in the real world neither of these assumptions are reasonable. It is therefore important to study problems with a range of assumptions on data. Unfortunately, theoretical results in this area are scarce, possibly due to absence of general tools for analysis. Focusing on the regret formulation, we define the minimax value of a game where the adversary is restricted in his moves. The framework captures stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We then consider the i.i.d. adversary and show equivalence of online and batch learnability. In the supervised setting, we consider various hybrid assumptions on the way that x and y variables are chosen. Finally, we consider smoothed learning problems and show that half-spaces are online learnable in the smoothed model. In fact, exponentially small noise added to adversary's decisions turns this problem with infinite Littlestone's dimension into a learnable problem.

### Information-Based Complexity, Feedback, and Dynamics in Convex Programming (with M. Raginsky)

IEEE Transactions on Information Theory, 2011

We study the intrinsic limitations of sequential convex optimization through the lens of feedback information theory. In the oracle model of optimization, an algorithm queries an oracle for noisy information about the unknown objective function, and the goal is to (approximately) minimize every function in a given class using as few queries as possible. We show that, in order for a function to be optimized, the algorithm must be able to accumulate enough information about the objective. This, in turn, puts limits on the speed of optimization under specific assumptions on the oracle and the type of feedback. Our techniques are akin to the ones used in statistical literature to obtain minimax lower bounds on the risks of estimation procedures; the notable difference is that, unlike in the case of i.i.d. data, a sequential optimization algorithm can gather observations in a controlled manner, so that the amount of information at each step is allowed to change in time. In particular, we show that optimization algorithms often obey the law of diminishing returns: the signal-to-noise ratio drops as the optimization algorithm approaches the optimum. To underscore the generality of the tools, we use our approach to derive fundamental lower bounds for a certain active learning problem. Overall, the present work connects the intuitive notions of ``information'' in optimization, experimental design, estimation, and active learning to the quantitative notion of Shannon information.

We study the intrinsic limitations of sequential convex optimization through the lens of feedback information theory. In the oracle model of optimization, an algorithm queries an oracle for noisy information about the unknown objective function, and the goal is to (approximately) minimize every function in a given class using as few queries as possible. We show that, in order for a function to be optimized, the algorithm must be able to accumulate enough information about the objective. This, in turn, puts limits on the speed of optimization under specific assumptions on the oracle and the type of feedback. Our techniques are akin to the ones used in statistical literature to obtain minimax lower bounds on the risks of estimation procedures; the notable difference is that, unlike in the case of i.i.d. data, a sequential optimization algorithm can gather observations in a controlled manner, so that the amount of information at each step is allowed to change in time. In particular, we show that optimization algorithms often obey the law of diminishing returns: the signal-to-noise ratio drops as the optimization algorithm approaches the optimum. To underscore the generality of the tools, we use our approach to derive fundamental lower bounds for a certain active learning problem. Overall, the present work connects the intuitive notions of ``information'' in optimization, experimental design, estimation, and active learning to the quantitative notion of Shannon information.

### Online Learning: Beyond Regret (with K. Sridharan and A. Tewari)

COLT, 2011

We study online learnability of a wide class of problems, extending the results of (Rakhlin, Sridharan, Tewari, 2010) to general notions of performance measure well beyond external regret. Our framework simultaneously captures such well-known notions as internal and general Phi-regret, learning with non-additive global cost functions, Blackwell's approachability, calibration of forecasters, adaptive regret, and more. We show that learnability in all these situations is due to control of the same three quantities: a martingale convergence term, a term describing the ability to perform well if future is known, and a generalization of sequential Rademacher complexity, studied in (Rakhlin, Sridharan, Tewari, 2010). Since we directly study complexity of the problem instead of focusing on efficient algorithms, we are able to improve and extend many known results which have been previously derived via an algorithmic construction.

We study online learnability of a wide class of problems, extending the results of (Rakhlin, Sridharan, Tewari, 2010) to general notions of performance measure well beyond external regret. Our framework simultaneously captures such well-known notions as internal and general Phi-regret, learning with non-additive global cost functions, Blackwell's approachability, calibration of forecasters, adaptive regret, and more. We show that learnability in all these situations is due to control of the same three quantities: a martingale convergence term, a term describing the ability to perform well if future is known, and a generalization of sequential Rademacher complexity, studied in (Rakhlin, Sridharan, Tewari, 2010). Since we directly study complexity of the problem instead of focusing on efficient algorithms, we are able to improve and extend many known results which have been previously derived via an algorithmic construction.

### Online Learning: Random Averages, Combinatorial Parameters, and Learnability (with K. Sridharan and A. Tewari)

NIPS, 2010

We study learnability in the online learning model. We define several complexity measures which capture the difficulty of learning in a sequential manner. Among these measures are analogues of Rademacher complexity, covering numbers and fat shattering dimension from statistical learning theory. Relationship among these complexity measures, their connection to online learning, and tools for bounding them are provided. In the setting of supervised learning, finiteness of the introduced scale-sensitive parameters is shown to be equivalent to learnability. The complexities we define also ensure uniform convergence for non-i.i.d. data, extending the uniform Glivenko-Cantelli type results. We conclude by showing online learnability for an array of examples.

We study learnability in the online learning model. We define several complexity measures which capture the difficulty of learning in a sequential manner. Among these measures are analogues of Rademacher complexity, covering numbers and fat shattering dimension from statistical learning theory. Relationship among these complexity measures, their connection to online learning, and tools for bounding them are provided. In the setting of supervised learning, finiteness of the introduced scale-sensitive parameters is shown to be equivalent to learnability. The complexities we define also ensure uniform convergence for non-i.i.d. data, extending the uniform Glivenko-Cantelli type results. We conclude by showing online learnability for an array of examples.

### Random Walk Approach to Regret Minimization (with H. Narayanan)

NIPS, 2010

We propose a computationally efficient random walk on a convex body which rapidly mixes to a time-varying Gibbs distribution. In the setting of online convex optimization and repeated games, the algorithm yields low regret and presents a novel efficient method for implementing mixture forecasting strategies.

We propose a computationally efficient random walk on a convex body which rapidly mixes to a time-varying Gibbs distribution. In the setting of online convex optimization and repeated games, the algorithm yields low regret and presents a novel efficient method for implementing mixture forecasting strategies.

### Online Convex Programming in Adaptive Control (with M. Raginsky and S. Yüksel)

IEEE Conference on Decision and Control, 2010

Online Convex Programming (OCP) is a recently developed model of sequential decision-making in the presence of time-varying uncertainty. In this framework, a decision-maker selects points in a convex feasible set to respond to a dynamically changing sequence of convex cost functions. A generic algorithm for OCP, often with provably optimal performance guarantees, is inspired by the Method of Mirror Descent (MD) developed by Nemirovski and Yudin in the 1970's. This paper highlights OCP as a common theme in adaptive control, both in its classical variant based on parameter tuning and in a more modern supervisory approach. Specifically, we show that: (1) MD leads to a generalization of classical adaptive control schemes based on recursive parameter tuning; (2) A supervisory controller switching policy that uses OCP to estimate system parameters from a sequence of appropriately regularized output prediction errors can flexibly adapt to presence or absence of output disturbances in the system.

Online Convex Programming (OCP) is a recently developed model of sequential decision-making in the presence of time-varying uncertainty. In this framework, a decision-maker selects points in a convex feasible set to respond to a dynamically changing sequence of convex cost functions. A generic algorithm for OCP, often with provably optimal performance guarantees, is inspired by the Method of Mirror Descent (MD) developed by Nemirovski and Yudin in the 1970's. This paper highlights OCP as a common theme in adaptive control, both in its classical variant based on parameter tuning and in a more modern supervisory approach. Specifically, we show that: (1) MD leads to a generalization of classical adaptive control schemes based on recursive parameter tuning; (2) A supervisory controller switching policy that uses OCP to estimate system parameters from a sequence of appropriately regularized output prediction errors can flexibly adapt to presence or absence of output disturbances in the system.

### Quantitative Analysis of Systems Using Game-Theoretic Learning (with S. Seshia)

ACM Transactions on Embedded Computing Systems, 2010

The analysis of quantitative properties, such as timing and power, is central to the design of reliable embedded software and systems. However, the verification of such properties on a program is made difficult by their heavy dependence on the program's environment, such as the processor it runs on. Modeling the environment by hand can be tedious, error-prone and time consuming. In this paper, we present a new, game-theoretic approach to analyzing quantitative properties that is based on performing systematic measurements to automatically learn a model of the environment. We model the problem as a game between our algorithm (player) and the environment of the program (adversary), where the player seeks to accurately predict the property of interest while the adversary sets environment states and parameters. To solve this problem, we employ a randomized strategy that repeatedly tests the program along a linear-sized set of program paths called basis paths, using the resulting measurements to infer a weighted-graph model of the environment, from which quantitative properties can be predicted. We prove that our algorithm can, under certain assumptions and with arbitrarily high probability, accurately predict properties such as worst-case execution time or estimate the distribution of execution times.

The analysis of quantitative properties, such as timing and power, is central to the design of reliable embedded software and systems. However, the verification of such properties on a program is made difficult by their heavy dependence on the program's environment, such as the processor it runs on. Modeling the environment by hand can be tedious, error-prone and time consuming. In this paper, we present a new, game-theoretic approach to analyzing quantitative properties that is based on performing systematic measurements to automatically learn a model of the environment. We model the problem as a game between our algorithm (player) and the environment of the program (adversary), where the player seeks to accurately predict the property of interest while the adversary sets environment states and parameters. To solve this problem, we employ a randomized strategy that repeatedly tests the program along a linear-sized set of program paths called basis paths, using the resulting measurements to infer a weighted-graph model of the environment, from which quantitative properties can be predicted. We prove that our algorithm can, under certain assumptions and with arbitrarily high probability, accurately predict properties such as worst-case execution time or estimate the distribution of execution times.

### Information Complexity of Black-Box Convex Optimization: A New Look via Feedback Information Theory (with M. Raginsky)

Allerton Conference on Communication, Control, and Computing, 2010

This paper revisits information complexity of black-box convex optimization, first studied in the seminal work of Nemirovski and Yudin, from the perspective of feedback information theory. These days, large-scale convex programming arises in a variety of applications, and it is important to refine our understanding of its fundamental limitations. The goal of black-box convex optimization is to minimize an unknown convex objective function from a given class over a compact, convex domain using an iterative scheme that generates approximate solutions by querying an oracle for local information about the function being optimized. The information complexity of a given problem class is defined as the smallest number of queries needed to minimize every function in the class to some desired accuracy. We present a simple information-theoretic approach that not only recovers many of the results of Nemirovski and Yudin, but also gives some new bounds pertaining to optimal rates at which iterative convex optimization schemes approach the solution. As a bonus, we give a particularly simple derivation of the minimax lower bound for a certain active learning problem on the unit interval.

This paper revisits information complexity of black-box convex optimization, first studied in the seminal work of Nemirovski and Yudin, from the perspective of feedback information theory. These days, large-scale convex programming arises in a variety of applications, and it is important to refine our understanding of its fundamental limitations. The goal of black-box convex optimization is to minimize an unknown convex objective function from a given class over a compact, convex domain using an iterative scheme that generates approximate solutions by querying an oracle for local information about the function being optimized. The information complexity of a given problem class is defined as the smallest number of queries needed to minimize every function in the class to some desired accuracy. We present a simple information-theoretic approach that not only recovers many of the results of Nemirovski and Yudin, but also gives some new bounds pertaining to optimal rates at which iterative convex optimization schemes approach the solution. As a bonus, we give a particularly simple derivation of the minimax lower bound for a certain active learning problem on the unit interval.

### A Stochastic View of Optimal Regret through Minimax Duality (with J. Abernethy, A. Agarwal, and P. Bartlett)

COLT, 2009

We study the regret of optimal strategies for online convex optimization games. Using von Neumann's minimax theorem, we show that the optimal regret in this adversarial setting is closely related to the behavior of the empirical minimization algorithm in a stochastic process setting: it is equal to the maximum, over joint distributions of the adversary's action sequence, of the difference between a sum of minimal expected losses and the minimal empirical loss. We show that the optimal regret has a natural geometric interpretation, since it can be viewed as the gap in Jensen's inequality for a concave functional--the minimizer over the player's actions of expected loss--defined on a set of probability distributions. We use this expression to obtain upper and lower bounds on the regret of an optimal strategy for a variety of online learning problems. Our method provides upper bounds without the need to construct a learning algorithm; the lower bounds provide explicit optimal strategies for the adversary.

We study the regret of optimal strategies for online convex optimization games. Using von Neumann's minimax theorem, we show that the optimal regret in this adversarial setting is closely related to the behavior of the empirical minimization algorithm in a stochastic process setting: it is equal to the maximum, over joint distributions of the adversary's action sequence, of the difference between a sum of minimal expected losses and the minimal empirical loss. We show that the optimal regret has a natural geometric interpretation, since it can be viewed as the gap in Jensen's inequality for a concave functional--the minimizer over the player's actions of expected loss--defined on a set of probability distributions. We use this expression to obtain upper and lower bounds on the regret of an optimal strategy for a variety of online learning problems. Our method provides upper bounds without the need to construct a learning algorithm; the lower bounds provide explicit optimal strategies for the adversary.

### Beating the Adaptive Bandit with High Probability (with J. Abernethy)

COLT, 2009

We provide a principled way of proving O(\sqrt{T}) high-probability guarantees for partial-information (bandit) problems over arbitrary convex decision sets. First, we prove a regret guarantee for the full-information problem in terms of "local" norms, both for entropy and self-concordant barrier regularization, unifying these methods. Given one of such algorithms as a black-box, we can convert a bandit problem into a full-information problem using a sampling scheme. The main result states that a high-probability O(\sqrt{T}) bound holds whenever the black-box, the sampling scheme, and the estimates of missing information satisfy a number of conditions, which are relatively easy to check. At the heart of the method is a construction of linear upper bounds on confidence intervals. As applications of the main result, we provide the first known efficient algorithm for the sphere with an O(\sqrt{T}) high-probability bound. We also derive the result for the n-simplex, improving the O(\sqrt{nT\log(nT)}) bound of Auer et al by replacing the logT term with loglogT and closing the gap to the lower bound of Ω(\sqrt{nT}). While O(\sqrt{T}) high-probability bounds should hold for general decision sets through our main result, construction of linear upper bounds depends on the particular geometry of the set; we believe that the sphere example already exhibits the necessary ingredients. The guarantees we obtain hold for adaptive adversaries (unlike the in-expectation results of Abernethy et al) and the algorithms are efficient, given that the linear upper bounds on confidence can be computed.

We provide a principled way of proving O(\sqrt{T}) high-probability guarantees for partial-information (bandit) problems over arbitrary convex decision sets. First, we prove a regret guarantee for the full-information problem in terms of "local" norms, both for entropy and self-concordant barrier regularization, unifying these methods. Given one of such algorithms as a black-box, we can convert a bandit problem into a full-information problem using a sampling scheme. The main result states that a high-probability O(\sqrt{T}) bound holds whenever the black-box, the sampling scheme, and the estimates of missing information satisfy a number of conditions, which are relatively easy to check. At the heart of the method is a construction of linear upper bounds on confidence intervals. As applications of the main result, we provide the first known efficient algorithm for the sphere with an O(\sqrt{T}) high-probability bound. We also derive the result for the n-simplex, improving the O(\sqrt{nT\log(nT)}) bound of Auer et al by replacing the logT term with loglogT and closing the gap to the lower bound of Ω(\sqrt{nT}). While O(\sqrt{T}) high-probability bounds should hold for general decision sets through our main result, construction of linear upper bounds depends on the particular geometry of the set; we believe that the sphere example already exhibits the necessary ingredients. The guarantees we obtain hold for adaptive adversaries (unlike the in-expectation results of Abernethy et al) and the algorithms are efficient, given that the linear upper bounds on confidence can be computed.

### Matrix Regularization Techniques for Online Multitask Learning (with A. Agarwal and P. Bartlett)

Tech report, 2008

In this paper we examine the problem of prediction with expert advice in a setup where the learner is presented with a sequence of examples coming from different tasks. In order for the learner to be able to benefit from performing multiple tasks simultaneously, we make assumptions of task relatedness by constraining the comparator to use a lesser number of best experts than the number of tasks. We show how this corresponds naturally to learning under spectral or structural matrix constraints, and propose regularization techniques to enforce the constraints. The regularization techniques proposed here are interesting in their own right and multitask learning is just one application for the ideas. A theoretical analysis of one such regularizer is performed, and a regret bound that shows benefits of this setup is reported.

In this paper we examine the problem of prediction with expert advice in a setup where the learner is presented with a sequence of examples coming from different tasks. In order for the learner to be able to benefit from performing multiple tasks simultaneously, we make assumptions of task relatedness by constraining the comparator to use a lesser number of best experts than the number of tasks. We show how this corresponds naturally to learning under spectral or structural matrix constraints, and propose regularization techniques to enforce the constraints. The regularization techniques proposed here are interesting in their own right and multitask learning is just one application for the ideas. A theoretical analysis of one such regularizer is performed, and a regret bound that shows benefits of this setup is reported.

### Game-Theoretic Timing Analysis (with S. Seshia)

IEEE/ACM Conference on Computer-Aided Design, 2008

Estimating the worst-case execution time (WCET) of tasks is a key step in the design of reliable real-time software and systems. In this paper, we present a new, game-theoretic approach to estimating WCET based on performing directed measurements on the target platform. We model the estimation problem as a game between our algorithm (player) and the environment of the program (adversary), where the player seeks to find the longest path through the program while the adversary sets environment parameters to thwart the player. We present both theoretical and experimental results demonstrating the utility of our approach. On the theoretical side, we prove that our algorithm can converge to find the longest path with high probability. Experimental results indicate that our approach is competitive with an existing technique based on static analysis and integer programming. Moreover, the approach can be easily applied to even complex hardware/software platforms.

Estimating the worst-case execution time (WCET) of tasks is a key step in the design of reliable real-time software and systems. In this paper, we present a new, game-theoretic approach to estimating WCET based on performing directed measurements on the target platform. We model the estimation problem as a game between our algorithm (player) and the environment of the program (adversary), where the player seeks to find the longest path through the program while the adversary sets environment parameters to thwart the player. We present both theoretical and experimental results demonstrating the utility of our approach. On the theoretical side, we prove that our algorithm can converge to find the longest path with high probability. Experimental results indicate that our approach is competitive with an existing technique based on static analysis and integer programming. Moreover, the approach can be easily applied to even complex hardware/software platforms.

### Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization (with J. Abernethy and E. Hazan)

COLT, 2008

We introduce an efficient algorithm for the problem of online linear optimization in the bandit setting which achieves the optimal $O^*(\sqrt{T})$ regret. The setting is a natural generalization of the non-stochastic multi-armed bandit problem, and the existence of an efficient optimal algorithm has been posed as an open problem in a number of recent papers. We show how the difficulties encountered by previous approaches are overcome by the use of a self-concordant potential function. Our approach presents a novel connection between online learning and interior point methods.

We introduce an efficient algorithm for the problem of online linear optimization in the bandit setting which achieves the optimal $O^*(\sqrt{T})$ regret. The setting is a natural generalization of the non-stochastic multi-armed bandit problem, and the existence of an efficient optimal algorithm has been posed as an open problem in a number of recent papers. We show how the difficulties encountered by previous approaches are overcome by the use of a self-concordant potential function. Our approach presents a novel connection between online learning and interior point methods.

### High probability regret bounds for online optimization (with P. Bartlett, V. Dani, T. Hayes, S. Kakade, and A. Tewari)

COLT, 2008

We present a modification of the algorithm of Dani et al. for the online linear optimization problem in the bandit setting, which with high probability has regret at most $O^*(\sqrt{T})$ against an adaptive adversary. This improves on the previous algorithm of Dani et al whose regret is bounded \emph{in expectation } against an \emph{oblivious} adversary. We obtain the same dependence on the dimension ($n^{3/2}$) as that exhibited by Dani et al. The results of this paper rest firmly on those of Dani et al and the remarkable technique of Auer et al. for obtaining high-probability bounds via optimistic estimates. This paper answers an open question: it eliminates the gap between the high-probability bounds obtained in the full-information vs bandit settings.

We present a modification of the algorithm of Dani et al. for the online linear optimization problem in the bandit setting, which with high probability has regret at most $O^*(\sqrt{T})$ against an adaptive adversary. This improves on the previous algorithm of Dani et al whose regret is bounded \emph{in expectation } against an \emph{oblivious} adversary. We obtain the same dependence on the dimension ($n^{3/2}$) as that exhibited by Dani et al. The results of this paper rest firmly on those of Dani et al and the remarkable technique of Auer et al. for obtaining high-probability bounds via optimistic estimates. This paper answers an open question: it eliminates the gap between the high-probability bounds obtained in the full-information vs bandit settings.

### Optimal Strategies and Minimax Lower Bounds for Online Convex Games (with J. Abernethy, P. Bartlett, and A. Tewari)

COLT, 2008

A number of learning problems can be cast as an Online Convex Game: on each round, a learner makes a prediction $x$ from a convex set, the environment plays a loss function $f$, and the learner's long-term goal is to minimize regret. Algorithms have been proposed by Zinkevich, when $f$ is assumed to be convex, and Hazan et al., when $f$ is assumed to be strongly convex, that have provably low regret. We consider these two settings and analyze such games from a minimax perspective, proving minimax strategies and lower bounds in each case. These results prove that the existing algorithms are essentially optimal.

A number of learning problems can be cast as an Online Convex Game: on each round, a learner makes a prediction $x$ from a convex set, the environment plays a loss function $f$, and the learner's long-term goal is to minimize regret. Algorithms have been proposed by Zinkevich, when $f$ is assumed to be convex, and Hazan et al., when $f$ is assumed to be strongly convex, that have provably low regret. We consider these two settings and analyze such games from a minimax perspective, proving minimax strategies and lower bounds in each case. These results prove that the existing algorithms are essentially optimal.

### Adaptive Online Gradient Descent (with P. Bartlett and E. Hazan)

NIPS, 2007

We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al for strongly convex functions, achieving intermediate rates between $\sqrt{T}$ and $\log T$. Furthermore, we show strong optimality of the algorithm. Finally, we provide an extension of our results to general norms.

We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al for strongly convex functions, achieving intermediate rates between $\sqrt{T}$ and $\log T$. Furthermore, we show strong optimality of the algorithm. Finally, we provide an extension of our results to general norms.

### Online Discovery of Similarity Mappings (with J. Abernethy and P. Bartlett)

ICML, 2007

We consider the problem of choosing, sequentially, a map which assigns elements of a set A to a few elements of a set B. On each round, the algorithm suffers some cost associated with the chosen assignment, and the goal is to minimize the cumulative loss of these choices relative to the best map on the entire sequence. Even though the offline problem of finding the best map is provably hard, we show that there is an equivalent online approximation algorithm, Randomized Map Prediction (RMP), that is efficient and performs nearly as well. While drawing upon results from the ``Online Prediction with Expert Advice'' setting, we show how RMP can be utilized as an online approach to several standard batch problems. We apply RMP to online clustering as well as online feature selection and, surprisingly, RMP often outperforms the standard batch algorithms on these problems.

We consider the problem of choosing, sequentially, a map which assigns elements of a set A to a few elements of a set B. On each round, the algorithm suffers some cost associated with the chosen assignment, and the goal is to minimize the cumulative loss of these choices relative to the best map on the entire sequence. Even though the offline problem of finding the best map is provably hard, we show that there is an equivalent online approximation algorithm, Randomized Map Prediction (RMP), that is efficient and performs nearly as well. While drawing upon results from the ``Online Prediction with Expert Advice'' setting, we show how RMP can be utilized as an online approach to several standard batch problems. We apply RMP to online clustering as well as online feature selection and, surprisingly, RMP often outperforms the standard batch algorithms on these problems.

### Multitask Learning with Expert Advice (with J. Abernethy and P. Bartlett)

COLT, 2007

We consider the problem of prediction with expert advice in the setting where a forecaster is presented with several online prediction tasks. Instead of competing against the best expert separately on each task, we assume the tasks are related, and thus we expect that a few experts will perform well on the entire set of tasks. That is, our forecaster would like, on each task, to compete against the best expert chosen from a small set of experts. While we describe the "ideal" algorithm and its performance bound, we show that the computation required for this algorithm is as hard as computation of a matrix permanent. We present an efficient algorithm based on mixing priors, and prove a bound that is nearly as good for the sequential task presentation case. We also consider a harder case where the task may change arbitrarily from round to round, and we develop an efficient approximate randomized algorithm based on Markov chain Monte Carlo techniques.

We consider the problem of prediction with expert advice in the setting where a forecaster is presented with several online prediction tasks. Instead of competing against the best expert separately on each task, we assume the tasks are related, and thus we expect that a few experts will perform well on the entire set of tasks. That is, our forecaster would like, on each task, to compete against the best expert chosen from a small set of experts. While we describe the "ideal" algorithm and its performance bound, we show that the computation required for this algorithm is as hard as computation of a matrix permanent. We present an efficient algorithm based on mixing priors, and prove a bound that is nearly as good for the sequential task presentation case. We also consider a harder case where the task may change arbitrarily from round to round, and we develop an efficient approximate randomized algorithm based on Markov chain Monte Carlo techniques.

### Stability of K-means Clustering (with A. Caponnetto)

NIPS, 2006

We phrase K-means clustering as an empirical risk minimization procedure over a class $H_K$ and explicitly calculate the covering number for this class. Next, we show that stability of K-means clustering is characterized by the geometry of $H_K$ with respect to the underlying distribution. We prove that in the case of a unique global minimizer, the clustering solution is stable with respect to complete changes of the data, while for the case of multiple minimizers, the change of $\Omega(n^{1/2})$ samples defines the transition between stability and instability. While for a finite number of minimizers this result follows from multinomial distribution estimates, the case of infinite minimizers requires more refined tools. We conclude by proving that stability of the functions in $H_K$ implies stability of the actual centers of the clusters. Since stability is often used for selecting the number of clusters in practice, we hope that our analysis serves as a starting point for finding theoretically grounded recipes for the choice of K.

We phrase K-means clustering as an empirical risk minimization procedure over a class $H_K$ and explicitly calculate the covering number for this class. Next, we show that stability of K-means clustering is characterized by the geometry of $H_K$ with respect to the underlying distribution. We prove that in the case of a unique global minimizer, the clustering solution is stable with respect to complete changes of the data, while for the case of multiple minimizers, the change of $\Omega(n^{1/2})$ samples defines the transition between stability and instability. While for a finite number of minimizers this result follows from multinomial distribution estimates, the case of infinite minimizers requires more refined tools. We conclude by proving that stability of the functions in $H_K$ implies stability of the actual centers of the clusters. Since stability is often used for selecting the number of clusters in practice, we hope that our analysis serves as a starting point for finding theoretically grounded recipes for the choice of K.

### Stability Properties of Empirical Risk Minimization over Donsker Classes (with A. Caponnetto)

Journal of Machine Learning Research, 2006

We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number $n$ of samples grows, the $L_2$-diameter of the set of almost-minimizers of empirical error with tolerance $\xi(n)=o(n^{-\frac{1}{2}})$ converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as $n$ increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlos-Major-Tusnady type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer with a rate strictly faster than $n^{-1/2}$.

We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number $n$ of samples grows, the $L_2$-diameter of the set of almost-minimizers of empirical error with tolerance $\xi(n)=o(n^{-\frac{1}{2}})$ converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as $n$ increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlos-Major-Tusnady type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer with a rate strictly faster than $n^{-1/2}$.

### Risk Bounds for Mixture Density Estimation (with D. Panchenko and S. Mukherjee)

ESAIM Probability and Statistics, 2005

In this paper we focus on the problem of estimating a bounded density using a finite combination of densities from a given class. We consider the Maximum Likelihood Estimator (MLE) and the greedy procedure described by Li and Barron \cite{LiBarron99,Li99} under the additional assumption of boundedness of densities. We prove an $O(\frac{1}{\sqrt{n}})$ bound on the estimation error which does not depend on the number of densities in the estimated combination. Under the boundedness assumption, this improves the bound of Li and Barron by removing the $\log n$ factor and also generalizes it to the base classes with converging Dudley integral.

In this paper we focus on the problem of estimating a bounded density using a finite combination of densities from a given class. We consider the Maximum Likelihood Estimator (MLE) and the greedy procedure described by Li and Barron \cite{LiBarron99,Li99} under the additional assumption of boundedness of densities. We prove an $O(\frac{1}{\sqrt{n}})$ bound on the estimation error which does not depend on the number of densities in the estimated combination. Under the boundedness assumption, this improves the bound of Li and Barron by removing the $\log n$ factor and also generalizes it to the base classes with converging Dudley integral.

### Stability Results in Learning Theory (with S. Mukherjee and T. Poggio)

Analysis and Applications, Special Issue on Learning Theory, 2005

The problem of proving generalization bounds for the performance of learning algorithms can be formulated as a problem of bounding the bias and variance of estimators of the expected error. We show how various {\it stability assumptions} can be employed for this purpose. We provide a necessary and sufficient stability condition for bounding the bias and variance for the Empirical Risk Minimization algorithm, and various sufficient conditions for bounding bias and variance of estimators for general algorithms. We discuss settings in which it is possible to obtain exponential bounds, and we prove an extension of the bounded-difference inequality for "almost always" stable algorithms.

The problem of proving generalization bounds for the performance of learning algorithms can be formulated as a problem of bounding the bias and variance of estimators of the expected error. We show how various {\it stability assumptions} can be employed for this purpose. We provide a necessary and sufficient stability condition for bounding the bias and variance for the Empirical Risk Minimization algorithm, and various sufficient conditions for bounding bias and variance of estimators for general algorithms. We discuss settings in which it is possible to obtain exponential bounds, and we prove an extension of the bounded-difference inequality for "almost always" stable algorithms.