REINFORCEMENT LEARNING AND OPTIMAL CONTROL

BOOKS, VIDEOLECTURES, AND COURSE MATERIAL

Dimitri P. Bertsekas

Recently Posted Videolectures

Reinforcement Learning, Model Predictive Control, and the Newton step for solving Bellman's equation Lecture at Harvard University, June, 2025. Slides.

Abstract Dynamic Programming, Reinforcement Learning, Newton's Method, and Gradient Optimization Lecture at the ASU Mathematics Department, April, 2025. Slides.

Computer Chess with Model Predictive Control and Reinforcement Learning At Reinforcement Learning Workshop, The Indian Institute of Science, Bengaluru, India, Jan. 2025. Slides.

Model Predictive Control, and Reinforcement Learning: A Unified Framework Based on Dynamic Programming, Plenary talk at IFAC NMPC, Kyoto, Japan, Aug. 2024. Slides.

Distributed and Multiagent Reinforcement Learning , at Workshop on Intersections between Control, Learning and Optimization, IPAM, 2020. Slides.

Abandon reinforcement learning in favor of model predictive control: An interesting statement by deep learning pioneer Yann LeCun. See the followup comments and further discussion

The Textbook for my Reinforcement Learning Course at ASU

2nd Edition, 2025

Free download

A Review by Antonio Montano

Author's Description

This is the main textbook I use for my course at ASU. It is based on the class notes and the books I wrote over the years 2019-2025. It is a standalone book, but can also be used in conjunction with my videolectures and slides, available at this site.

The book can be downloaded and used freely for instructional purposes. It is available in digital form from Google Play, and will be soon available in print from the publishing company.

The book is about 500 pages long and includes solved end-of-chapter exercises. It places primary emphasis on intuitive reasoning, based on the mathematical framework of dynamic programming. While mathematical proofs are deemphasized, the textbook relies on the theoretical development and analysis given in my Dynamic Programming (DP) and Reinforcement Learning (RL) books listed at this site. All of these books share a consistent notation and terminology.

An important structural characteristic of the textbook is that it is organized in a modular way, with a view towards flexibility, so it can accommodate variations in course content. In particular, it is divided in two parts:

(1) A foundational platform, which consists of Chapter 1. It contains a selective overview of the approximate DP/RL landscape, and a starting point for a more detailed in-class development of other RL topics, whose choice can be at the instructor's discretion. It includes the conceptual framework and visualizations for approximation in value space based on Newton's method for solving the Bellman equation. It also sets a foundation for the development of the Model Predictive Control (MPC) methodology.

(2) An in-depth coverage of selected methodologies. In Chapter 2, we discuss methods of approximation in value space with one-step or multi-step lookahead. Methods of deterministic and stochastic rollout, and lookahead tree search receive special attention. Other topics of interest include multiagent rollout, adaptive control by rollout reoptimization, Bayesian optimization, and minimax problems. In Chapter 3, we discuss off-line training of neural networks and other approximation architectures, in conjunction with policy iteration/self-learning, Q-learning, policy gradient, and aggregation methods.

In a different course, other choices for in-depth coverage may be made, using the same foundational platform. For example, an optimal control/MPC/adaptive control course can be built upon the platform of Chapter 1. Similarly, more and less mathematically-oriented courses can be built upon this platform.

Chapter 1, Exact and Approximate Dynamic Programming. Contents: AlphaZero Off-Line Training, and On-Line Play, Deterministic Dynamic Programming, Stochastic Exact and Approximate Dynamic Programming, Infinite Horizon Problems - An Overview, Infinite Horizon Linear Quadratic Problems, Conceptual Framework for Approximation in Value Space - Newton's Method, Examples Reformulations, and Simplifications (POMDP, Model Predictive and Adaptive Control), Relations of Reinforcement Learning and Decision/Control.

Chapter 2, Approximation in Value Space - Rollout Algorithms. Contents: Deterministic Finite Horizon Problems, Approximation in Value Space - Deterministic Problems, Rollout Algorithms for Discrete Optimization, Rollout and Approximation in Value Space with Multistep Lookahead, Constrained Forms of Rollout Algorithms, Small Stage Costs and Long Horizon - Continuous-Time Rollout, Stochastic Rollout and Monte Carlo Tree Search, Rollout for Infinite-Spaces Problems - Optimization, Multiagent Rollout, Rollout for Bayesian Optimization and Sequential Estimation, Adaptive Control by Rollout with a POMDP Formulation, Rollout for Minimax Control, Application to Computer Chess.

Chapter 3, Learning Values and Policies. Contents: Parametric Approximation Architectures, Neural Networks, Training of Cost Functions in Approximate DP, Training of Policies in Approximate DP, Policy Gradient and Related Methods, Aggregation.

ABOUT THE SECOND EDITION

The second edition contains some major additions, including material that was covered in the 2024 and 2025 offerings of the course at ASU. In particular, a connection was established with the methodologies of transformers, large language models, and HMM inference (Section 2.3.7). The material on multistep search for deterministic problems was enriched (Section 2.4). The discussion on MPC was broadened, including material on its application to minimax problems and computer chess (Section 2.12). The material on policy gradient methods was substantially expanded (Section 3.5). At the same time, the structure and aims of the first edition have remained unchanged.

Supplementary Textbooks:

Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control. The print copy of the book is available from the publishing company, Athena Scientific. The book is also available as an EBOOK from Google Play. Click here for a free .pdf copy of the book.

"Reinforcement Learning and Optimal Control," also available as an EBOOK from Google Play.

"Rollout, Policy Iteration, and Distributed Reinforcement Learning," also available as an EBOOK from Google Play. Click here for a free .pdf copy of the book.

REINFORCEMENT LEARNING COURSE AT ASU, SPRING 2025: VIDEOLECTURES, AND SLIDES

Syllabus of the 2025 Reinforcement Learning course at ASU

Videolectures and Slides:

Video-Lecture 1, (Introduction to DP/RL, Off-Line Training and On-Line Play Synergy, Deterministic Finite Horizon DP)

Video-Lecture 2, (Deterministic Rollout, Stochastic Finite and Infinite Horizon DP)

Video-Lecture 3, (Approximation in Value Space, Visualization with Newton's Method, State Augmentation, Multiagent Problems)

Video-Lecture 4, (Special Problems: POMDP, Adaptive Control, Model Predictive Control)

Video-Lecture 5, (Finite Horizon DP Review - Deterministic Rollout and Animations)

Video-Lecture 6, (Multistep Approximation in Value Space for Deterministic Problems, Constrained Rollout, Multiagent Rollout)

Video-Lecture 7, (Multiagent Rollout, Applications in Multi-Robot Warehouse Operation, Multitarget Tracking, and Data Association)

Video-Lecture 8, (Most Likely Sequence Generation for n-Grams, Transformers, HMMs, and Markov Chains)

Video-Lecture 9, (Rollout and Its Variants for Stochastic and Adaptive Control, Application to Wordle)

Video-Lecture 10, (Aggregation Methods, Applications to POMDP and Cybersecurity)

Video-Lecture 11, (Minimax Problems, Approximation in Value and Policy Space, Computer Chess)

Video-Lecture 12, (Off-Line Training, Approximation in Policy Space, Policy Gradient Methods)

Slides-Lecture 1, Slides-Lecture 2, Slides-Lecture 3, Slides-Lecture 4, Slides-Lecture 5, Slides-Lecture 6, Slides-Lecture 7, Slides-Lecture 8, Slides-Lecture 9, Slides-Lecture 10, Slides-Lecture 11, Slides-Lecture 12

Animations

Traveling salesman problem using the rollout algorithm, as described in the course textbook. The base heuristic is the nearest neighbor heuristic.

Value and policy iteration for scalar linear quadratic problems, as described in the course textbook.

Multiagent rollout, standard rollout, and greedy heuristic for spider and fly problem (see Lecture 6).

Multiagent rollout with order reshuffling for multiagent path finding problems (see Lecture 7).

Approximation in value space for multidimensional assignment problems (see Lecture 7).

REINFORCEMENT LEARNING COURSE AT ASU, SPRING 2024: VIDEOLECTURES, AND SLIDES

Syllabus of the 2024 Reinforcement Learning course at ASU

Complete Set of Videolectures and Slides:

Note that the 1st videolecture of 2024 is the same as the 1st videolecture of 2023 (the sound of the 1st videolecture of 2024 came out degraded). The contents of the 1st videolectures of 2023 and 2024 are very similar.

Video-Lecture 1, (Introduction - Deterministic Finite Horizon DP)

Video-Lecture 2, (Deterministic Rollout - Stochastic Finite and Infinite Horizon DP)

Video-Lecture 3, (Approximation in Value Space, Visualization with Newton's Method)

Video-Lecture 4, (Special Problems: POMDP, Adaptive Control, Model Predictive Control)

Video-Lecture 5, (Deterministic Rollout - Finite Horizon DP)

Video-Lecture 6, (Multistep Approximation in Value Space, Constrained Rollout, Multiagent Rollout)

Video-Lecture 7, (Applications of Approximation in Value Space: Multiagent Problems, Data Association)

Video-Lecture 8, (Rollout and Approximation in Value Space for Stochastic and Continuous Spaces Problems)

Video-Lecture 9, (Sequential Estimation, Bayesian Optimization, and Adaptive Control - Application to the Wordle Puzzle)

Video-Lecture 10, (Most Likely Sequence Generation for n-Grams, Transformers, HMMs, and Markov Chains)

Video-Lecture 11, (Off-Line Training Algorithms: Neural Nets, and Other Parametric Architectures)

Video-Lecture 12, (More on Off-Line Training - Aggregation)

Video-Lecture 13 (Approximate Linear Programming - Policy Gradient and Random Search Methods)

Additional Videolectures:

Lecture on Computer Chess and Model Predictive Control at Reinforcement Learning Workshop, the Indian Institute of Science, Bengaluru, Jan. 2025. Slides.

Model Predictive Control, and Reinforcement Learning: A Unified Framework Based on Dynamic Programming, Plenary talk at IFAC NMPC, August 2024.

The failure of theoretical error bounds in Reinforcement Learning: A 15-minute discussion at MIT on the role of the Newton step interpretation of approximation in value space in understanding the puzzle of success and failure in RL, Oct. 2023. Slides.

Overview of the book Lessons from AlphaZero: A summary one-hour presentation at MIT, Oct. 2022. Slides.

Overview of the book Lessons from AlphaZero: A fairly detailed two-hour presentation at KTH, Nov. 2021. Slides.

Overview Lecture on Multiagent Reinforcement Learning at ASU, Oct. 2020. Slides.

Overview Lecture on Distributed Reinforcement Learning from IPAM workshop at UCLA, Feb. 2020. Slides.

REINFORCEMENT LEARNING COURSE AT ASU, SPRING 2023: VIDEOLECTURES, AND SLIDES

Syllabus of the 2023 Reinforcement Learning course at ASU

Video-Lecture 1, Video-Lecture 2, Video-Lecture 3, Video-Lecture 4, Video-Lecture 5, Video-Lecture 6, Video-Lecture 7, Video-Lecture 8 (multiagent rollout demo), Video-Lecture 9, Video-Lecture 10, Video-Lecture 11

Slides-Lecture 1, Slides-Lecture 2, Slides-Lecture 3, Slides-Lecture 4, Slides-Lecture 5, Slides-Lecture 6, Slides-Lecture 7, Slides-Lecture 8, Slides-Lecture 9, Slides-Lecture 10, Slides-Lecture 11

REINFORCEMENT LEARNING COURSE AT ASU, SPRING 2022: VIDEOLECTURES, AND SLIDES

Video-Lecture 1, Video-Lecture 2, Video-Lecture 3, Video-Lecture 4, Video-Lecture 5, Video-Lecture 6, Video-Lecture 7, Video-Lecture 8, Video-Lecture 9, Video-Lecture 10, Video-Lecture 11, Video-Lecture 12, Video-Lecture 13

Slides-Lecture 1, Slides-Lecture 2, Slides-Lecture 3, Slides-Lecture 4, Slides-Lecture 5, Slides-Lecture 6, Slides-Lecture 7 Slides-Lecture 8 Slides-Lecture 9 Slides-Lecture 10 Slides-Lecture 11 Slides-Lecture 12 Slides-Lecture 13

The final Lecture 13 is an overview of the entire course.

REINFORCEMENT LEARNING COURSE AT ASU, SPRING 2021: VIDEOLECTURES, AND SLIDES

Videolectures, slides, and other material for the 2021 course on Reinforcement Learning and Optimal Control, at Arizona State University:

The final Lecture 13 is an overview of the entire course.

REINFORCEMENT LEARNING COURSE AT ASU, SPRING 2019: VIDEOLECTURES, AND SLIDES

Video-Lecture 1, Video-Lecture 2, Video-Lecture 3,Video-Lecture 4, Video-Lecture 5, Video-Lecture 6, Video-Lecture 7, Video-Lecture 8, Video-Lecture 9, Video-Lecture 10, Video-Lecture 11, Video-Lecture 12, Video-Lecture 13.

Lecture 13 is an overview of the entire course.

REINFORCEMENT LEARNING AND OPTIMAL CONTROL BOOK, Athena Scientific, 2019

The print version of the book is available from the publishing company Athena Scientific, or from Amazon.com. The book is also available as an Ebook from Google Books.

The purpose of the book is to consider large and challenging multistage decision problems, which can be solved in principle by dynamic programming and optimal control, but their exact solution is computationally intractable. We discuss solution methods that rely on approximations to produce suboptimal policies with adequate performance. These methods are collectively referred to as reinforcement learning, and also by alternative names such as approximate dynamic programming, and neuro-dynamic programming.

Our subject has benefited enormously from the interplay of ideas from optimal control and from artificial intelligence. One of the aims of this monograph is to explore the common boundary between these two fields and to form a bridge that is accessible by workers with background in either field.

The mathematical style of the book is somewhat different from the author's dynamic programming books, and the neuro-dynamic programming monograph, written jointly with John Tsitsiklis. We rely more on intuitive explanations and less on proof-based insights. Still we provide a rigorous short account of the theory of finite and infinite horizon dynamic programming, and some basic approximation methods, in an appendix. For this we require a modest mathematical background: calculus, elementary probability, and a minimal use of matrix-vector algebra.

The methods of this book have been successful in practice, and often spectacularly so, as evidenced by recent amazing accomplishments in the games of chess and Go. However, across a wide range of problems, their performance properties may be less than solid. This is a reflection of the state of the art in the field: there are no methods that are guaranteed to work for all or even most problems, but there are enough methods to try on a given challenging problem with a reasonable chance that one or more of them will be successful in the end. Accordingly, we have aimed to present a broad range of methods that are based on sound principles, and to provide intuition into their properties, even when these properties do not include a solid performance guarantee. Hopefully, with enough exploration with some of these methods and their variations, the reader will be able to address adequately his/her own problem.

ROLLOUT, POLICY ITERATION, AND DISTRIBUTED REINFORCEMENT LEARNING BOOK, Athena Scientific, 2020

The print version of the book is available from the publishing company Athena Scientific, and from Amazon.com. The book is also available as an Ebook from Google Books. Click here for a free .pdf copy of the book.

This is a research monograph at the forefront of research on reinforcement learning, also referred to by other names such as approximate dynamic programming and neuro-dynamic programming. The purpose of the monograph is to develop in greater depth some of the methods from the author's recently published textbook on Reinforcement Learning (Athena Scientific, 2019). In particular, we present new research, relating to systems involving multiple agents, partitioned architectures, and distributed asynchronous computation. We pay special attention to the contexts of dynamic programming/policy iteration and control theory/model predictive control. We also discuss in some detail the application of the methodology to challenging discrete/combinatorial optimization problems, such as routing, scheduling, assignment, and mixed integer programming, including the use of neural network approximations within these contexts.

The book focuses on the fundamental idea of policy iteration, i.e., start from some policy, and successively generate one or more improved policies. If just one improved policy is generated, this is called rollout, which, based on broad and consistent computational experience, appears to be one of the most versatile and reliable of all reinforcement learning methods. Among others, it can be applied on-line using easily implementable simulation, and it can be used for discrete deterministic combinatorial optimization, as well as for stochastic Markov decision problems. Moreover, rollout can make on-line use of the policy produced off-line by policy iteration or by any other method (including a policy gradient method), and improve on the performance of that policy,

Much of the new research is inspired by the remarkable AlphaZero chess program, where policy iteration, value and policy networks, approximate lookahead minimization, and parallel computation all play an important role. In addition to the fundamental process of successive policy iteration/improvement, this program includes the use of deep neural networks for representation of both value functions and policies, the extensive use of large scale parallelization, and the simplification of lookahead minimization, through methods involving Monte Carlo tree search and pruning of the lookahead tree. In this book, we also focus on policy iteration, value and policy neural network representations, multiagent systems, parallel and distributed computation, and lookahead simplification. Thus while there are significant differences, the principal design ideas that form the core of this monograph are shared by the AlphaZero architecture, except that we develop these ideas within a broader and less application-specific framework.

RESEARCH MONOGRAPH: LESSONS FROM ALPHAZERO, Athena Scientific, 2022

Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control, (Athena Scientific, 2022); click here for a free .pdf copy of the book. Click here for the EBOOK version from Google Play.

The print copy of the book is available from the publishing company, Athena Scientific.

A journal paper: D. P. Bertsekas, "Newton's Method for Reinforcement Learning and Model Predictive Control," Results in Control and Optimization J., Vol. 7, 2022, pp. 100-121.

Lessons from AlphaZero Videolecture: A summary presentation at KTH, Nov. 2021

Slides from the KTH Lecture

A brief description: Some of the most exciting success stories in artificial intelligence and reinforcement learning have been in the area of games. Primary examples are the recent AlphaZero program (which plays chess), and the similarly structured and earlier (1990s) TD-Gammon program (which plays backgammon). These programs were trained off-line extensively using sophisticated self-play/approximate policy iteration algorithms and neural networks. Yet the AlphaZero player that has been obtained off-line is not used directly during on-line play (it is too inaccurate due to approximation errors that are inherent in off-line neural network training). Instead a separate on-line player is used, which is based on multistep lookahead and a terminal position evaluator that was trained off-line. The on-line player performs a form of policy improvement, which unlike the off-line player, is not degraded by neural network approximations. As a result, it has greatly improved performance.

Similarly, TD-Gammon performs on-line a policy improvement step using lookahead minimization that is not degraded by neural network approximations. To this end it uses an off-line neural-network trained terminal position evaluator, and importantly it also extends its on-line lookahead by rollout (simulation with the one-step lookahead player that is based on the position evaluator).

An important lesson from AlphaZero and TD-Gammon is that performance of an off-line trained controller can be greatly improved by on-line play, with long lookahead (involving minimization or rollout with an off-line obtained policy, or both), and terminal cost approximation that is obtained off-line. This performance enhancement is often dramatic and is due to a simple fact, which is the central point of our work: on-line play amounts to a step of Newton's method for solving Bellman's equation, while the starting point for the Newton step is based on the results of off-line training and may be enhanced by longer lookahead and on-line rollout. This process can be understood in terms of abstract models of dynamic programming and simple geometrical constructions. It manifests itself to some extent in model predictive control, but it seems that it has yet to be fully appreciated within the decision and control community.

In this work we aim to provide insights (often based on visualization), which explain the beneficial effects of on-line decision making on top of off-line training. While we will deemphasize mathematical proofs, there is considerable related analysis, which supports our conclusions and can be found in the author's recent RL books (2019, 2020). One of our principal aims is to show through the unifying principles of abstract DP that the AlphaZero/TD-Gammon ideas of approximation in value space and rollout apply very broadly to deterministic and stochastic optimal control problems, involving both discrete and continuous search spaces. Moreover, these ideas can be effectively integrated with other important methodologies such as model predictive control, adaptive control, decentralized control, discrete and Bayesian optimization, neural network-based value and policy approximations, and heuristic algorithms for discrete optimization.

REINFORCEMENT LEARNING SURVEYS, RESEARCH PAPERS, AND REPORTS

The following recent papers and reports have a strong connection to material in my reinforcement learning books, and amplify on their analysis and its range of applications.

Li, Y., and Bertsekas, D., "Semilinear Dynamic Programming: Analysis, Algorithms, and Certainty Equivalence Properties," arXiv:2501.04668, January, 2025.

Gundawar, A., Li, Y., and Bertsekas, D., "Superior Computer Chess with Model Predictive Control, Reinforcement Learning, and Rollout," arXiv:2409.06477, Sept. 2024.

Bertsekas, D., "Model Predictive Control, and Reinforcement Learning: A Unified Framework Based on Dynamic Programming," IFAC NMPC, March, 2024.

Li, Y., and Bertsekas, D., "Most Likely Sequence Generation for n-Grams, Transformers, HMMs, and Markov Chains, by Using Rollout Algorithms," arXiv:2403.15465, March, 2024.

Musunuru, P., Li, Y., Weber, J., and Bertsekas, D., "An Approximate Dynamic Programming Framework for Occlusion-Robust Multi-Object Tracking," arXiv:2405.15137, May 2024.

Bertsekas, D., "Rollout Algorithms and Approximate Dynamic Programming for Bayesian Optimization and Sequential Estimation," Dec. 2022, arXiv:2212.07998.

D. P. Bertsekas, "Distributed Asynchronous Policy Iteration for Sequential Zero-Sum Games and Minimax Control," arXiv preprint arXiv:2107.10406, July 2021.

D. P. Bertsekas, "On-Line Policy Iteration for Infinite Horizon Dynamic Programming," arXiv preprint arXiv:2106.00746, May 2021.

Bertsekas, D., "Multiagent Reinforcement Learning: Rollout and Policy Iteration," IEEE/CAA Journal of Automatica Sinica, Vol. 8, 2021, pp. 249-271. Video of an overview lecture.

D. P. Bertsekas, "Multiagent Value Iteration Algorithms in Dynamic Programming and Reinforcement Learning," ASU Report, April 2020; arXiv preprint, arXiv:2005.01627, May 2020; Results in Control and Optimization J., Vol. 1, 2020.

Bertsekas, D., "Multiagent Rollout Algorithms and Reinforcement Learning," arXiv preprint arXiv:1910.00120, September 2019 (revised April 2020).

Bertsekas, D., "Constrained Multiagent Rollout and Multidimensional Assignment with the Auction Algorithm," arXiv preprint, arXiv:2002.07407 February 2020.

D. P. Bertsekas, "Biased Aggregation, Rollout, and Enhanced Policy Improvement for Reinforcement Learning," Lab. for Information and Decision Systems Report, MIT, October 2018; a shorter version appears as arXiv preprint arXiv:1910.02426, Oct. 2019.

D. P. Bertsekas, "Feature-Based Aggregation and Deep Reinforcement Learning: A Survey and Some New Implementations," Lab. for Information and Decision Systems Report, MIT, April 2018 (revised August 2018); arXiv preprint arXiv:1804.04577; a version published in IEEE/CAA Journal of Automatica Sinica. (Lecture Slides). (Related Video Lecture).

Bhambri, S., Bhattacharjee, A., Bertsekas, D., "Playing Wordle Using an Online Rollout Algorithm for Deterministic POMDP," In Proc. of 2023 IEEE Conference on Games, Boston, MA.

Bhambri, S., Bhattacharjee, A., Bertsekas, D., "Reinforcement Learning Methods for Wordle: A POMDP/Adaptive Control Approach," Arizona State University/SCAI Report, Nov. 2022; arXiv:2211.10298.

Garces, D., Bhattacharya, S., Bertsekas, D., and Gil, G., "Approximate Multiagent Reinforcement Learning for On-Demand Urban Mobility Problem on a Large Map", ICRA, 2024.

Bhattacharya, S., Badyal, S., Wheeler, W., Gil, S., Bertsekas, D.,"Reinforcement Learning for POMDP: Partitioned Rollout and Policy Iteration with Application to Autonomous Sequential Repair Problems," IEEE Robotics and Automation Letters, Vol. 5, pp. 3967-3974, 2020.

Bhattacharya, S., Kailas, S., Badyal, S., Gil, S., Bertsekas, D.,"Multiagent Rollout and Policy Iteration for POMDP with Application to Multi-Robot Repair Problems," Proc. CORL, 2020; arXiv preprint, arXiv:2011.04222, November 2020.

RELATED BOOKS

Abstract Dynamic Programming, 3rd Edition, 2022

by Dimitri P. Bertsekas

The 3rd edition of the book is available as an Ebook from Google Books. The print version of the 3rd edition of the book is available from the publishing company, Athena Scientific.

Click here for a .pdf copy of the 3rd Edition of the book.

This research monograph provides a synthesis of old research on the foundations of dynamic programming (DP), with the modern theory of approximate DP and new research on semicontractive models. It aims at a unified and economical development of the core theory and algorithms of total cost sequential decision problems, based on the strong connections of the subject with fixed point theory. The analysis focuses on the abstract mapping that underlies DP and defines the mathematical character of the associated problem. The discussion centers on two fundamental properties that this mapping may have: monotonicity and (weighted sup-norm) contraction. It turns out that the nature of the analytical and algorithmic DP theory is determined primarily by the presence or absence of these two properties, and the rest of the problem's structure is largely inconsequential. New research is focused on two areas: 1) The ramifications of these properties in the context of algorithms for approximate DP, and 2) The new class of semicontractive models, exemplified by stochastic shortest path problems, where some but not all policies are contractive.

The 3rd edition is very similar to the 2nd edition, except for the addition of a new 40-page Chapter 5, which introduces a contractive abstract DP framework and related policy iteration algorithms, specifically designed for sequential zero-sum games and minimax problems with a general structure, and based on a recent paper by the author. Aside from greater generality, the advantage of our algorithms over alternatives is that they resolve some long-standing convergence difficulties of the natural PI algorithm, which have been known since the Pollatschek and Avi-Itzhak method for finite-state Markov games. Mathematically, this natural algorithm is a form of Newton's method for solving Bellman's equation, but Newton's method, contrary to the case of single-player DP problems, is not globally convergent in the case of a minimax problem, because of an additional difficulty: the Bellman operator may have components that are neither convex nor concave. The algorithms address this difficulty by introducing alternating player choices, and by using a policy-dependent mapping with a uniform sup-norm contraction property, similar to earlier works by Bertsekas and Yu, which is described in part in Chapter 2. Moreover, our algorithms allow a convergent and highly parallelizable implementation, which is based on state space partitioning, and distributed asynchronous policy evaluation and policy improvement operations within each set of the partition. They are also suitable for approximations based on an aggregation approach.

The 3rd and 2nd editions do not contain some material of the 1st edition that deals with restricted policies and Borel space models (Chapter 5 and Appendix C). These models are motivated in part by the complex measurability questions that arise in mathematically rigorous theories of stochastic optimal control involving continuous probability spaces. The restricted policies framework aims primarily to extend abstract DP ideas to Borel space models. Since this material is fully covered in Chapter 6 of the 1978 monograph by Bertsekas and Shreve, and followup research on the subject has been limited, Chapter 5 and Appendix C of the 1st edition were ommitted from the 2nd and 3rd editions and are posted below.

Chapter 5 of 1st Edition

Appendix C of 1st Edition

The following papers and reports have a strong connection to the book, and amplify on the analysis and the range of applications of the semicontractive models of Chapters 3 and 4:

D. P. Bertsekas, "Regular Policies in Abstract Dynamic Programming", Lab. for Information and Decision Systems Report LIDS-P-3173, MIT, May 2015; SIAM J. on Optimization, Vol. 27, No. 3, pp. 1694-1727. (Related Lecture Slides); (Related Video Lectures).

D. P. Bertsekas, "Value and Policy Iteration in Deterministic Optimal Control and Adaptive Dynamic Programming", Lab. for Information and Decision Systems Report LIDS-P-3174, MIT, May 2015 (revised Sept. 2015); IEEE Transactions on Neural Networks and Learning Systems, Vol. 28, 2017, pp. 500-509.

D. P. Bertsekas and H. Yu, "Stochastic Shortest Path Problems Under Weak Conditions", Lab. for Information and Decision Systems Report LIDS-P-2909, MIT, January 2016.

D. P. Bertsekas, "Robust Shortest Path Planning and Semicontractive Dynamic Programming," Lab. for Information and Decision Systems Report LIDS-P-2915, MIT, Feb. 2014 (revised Jan. 2015 and June 2016); arXiv preprint arXiv:1608.01670; Naval Research Logistics (NRL), 66(1), pp.15-37.

D. P. Bertsekas, "Affine Monotonic and Risk-Sensitive Models in Dynamic Programming", Lab. for Information and Decision Systems Report LIDS-3204, MIT, June 2016; arXiv preprint arXiv:1608.01393; IEEE Transactions on Aut. Control, Vol. 64, 2019, pp. 3117-3128.

D. P. Bertsekas, "Stable Optimal Control and Semicontractive Dynamic Programming," SIAM J. on Control and Optimization, Vol. 56, 2018, pp. 231-252, (Related Lecture Slides), (Related Video Lecture from MIT, May 2017). (Related Lecture Slides from UConn, Oct. 2017). (Related Video Lecture from UConn, Oct. 2017).

D. P. Bertsekas, "Proper Policies in Infinite-State Stochastic Shortest Path Problems," IEEE Transactions on Automatic Control, Vol. 63, 2018, pp. 3787-3792. (Related Lecture Slides).

An updated version of Chapter 4 of the author's Dynamic Programming book, Vol. II, which incorporates recent research on a variety of undiscounted problems and relates to abstract DP topics; (Related Lecture Slides).

A series of 5 Videolectures on Abstract Dynamic Programming and corresponding slides posted at Youtube.

Neuro-Dynamic Programming

Dimitri P. Bertsekas and John N. Tsitsiklis

Published November 1996

Click here for a free .pdf copy of the book.

This is historically the first book that fully explained the neuro-dynamic programming/reinforcement learning methodology, a breakthrough in the practical application of neural networks and dynamic programming to complex problems of planning, optimal decision making, and intelligent control.

Neuro-dynamic programming uses neural network approximations to overcome the "curse of dimensionality" and the "curse of modeling" that have been the bottlenecks to the practical application of dynamic programming and stochastic control to complex problems. The methodology allows systems to learn about their behavior through simulation, and to improve their performance through iterative reinforcement.

This book provides the first systematic presentation of the science and the art behind this exciting and far-reaching methodology. It develops a comprehensive analysis of reinforcement learning algorithms, and guides the reader to their successful application through case studies from complex problem areas. It contains material that is not available elsewhere in book form, such as a comprehensive and rigorous analysis of temporal difference methods, Q-learning, and error bounds associated with various methods.

Dynamic Programming and Optimal Control, Vol. 1, 4th Edition

Dimitri P. Bertsekas

Published February 2017

The fourth edition (February 2017) contains a substantial amount of new material, particularly on approximate DP in Chapter 6. This chapter was thoroughly reorganized and rewritten, to bring it in line, both with the contents of Vol. II, whose latest edition appeared in 2012, and with recent developments, which have propelled approximate DP to the forefront of attention.

Some of the highlights of the revision of Chapter 6 are an increased emphasis on one-step and multistep lookahead methods, parametric approximation architectures, neural networks, rollout, and Monte Carlo tree search. Among other applications, these methods have been instrumental in the recent spectacular success of computer Go programs. The material on approximate DP also provides an introduction and some perspective for the more analytically oriented treatment of Vol. II.

Click here for direct ordering from the publisher and preface, table of contents, supplementary educational material, lecture slides, videos, etc

Dynamic Programming and Optimal Control, Vol. II, 4th Edition: Approximate Dynamic Programming

Dimitri P. Bertsekas

Published June 2012, Updated 2018 and 2021

The fourth edition of Vol. II of the two-volume DP textbook was published in June 2012. This is a major revision of Vol. II and contains a substantial amount of new material, as well as a reorganization of old material. The length has increased by more than 60% from the third edition, and most of the old material has been restructured and/or revised. Volume II now numbers more than 700 pages and is larger in size than Vol. I. It can arguably be viewed as a new book!

Approximate DP has become the central focal point of this volume, and occupies more than half of the book (the last two chapters, and large parts of Chapters 1-3). Thus one may also view this new edition as a followup of the author's 1996 book "Neuro-Dynamic Programming" (coauthored with John Tsitsiklis). A lot of new material, the outgrowth of research conducted in the six years since the previous edition, has been included.

A new printing of the fourth edition (January 2018) contains some updated material, particularly on undiscounted problems in Chapter 4, and approximate DP in Chapter 6. References were also made to the contents of the 2017 edition of Vol. I, and to high profile developments in deep reinforcement learning, which have brought approximate DP to the forefront of attention.

MATHEMATICAL APPENDIX: MEASURABILITY ISSUES IN STOCHASTIC DP

Click here for Appendix A: Measure Theoretic Issues in Dynamic Programming, based on the 1978 monograph of D. P. Bertsekas and S. E. Shreve, "Dynamic Programming and Stochastic Control.

CHAPTER UPDATE - NEW MATERIAL

Click here for an updated version of Chapter 4, which incorporates recent research on a variety of undiscounted problem topics, including

Deterministic optimal control and adaptive DP (Sections 4.2 and 4.3).

Stochastic shortest path problems under weak conditions and their relation to positive cost problems (Sections 4.1.4 and 4.4).

Affine monotonic and multiplicative cost models (Section 4.5).

PREFACE, SLIDES, AND OTHER INFORMATION

Click here for preface and detailed information.

Click here to order at Amazon.com

Lectures on Exact and Approximate Finite Horizon DP: Videos from a 4-lecture, 4-hour short course at the University of Cyprus on finite horizon DP, Nicosia, 2017. Videos from Youtube. (Lecture Slides: Lecture 1, Lecture 2, Lecture 3, Lecture 4.)

Videos from a 6-lecture, 12-hour short course at Tsinghua Univ., Beijing, China, 2014. From the Tsinghua course site, and from Youtube. Click here to download Approximate Dynamic Programming Lecture slides, for this 12-hour video course.

Click here to download lecture slides for a 7-lecture short course on Approximate Dynamic Programming, Caradache, France, 2012.

Click here to download lecture slides for the MIT course "Dynamic Programming and Stochastic Control (6.231), Dec. 2015. The last six lectures cover a lot of the approximate dynamic programming material.

Click here to download research papers and other material on Dynamic Programming and Approximate Dynamic Programming.

Visits since April 9, 2019