Dynamic Programming (Algorithms: Dynamic Programming) \(V(x)=\max _{a\in \Gamma (x)}\{F(x,a)+\beta V(T(x,a))\}\)
This article takes dynamic programming as its title because dynamic programming is not solely a computer-science problem.
The two core characteristics of dynamic programming are optimal substructure and overlapping sub-problems.
Dynamic programming is a design method that strives for plainness and simplicity. -- qianxin
Dynamic programming: \(V(x)=\max _{a\in \Gamma (x)}\{F(x,a)+\beta V(T(x,a))\}\)
Dynamic programming is both a mathematical optimization method and a computer programming method. The method was developed by Richard Bellman in the 1950s and has found applications in numerous fields, from aerospace engineering to economics.
In both contexts, it refers to simplifying a complicated problem by breaking it down into simpler sub-problems in a recursive manner. While some decision problems cannot be taken apart this way, decisions that span several points in time do often break apart recursively. Likewise, in computer science, if a problem can be solved optimally by breaking it into sub-problems and then recursively finding the optimal solutions to the sub-problems, then it is said to have an optimal substructure.
If sub-problems can be nested recursively inside larger problems, so that dynamic programming methods are applicable, then there is a relation between the value of the larger problem and the values of the sub-problems. In the optimization literature, this relationship is called the Bellman equation.
Richard E. Bellman, an American applied mathematician, developed dynamic programming in the 1950s. It has since been used in many fields, from aerospace engineering to economics.
Dynamic programming in computer science and dynamic programming in mathematics are not quite the same thing; that distinction deserves a more rigorous definition.
Derivation of Bellman equation
A dynamic decision problem
Let the state at time \(t\) be \(x_{t}\). For a decision that begins at time 0, we take as given the initial state \(x_{0}\). At any time, the set of possible actions depends on the current state; we can write this as \(a_{t}\in \Gamma (x_{t})\), where the action \(a_{t}\) represents one or more control variables. We also assume that the state changes from \(x\) to a new state \(T(x,a)\) when action \(a\) is taken, and that the current payoff from taking action \(a\) in state \(x\) is \(F(x,a)\). Finally, we assume impatience, represented by a discount factor \(0<\beta <1\).
Under these assumptions, an infinite-horizon decision problem takes the following form:
\(V(x_{0})=\max _{\left\{a_{t}\right\}_{t=0}^{\infty }}\sum _{t=0}^{\infty }\beta^{t}F(x_{t},a_{t}),\)
subject to the constraints
\(a_{t}\in \Gamma (x_{t}),x_{t+1}=T(x_{t},a_{t}),\forall t=0,1,2,\dots\)
Notice that we have defined notation \(V(x_{0})\) to denote the optimal value that can be obtained by maximizing this objective function subject to the assumed constraints. This function is the value function. It is a function of the initial state variable \(x_{0},\) since the best value obtainable depends on the initial situation.
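To make the notation concrete, here is a standard cake-eating illustration (an invented example, not part of the text above): let the state \(x_{t}\) be the amount of cake left at time \(t\) and the action \(a_{t}\) the amount eaten, with \(\Gamma (x)=[0,x]\), transition \(T(x,a)=x-a\), and payoff \(F(x,a)=\ln a\). The decision problem then reads \(V(x_{0})=\max _{\{a_{t}\}}\sum _{t=0}^{\infty }\beta ^{t}\ln a_{t}\) subject to \(x_{t+1}=x_{t}-a_{t}\) and \(0\leq a_{t}\leq x_{t}\).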
Bellman's principle of optimality
The dynamic programming method breaks this decision problem into smaller subproblems. Bellman's principle of optimality describes how to do this:
Principle of Optimality: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. (See Bellman, 1957, Chap. III.3.)
In computer science, a problem that can be broken apart like this is said to have an optimal substructure. In the context of dynamic game theory, this principle is analogous to the concept of subgame perfect equilibrium, although what constitutes an optimal policy, in this case, is conditioned on the decision maker's opponents choosing similarly optimal policies from their points of view.
In computer science, a problem that can be decomposed in this way is said to have optimal substructure (though it does not necessarily have overlapping sub-problems).
As suggested by the principle of optimality, we will consider the first decision separately, setting aside all future decisions (we will start afresh from time 1 with the new state \(x_{1}\)). Collecting the future decisions in brackets on the right, the above infinite-horizon decision problem is equivalent to:
\(\max _{a_{0}}\left\{F(x_{0},a_{0})+\beta \left[\max _{\left\{a_{t}\right\}_{t=1}^{\infty }}\sum _{t=1}^{\infty }\beta ^{t-1}F(x_{t},a_{t}):a_{t}\in \Gamma (x_{t}),\;x_{t+1}=T(x_{t},a_{t}),\;\forall t\geq 1\right]\right\}\)
subject to the constraints
\(a_{0}\in \Gamma (x_{0}),\;x_{1}=T(x_{0},a_{0}).\)
Here we are choosing \(a_{0}\), knowing that our choice will cause the time 1 state to be \(x_{1}=T(x_{0},a_{0})\). That new state will then affect the decision problem from time 1 on. The whole future decision problem appears inside the square brackets on the right.
The Bellman equation
So far it seems we have only made the problem uglier by separating today's decision from future decisions. But we can simplify by noticing that what is inside the square brackets on the right is the value of the time 1 decision problem, starting from state \(x_{1}=T(x_{0},\ a_{0})\).
Therefore, we can rewrite the problem as a recursive definition of the value function:
\(V(x_{0})=\max _{a_{0}}\{F(x_{0},a_{0})+\beta V(x_{1})\},\) subject to the constraints \(a_{0}\in \Gamma (x_{0}),\;x_{1}=T(x_{0},a_{0}).\)
This is the Bellman equation. It can be simplified even further if we drop time subscripts and plug in the value of the next state:
\(V(x)=\max _{a\in \Gamma (x)}\{F(x,a)+\beta V(T(x,a))\}.\)
The Bellman equation is classified as a functional equation because solving it means finding the unknown function \(V\), which is the value function. Recall that the value function describes the best possible value of the objective, as a function of the state \(x\). By calculating the value function, we will also find the function \(a(x)\) that describes the optimal action as a function of the state; this is called the policy function.
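As a sketch of how this recursion is solved in practice, the value function iteration below discretizes the state space and repeatedly applies the Bellman operator. The grid, payoff \(F\), transition \(T\), and discount factor are illustrative assumptions (a discrete cousin of the cake-eating example above, with payoff \(\sqrt{a}\) instead of \(\ln a\) so that consuming nothing stays finite), not anything specified in the text.

```python
# Value function iteration for V(x) = max_{a in Gamma(x)} { F(x, a) + beta * V(T(x, a)) }
# All model ingredients (states, actions, F, T, beta) are illustrative assumptions.

beta = 0.9                      # discount factor, 0 < beta < 1
states = list(range(11))        # discretized state space x in {0, ..., 10}

def Gamma(x):
    """Feasible actions in state x: consume between 0 and x units."""
    return range(x + 1)

def F(x, a):
    """Current payoff from taking action a in state x (toy choice)."""
    return a ** 0.5

def T(x, a):
    """Deterministic state transition: what remains after consuming a."""
    return x - a

V = {x: 0.0 for x in states}    # initial guess for the value function

for _ in range(200):            # iterate the Bellman operator until (approximate) convergence
    V_new = {x: max(F(x, a) + beta * V[T(x, a)] for a in Gamma(x)) for x in states}
    if max(abs(V_new[x] - V[x]) for x in states) < 1e-8:
        V = V_new
        break
    V = V_new

# Recover the policy function a(x): the maximizing action in each state.
policy = {x: max(Gamma(x), key=lambda a: F(x, a) + beta * V[T(x, a)]) for x in states}
print(policy)
```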
Markov decision processes
In Markov decision processes, a Bellman equation is a recursion for expected rewards. For example, the expected reward for being in a particular state s and following some fixed policy \(\pi\) has the Bellman equation:
\(V^{\pi }(s)=R(s,\pi (s))+\gamma \sum _{s'}P(s'|s,\pi (s))V^{\pi }(s').\)
This equation describes the expected reward for taking the action prescribed by some policy \(\pi\).
The equation for the optimal policy is referred to as the Bellman optimality equation:
\(V^{\pi ^{*}}(s)=\max _{a}\left\{R(s,a)+\gamma \sum _{s'}P(s'|s,a)V^{\pi ^{*}}(s')\right\},\)
where \(\pi ^{*}\) is the optimal policy and \(V^{\pi ^{*}}\) refers to the value function of the optimal policy. The equation above describes the reward for taking the action giving the highest expected return.
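The same idea can be sketched numerically. Below is value iteration on a made-up two-state MDP (all states, actions, transition probabilities, and rewards are invented for illustration); it iterates the Bellman optimality equation and then reads off the greedy policy \(\pi ^{*}\).

```python
# Value iteration for the Bellman optimality equation
#   V*(s) = max_a { R(s, a) + gamma * sum_{s'} P(s' | s, a) * V*(s') }
# The two-state MDP below (states, actions, P, R) is a made-up illustration.

gamma = 0.95
states = ["s0", "s1"]
actions = ["stay", "move"]

# P[(s, a)] maps each next state s' to its probability.
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}
# R[(s, a)] is the expected immediate reward for taking a in s.
R = {
    ("s0", "stay"): 0.0, ("s0", "move"): 1.0,
    ("s1", "stay"): 2.0, ("s1", "move"): 0.0,
}

V = {s: 0.0 for s in states}
for _ in range(1000):
    V_new = {
        s: max(R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
               for a in actions)
        for s in states
    }
    if max(abs(V_new[s] - V[s]) for s in states) < 1e-10:
        V = V_new
        break
    V = V_new

# Greedy (optimal) policy pi*(s): the action attaining the maximum above.
pi = {s: max(actions,
             key=lambda a: R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()))
      for s in states}
print(V, pi)
```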
Computer programming \(V(x)=\max _{a}\{F(x,a)+\beta V(T(x,a))\}\)
There are two key attributes that a problem must have in order for dynamic programming to be applicable: optimal substructure and overlapping sub-problems. If a problem can be solved by combining optimal solutions to non-overlapping sub-problems, the strategy is called "divide and conquer" instead. This is why merge sort and quick sort are not classified as dynamic programming problems.
A problem to which dynamic programming applies has two properties: optimal substructure and overlapping sub-problems. Dynamic programming in other fields often does not insist on overlapping sub-problems; an optimal strategy built on overlapping sub-problems, however, is the key to dynamic programming algorithms in computer science.
Optimal substructure means that the solution to a given optimization problem can be obtained by the combination of optimal solutions to its sub-problems. Such optimal substructures are usually described by means of recursion. For example, given a graph G=(V, E), the shortest path p from a vertex u to a vertex v exhibits optimal substructure: take any intermediate vertex w on this shortest path p. If p is truly the shortest path, then it can be split into sub-paths p1 from u to w and p2 from w to v such that these, in turn, are indeed the shortest paths between the corresponding vertices (by the simple cut-and-paste argument described in Introduction to Algorithms). Hence, one can easily formulate the solution for finding the shortest paths in a recursive manner, which is what the Bellman-Ford algorithm or the Floyd–Warshall algorithm does.
Both the Fibonacci sequence and shortest paths in a graph exhibit optimal substructure.
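For instance, the Bellman-Ford relaxation below is one way to exploit this substructure: it repeatedly improves each distance estimate using the recursion dist(v) = min over edges (u, v, w) of dist(u) + w. The small example graph is invented for illustration.

```python
# Bellman-Ford: single-source shortest paths built on optimal substructure.
# The example graph below is a made-up illustration.

def bellman_ford(vertices, edges, source):
    INF = float("inf")
    dist = {v: INF for v in vertices}
    dist[source] = 0
    # |V| - 1 rounds of relaxation suffice because a shortest path
    # uses at most |V| - 1 edges.
    for _ in range(len(vertices) - 1):
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    return dist

vertices = ["u", "w", "v", "x"]
edges = [("u", "w", 2), ("w", "v", 3), ("u", "x", 7), ("x", "v", 1)]
print(bellman_ford(vertices, edges, "u"))   # shortest distance from u to v is 5
```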
Overlapping sub-problems means that the space of sub-problems must be small, that is, any recursive algorithm solving the problem should solve the same sub-problems over and over, rather than generating new sub-problems. For example, consider the recursive formulation for generating the Fibonacci series: \(F_{i} = F_{i-1} + F_{i-2}\), with base case \(F_{1} = F_{2} = 1\). Then \(F_{43} = F_{42} + F_{41}\), and \(F_{42} = F_{41} + F_{40}\). Now \(F_{41}\) is being solved in the recursive sub-trees of both \(F_{43}\) as well as \(F_{42}\). Even though the total number of sub-problems is actually small (only 43 of them), we end up solving the same problems over and over if we adopt a naive recursive solution such as this. Dynamic programming takes account of this fact and solves each sub-problem only once.
Overlapping sub-problems means the space of sub-problems is small enough that a recursive algorithm ends up solving the same sub-problems over and over.
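A quick way to see the overlap is to count how often each sub-problem is hit by the naive recursion; the counter below is purely an illustrative instrument, not part of the algorithm.

```python
# Naive recursive Fibonacci: the same sub-problems are recomputed many times.
from collections import Counter

calls = Counter()   # counts how often each F_i is (re)computed, for illustration only

def fib(i):
    calls[i] += 1
    if i <= 2:          # base case F_1 = F_2 = 1
        return 1
    return fib(i - 1) + fib(i - 2)

fib(20)
print(calls[10])    # F_10 is computed 89 times, even though there are only 20 sub-problems
```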
Top-down approach: This is the direct fall-out of the recursive formulation of any problem. If the solution to any problem can be formulated recursively using the solutions to its sub-problems, and if its sub-problems are overlapping, then one can easily memoize or store the solutions to the sub-problems in a table. Whenever we attempt to solve a new sub-problem, we first check the table to see if it is already solved. If a solution has been recorded, we can use it directly; otherwise, we solve the sub-problem and add its solution to the table.
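A top-down sketch of that table lookup, using the same Fibonacci recurrence:

```python
# Top-down (memoized) Fibonacci: check the table before recursing.
memo = {1: 1, 2: 1}         # table of already-solved sub-problems, seeded with the base cases

def fib_memo(i):
    if i not in memo:       # not solved yet: solve it once and record the answer
        memo[i] = fib_memo(i - 1) + fib_memo(i - 2)
    return memo[i]          # otherwise reuse the stored solution

print(fib_memo(43))         # 433494437, with each F_i computed exactly once
```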
Bottom-up approach: Once we formulate the solution to a problem recursively in terms of its sub-problems, we can try reformulating the problem in a bottom-up fashion: try solving the sub-problems first and use their solutions to build on and arrive at solutions to bigger sub-problems. This is also usually done in a tabular form by iteratively generating solutions to bigger and bigger sub-problems by using the solutions to smaller sub-problems. For example, if we already know the values of \(F_{41}\) and \(F_{40}\), we can directly calculate the value of \(F_{42}\).
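And the bottom-up counterpart, filling the table from the base cases upward so that \(F_{42}\) falls out directly once \(F_{41}\) and \(F_{40}\) are known:

```python
# Bottom-up Fibonacci: fill the table from the smallest sub-problems upward.
def fib_table(n):
    if n <= 2:
        return 1
    F = [0] * (n + 1)
    F[1] = F[2] = 1                 # base cases
    for i in range(3, n + 1):
        F[i] = F[i - 1] + F[i - 2]  # each entry uses the two already-computed entries below it
    return F[n]

print(fib_table(43))                # 433494437, computed in O(n) time and without recursion
```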
Some programming languages can automatically memoize the result of a function call with a particular set of arguments, in order to speed up call-by-name evaluation (this mechanism is referred to as call-by-need). Some languages make it possible portably (e.g. Scheme, Common Lisp, Perl or D). Some languages have automatic memoization built in, such as tabled Prolog and J, which supports memoization with the M. adverb.[4] In any case, this is only possible for a referentially transparent function. Memoization is also encountered as an easily accessible design pattern within term-rewrite based languages such as Wolfram Language.
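Python is not among the languages listed above, but its standard library offers a comparable convenience: functools.lru_cache memoizes a referentially transparent function automatically, as in this small sketch.

```python
# Automatic memoization via a decorator; safe only for referentially transparent functions.
from functools import lru_cache

@lru_cache(maxsize=None)    # cache every distinct argument seen
def fib(i):
    return 1 if i <= 2 else fib(i - 1) + fib(i - 2)

print(fib(43))              # each sub-problem is evaluated once and then served from the cache
```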