Discounted Markov Decision Processes with Constrained Costs: the decomposition approach

Abstract. In this paper we consider constrained optimization of discrete-time Markov Decision Processes (MDPs) with finite state and action spaces, which accumulate both a reward and costs at each decision epoch. We study the problem of finding a policy that maximizes the expected total discounted reward subject to the constraints that the expected total discounted costs do not exceed given values. To compute an optimal or a nearly optimal stationary policy, we investigate the decomposition of the state space into strongly communicating classes. The discounted criterion has many applications in areas such as forest management, management of energy consumption, finance, communication systems (mobile networks) and artificial intelligence.


Introduction
The decomposition method consists in dividing the state space into subsets which are weakly coupled. This technique was first introduced by Bather [1]. In his context, the decomposition of the state space is based on the accessibility between states, and the state space is divided into several levels. Later, Ross and Varadarajan [5] presented a similar decomposition method to solve the constrained problem of long-run average Markov Decision Processes. In this decomposition, the state space is partitioned into strongly communicating classes and a (possibly empty) set of transient states. Next, Baykal-Gursoy and Ross [6] and Daoui and Abbad [7] used the same decomposition to solve the unconstrained long-run average problem.

In this work we model the environment as a Constrained Markov Decision Process, defined by a tuple $(S, A, p, r, c, V, \lambda, s_0)$, where $S$ is the finite set of states, $A$ is the finite set of actions, $p(j \mid s, a)$ is the transition probability, $r(s, a)$ is the reward function, which denotes the immediate reward incurred by taking action $a$ in state $s$, $c_k(s, a)$, $k = 1, \ldots, K$, are the cost functions with $V_k$ the value of the $k$-th cost constraint, $\lambda \in (0, 1)$ is the discount factor and $s_0$ is the fixed initial state. The goal is to compute an optimal policy that maximizes the expected cumulative discounted reward earned from state $s_0$ while the expected cumulative discounted costs remain bounded. Define the random variable
$$R = \sum_{t=0}^{\infty} \lambda^t \sum_{s \in S} \sum_{a \in A} 1\{X_t = s,\, A_t = a\}\, r(s,a),$$
and define the cost random variables $C^k$ analogously with $c_k$ in place of $r$. Here, $\{X_t\}$ is the state process taking values in the finite space $S$ and $\{A_t\}$ is the action process taking values in the finite action space $A$. The notation $1\{\cdot\}$ represents the indicator function. Write $R_\lambda(u, s_0) = \mathbb{E}^u_{s_0}[R]$ and $C^k_\lambda(u, s_0) = \mathbb{E}^u_{s_0}[C^k]$ for the expected discounted reward and costs under a policy $u$. Set
$$R^*_\lambda(s_0) = \sup\{\, R_\lambda(u, s_0) : u \in U_S,\; C^k_\lambda(u, s_0) \le V_k \text{ for all } k \,\},$$
where $U_S$ denotes the set of all stationary policies.
A stationary policy $u^*$ is called optimal ($\epsilon$-optimal) if $R_\lambda(u^*, s_0) = R^*_\lambda(s_0)$ (respectively, $R_\lambda(u^*, s_0) \ge R^*_\lambda(s_0) - \epsilon$).
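As a concrete illustration (not part of the original development), the constrained discounted problem just described can be attacked directly by the classical occupation-measure linear program; the conclusion of this paper also relies on linear programming for the sub-MDPs. The following is a minimal sketch of that formulation, assuming NumPy and SciPy; the helper name `solve_cmdp_lp` and the array layout are our own conventions rather than notation from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def solve_cmdp_lp(P, r, c, V, lam, s0):
    """Occupation-measure LP for a constrained discounted MDP (sketch).

    P[s, a, j] : transition probabilities p(j | s, a)
    r[s, a]    : rewards;  c[k, s, a] : costs;  V[k] : cost bounds
    lam        : discount factor;  s0 : fixed initial state
    Returns a stationary (possibly randomized) policy and its discounted value.
    """
    n_s, n_a, _ = P.shape
    # Variables: occupation measure rho(s, a) >= 0, flattened to length n_s * n_a.
    # Flow constraints: sum_a rho(j, a) - lam * sum_{s,a} p(j|s,a) rho(s, a) = 1{j = s0}.
    A_eq = np.zeros((n_s, n_s * n_a))
    for j in range(n_s):
        for s in range(n_s):
            for a in range(n_a):
                A_eq[j, s * n_a + a] = (1.0 if s == j else 0.0) - lam * P[s, a, j]
    b_eq = np.eye(n_s)[s0]
    # Cost constraints: sum_{s,a} rho(s, a) c_k(s, a) <= V_k.
    A_ub = c.reshape(c.shape[0], -1)
    b_ub = np.asarray(V, dtype=float)
    # linprog minimizes, so negate the reward to maximize it.
    res = linprog(-r.reshape(-1), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (n_s * n_a), method="highs")
    if not res.success:
        raise ValueError("constrained MDP is infeasible or the LP solver failed")
    rho = res.x.reshape(n_s, n_a)
    # A stationary policy: normalize rho(s, .) where it has positive mass.
    mass = np.maximum(rho.sum(axis=1, keepdims=True), 1e-12)
    policy = np.where(rho.sum(axis=1, keepdims=True) > 1e-12, rho / mass, 1.0 / n_a)
    return policy, -res.fun
```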
We solve the problem of constrained discounted Markov Decision Processes by exploiting the decomposition of the state space into strongly communicating classes in several steps. First, we solve the restricted MDPs in Subsection 3.1. We then introduce a new MDP, called the intermediate MDP, in Subsection 3.2 and find a corresponding optimal policy. In Section 4, we combine the results of Subsections 3.1 and 3.2 in order to construct a nearly optimal policy for the original problem.

Sample space, policies and measures
The finite state and action spaces are denoted by $S$ and $A$, respectively. The sample space is given by $\Omega = (S \times A)^{\infty}$, so that a typical realization can be represented as $\omega = (s_0, a_0, s_1, a_1, \ldots)$. The state and action random variables $X_t$ and $A_t$, for $t = 0, 1, \ldots$, are then defined as the coordinate mappings $X_t(\omega) = s_t$ and $A_t(\omega) = a_t$. The sample space is equipped with the $\sigma$-algebra $\mathcal{F}$ generated by the random variables $\{X_t, A_t : t = 0, 1, \ldots\}$.
In order to give a formal definition of a policy, first let $P(A)$ be the set of all probability measures on the action space $A$; a policy $u$ specifies, for each history, an element of $P(A)$ according to which the next action is chosen. The law of motion $p(j \mid s, a)$ is given and determined by the physics of the problem. From a standard application of the Kolmogorov consistency theorem, we know that for every policy $u$ and initial state $s_0$ there exists a unique probability measure $P^u_{s_0}$ on $(\Omega, \mathcal{F})$ such that
$$P^u_{s_0}(X_{t+1} = j \mid X_0, A_0, \ldots, X_t = s, A_t = a) = p(j \mid s, a). \qquad (1)$$

Fig. 1. The movement of a robot
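To make the induced measure $P^u_{s_0}$ concrete, the sketch below simulates finite prefixes of the coordinate processes $\{X_t\}$ and $\{A_t\}$ under a stationary randomized policy; the function name and array layout are illustrative assumptions.

```python
import numpy as np

def sample_trajectory(P, policy, s0, horizon, rng=None):
    """Simulate (X_0, A_0, X_1, A_1, ...) under a stationary randomized policy.

    The empirical distribution of such trajectories approximates the measure
    P^u_{s0} whose existence follows from the Kolmogorov consistency theorem.
    P[s, a, j] is the law of motion; policy[s, a] = u(a | s).
    """
    rng = np.random.default_rng() if rng is None else rng
    states, actions = [s0], []
    s = s0
    for _ in range(horizon):
        a = rng.choice(P.shape[1], p=policy[s])   # draw A_t ~ u(. | X_t)
        j = rng.choice(P.shape[2], p=P[s, a])     # draw X_{t+1} ~ p(. | X_t, A_t)
        actions.append(a)
        states.append(j)
        s = j
    return states, actions
```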

Decomposition theory
The state space $S$ has a natural partition into strongly communicating classes $C_1, \ldots, C_m$ and a set of states $T$. This decomposition has the following properties:
i) The states in $T$ are transient under all stationary policies.
ii) If a set of states is a recurrent class associated with some stationary policy, then it is contained in one of the strongly communicating classes.
iii) There exists a stationary policy whose associated recurrent classes exactly correspond to the strongly communicating classes.
iv) Under any stationary policy $u$ and any given initial state $s$, we have $P^u_s(X_t \in C_1 \cup \cdots \cup C_m \text{ a.a.}) = 1$, where "a.a." abbreviates "almost always". Hence, for every policy $u$, the state process eventually enters one of the strongly communicating classes and remains there forever.
v) The partition $\{C_1, \ldots, C_m, T\}$ can be obtained by an efficient polynomial-time algorithm (K.W. Ross and R. Varadarajan [5]), based on a depth-first procedure from graph theory.
In this subsection, any restricted MDP of the given MDP has the same law of motion as the original MDP. Given an MDP $M$ with state space $S$, we define the state-dependent action spaces $A(s)$, for all $s \in S$. Then, we invoke the recursive procedure FIND-CLASSES to find the strongly communicating classes of $M$; a sketch of this kind of procedure is given below.
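The following is a rough Python sketch of the recursive idea, written as a maximal end-component style decomposition; it is not the exact FIND-CLASSES procedure of [5], and the use of networkx for strongly connected components is an implementation choice of ours.

```python
import networkx as nx

def strongly_communicating_classes(P, eps=1e-12):
    """Split the state space into strongly communicating classes and a transient set (sketch).

    P[s][a][j] is the law of motion. Returns (classes, transient).
    """
    n_s = len(P)
    actions = {s: set(range(len(P[s]))) for s in range(n_s)}
    alive = set(range(n_s))
    changed = True
    while changed:
        changed = False
        # Directed graph on remaining states using remaining actions only.
        g = nx.DiGraph()
        g.add_nodes_from(alive)
        for s in alive:
            for a in actions[s]:
                for j, p in enumerate(P[s][a]):
                    if p > eps:
                        g.add_edge(s, j)
        comp_of = {}
        for c_id, scc in enumerate(nx.strongly_connected_components(g)):
            for s in scc:
                comp_of[s] = c_id
        for s in list(alive):
            # Drop actions that can leave the current component of s.
            bad = {a for a in actions[s]
                   if any(p > eps and comp_of[j] != comp_of[s]
                          for j, p in enumerate(P[s][a]))}
            if bad:
                actions[s] -= bad
                changed = True
            if not actions[s]:
                alive.discard(s)   # s cannot stay anywhere: it belongs to no class
                changed = True
    # The surviving strongly connected components are the strongly communicating classes.
    g = nx.DiGraph()
    g.add_nodes_from(alive)
    for s in alive:
        for a in actions[s]:
            for j, p in enumerate(P[s][a]):
                if p > eps:
                    g.add_edge(s, j)
    classes = [set(scc) for scc in nx.strongly_connected_components(g)]
    transient = set(range(n_s)) - alive
    return classes, transient
```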

Fig.2. The state transition diagram
The Ross-Varadarajan algorithm provides the strongly communicating classes $C_1$ and $C_2$ and the set of transient states $T$.

Fig.3. The decomposition method
We set, for each state $s \in C_i$, $i = 1, \ldots, m$:
$$A_i(s) = \{\, a \in A(s) : p(j \mid s, a) = 0 \text{ for all } j \notin C_i \,\}.$$
Starting from a state $s \in C_i$, the set $A_i(s)$ contains the actions which guarantee that the state process will remain in the strongly communicating class $C_i$.
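As a small illustration, the restricted action sets $A_i(s)$ can be computed directly from this definition; the helper below assumes the `classes` produced by the earlier decomposition sketch and is only an illustrative sketch.

```python
def restricted_actions(P, classes, eps=1e-12):
    """A_i(s): actions that keep the process inside the class C_i containing s (sketch)."""
    A_i = {}
    for C in classes:
        for s in C:
            A_i[s] = [a for a in range(len(P[s]))
                      if all(p <= eps or j in C for j, p in enumerate(P[s][a]))]
    return A_i
```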

Restricted MDPs
For each $i = 1, \ldots, m$ we define a new MDP, called MDP $i$, as follows: 1) the state space is $C_i$; 2) for each $s \in C_i$, the set of available actions is given by the state-dependent action space $A_i(s)$; 3) the law of motion, cost and reward functions are the same as for the original MDP but restricted to the state-dependent action spaces $A_i(s)$.
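The following sketch builds the data of MDP $i$ by re-indexing the states of $C_i$ and the actions of $A_i(s)$; the padding trick and the idea of feeding the result to the hypothetical `solve_cmdp_lp` helper from the introduction are illustrative choices of ours, not constructions from [5].

```python
import numpy as np

def restrict_mdp(P, r, c, C_i, A_i):
    """Build MDP i: states C_i, actions A_i(s), restricted dynamics, rewards and costs (sketch).

    P[s][a][j], r[s][a] may be nested lists or arrays; c is an ndarray of shape (K, |S|, |A|).
    Unused action slots are padded by repeating the last allowed action, so the
    result can be fed to a generic constrained-MDP solver.
    """
    states = sorted(C_i)
    idx = {s: k for k, s in enumerate(states)}
    n = len(states)
    max_a = max(len(A_i[s]) for s in states)
    P_i = np.zeros((n, max_a, n))
    r_i = np.zeros((n, max_a))
    c_i = np.zeros((c.shape[0], n, max_a))
    for s in states:
        acts = A_i[s]
        for k in range(max_a):
            a = acts[min(k, len(acts) - 1)]   # pad with the last allowed action
            for j in states:
                P_i[idx[s], k, idx[j]] = P[s][a][j]
            r_i[idx[s], k] = r[s][a]
            c_i[:, idx[s], k] = c[:, s, a]
    return P_i, r_i, c_i, states
```

The re-indexed data can then be passed, for instance, to `solve_cmdp_lp(P_i, r_i, c_i, V, lam, idx[s])` for a chosen starting state $s \in C_i$; this pairing is, again, only an illustration.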
Proposition 2 (see [5]). Proof. If ..., then .... Since the reward function is bounded below (due to the finite state and action spaces), the result then follows directly from Theorem 2.

Remark 1. Corollary 1, combined with the fact that the events in which the state process is eventually absorbed in $C_i$, $i = 1, \ldots, m$, form a partition of the sample space, implies that the feasible set of at least one restricted MDP $i$ is nonempty whenever the feasible set of the original problem is nonempty.

Intermediate MDP
Over the original sample space, define for each policy $u$ an expected time-average reward. In this subsection we consider the unconstrained problem of maximizing this quantity over all stationary policies. From Proposition 1 and Lebesgue's dominated convergence theorem, we obtain the corresponding expression of this time-average reward for every policy. It is well known that there exists an optimal pure policy for this problem, which can be found by the standard tools of dynamic programming, such as policy improvement, the value iteration algorithm or the linear programming approach (see Ross and Varadarajan [5], Baykal-Gursoy and Ross [6], Puterman [11], Bertsekas [16]). The following lemma gives an upper bound for the supremum of the original discounted reward.

Without loss of generality, we may assume that each $C_i$ is closed under the optimal policy $w$ (otherwise, modify $w$ so that $w(s) \in A_i(s)$ for all $s \in C_i$; clearly the modified policy has the desired property, and it is not difficult to show that it continues to maximize the time-average reward). A small sketch of this modification is given below.
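The "without loss of generality" modification above is easy to implement; this sketch assumes a pure policy `w` represented as a dict from states to actions, together with the classes and the `A_i` sets computed earlier.

```python
def close_classes_under(w, classes, A_i):
    """Modify a pure policy w so that each strongly communicating class is closed under it.

    If w(s) would let the process leave the class containing s, replace it by any
    action in A_i(s); as noted in the text, this does not change the objective.
    """
    w = dict(w)
    for C in classes:
        for s in C:
            if w[s] not in A_i[s]:
                w[s] = A_i[s][0]
    return w
```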

Aggregated MDP
For solving the intermediate MDP problem, we use the well-known technique called the aggregated MDP method.
The aggregated MDP is defined as follows:
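As an illustration only, the sketch below builds one common form of aggregation, in which every strongly communicating class is collapsed into an absorbing super-state carrying a class-level reward, while transient states keep their original actions; this particular construction and the name `build_aggregated_mdp` are assumptions on our part and need not coincide with the aggregated MDP defined in [5].

```python
import numpy as np

def build_aggregated_mdp(P, classes, transient, class_reward):
    """One common aggregation (an assumption, not necessarily the paper's definition):

    each strongly communicating class C_i becomes a single absorbing super-state with
    reward class_reward[i]; transient states keep their actions, with transition mass
    into C_i redirected to super-state i.
    """
    trans = sorted(transient)
    n_t, m = len(trans), len(classes)
    n = n_t + m                                    # transient states first, then super-states
    idx = {s: k for k, s in enumerate(trans)}
    max_a = max(len(P[s]) for s in trans) if trans else 1
    P_agg = np.zeros((n, max_a, n))
    r_agg = np.zeros((n, max_a))
    for s in trans:
        for a in range(len(P[s])):
            for j, p in enumerate(P[s][a]):
                if j in idx:                       # transient -> transient
                    P_agg[idx[s], a, idx[j]] += p
                else:                              # transient -> some class C_i
                    i = next(i for i, C in enumerate(classes) if j in C)
                    P_agg[idx[s], a, n_t + i] += p
        for a in range(len(P[s]), max_a):          # pad missing action slots with action 0
            P_agg[idx[s], a] = P_agg[idx[s], 0]
    for i in range(m):                             # super-states are absorbing
        P_agg[n_t + i, :, n_t + i] = 1.0
        r_agg[n_t + i, :] = class_reward[i]
    return P_agg, r_agg
```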

An optimal policy for the original MDP
In this section we construct a stationary policy $u^*$ as follows: 1) for each $i$, let $u_i$ be the optimal stationary policy for the MDP $i$ as given in Subsection 3.1; 2) let $w$ be the optimal policy as given in Subsection 3.2, and let $I$ be the set of indices $i$ such that $C_i$ is closed under $w$; 3) define the stationary policy $u^*$ as follows: when in a state $s \in C_i$ with $i \in I$, apply the policy $u_i$; otherwise apply $w$ (a sketch of this combination step is given at the end of this section).
Theorem 3. The stationary policy $u^*$ as constructed above is optimal for the original problem.

Proof.
Since $u^*$ is identical to $w$ outside of $\bigcup_{i \in I} C_i$, and since each $C_i$, $i \in I$, is closed under both $u_i$ and $w$, we get, for all $i \in I$, (8). From Rosenthal [9] and the fact that $u^*$ is identical to $u_i$ over $C_i$ for each $i \in I$, we have, for $i \in I$, that the discounted costs of $u^*$ satisfy the constraints. Thus, $u^*$ is a feasible policy for the original problem (i.e. $C^k_\lambda(u^*, s_0) \le V_k$ for all $k$).
On the other hand, the discounted reward of $u^*$ can be decomposed as a sum over the classes $C_i$, $i \in I$; by combining (5), (7), (8) and Lemma 1, we get $R_\lambda(u^*, s_0) \ge R^*_\lambda(s_0)$. Hence, the stationary policy $u^*$ is optimal for the original problem.

Remark 2.
If for some $i$ the policy $u_i$ is only nearly optimal ($\epsilon$-optimal) for the MDP $i$, then $u^*$ is nearly optimal for the original problem.
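The combination step 3) of Section 4 is straightforward to code; the sketch below assumes pure policies stored as dicts, with `closed_idx` playing the role of the index set $I$.

```python
def combine_policies(w, restricted_policies, classes, closed_idx):
    """Construct the stationary policy u* of Section 4 (sketch).

    In a state s belonging to C_i with i in the set I of classes closed under w,
    play the policy u_i optimal for the restricted MDP i; everywhere else play w.
    `w` and each `restricted_policies[i]` map states to actions.
    """
    u_star = dict(w)
    for i in closed_idx:
        for s in classes[i]:
            u_star[s] = restricted_policies[i][s]
    return u_star
```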

Conclusion
The theoretical framework of Markov decision processes provides the semantic foundation for a wide range of problems involving planning under uncertainty. For solving large-scale MDPs, a decomposition method is often required. This approach consists in exploiting a natural decomposition of the state space into subsets that are weakly coupled, so that each small MDP can be solved separately via linear programming. It applies well, for example, to artificial intelligence problems such as robot navigation in environments with many rooms. Thus, the decomposition method, combined with the intermediate MDP technique, reduces the complexity of computing an optimal or a nearly optimal policy for constrained Markov Decision Process problems.
In future work, we will study the possibility of applying dynamic programming tools to solve constrained Markov decision processes in the discounted case.