Algorithms of approximate dynamic programming for hydro scheduling

In hydro scheduling, unit commitment is a complex sub-problem. This paper proposes an approximate dynamic programming technique for solving it, based on the least-squares policy iteration (LSPI) algorithm, which uses data efficiently and converges quickly. LSPI takes the properties of the widely used least-squares temporal difference (LSTD) algorithm, extends them, and makes them useful for optimization problems. First, the value function of a fixed policy is found with the least-squares temporal difference Q (LSTDQ) algorithm, which is similar to LSTD. LSPI then uses the results of LSTDQ to build a policy iteration algorithm. It combines the data efficiency of LSTDQ with the policy-search efficiency of policy iteration.


Introduction
Electric power system operation requires a suitable schedule for the unit commitment (UC) problem over a day or a week. The aim of this scheduling is to reduce cost and increase profit. The problem becomes complex when it accounts for constraints such as spinning reserve, power balance, minimum up/down times, and network security [1]. Unit commitment also involves integer decision variables, which complicate it further [2]. Several methods have been applied to the unit commitment problem, including dynamic programming, Lagrangian relaxation, mixed-integer linear programming, genetic algorithms, and fuzzy logic. Among them, dynamic programming, Lagrangian relaxation, and mixed-integer linear programming are the most commonly used. Dynamic programming is well suited to multi-period decision problems: it handles the decision variables directly and yields a globally optimal decision. Its drawback is that it must evaluate every configuration of units at each stage of the UC problem, so the computational effort grows exponentially with the number of units. This problem is known as the "curse of dimensionality".
The curse of dimensionality was the initial motivation for the development of approximate dynamic programming. The field has emerged under several names, including heuristic dynamic programming, adaptive dynamic programming, reinforcement learning, and neuro-dynamic programming. Control theory was the first field to develop computational methods that make dynamic programming practical. In the 1980s, the computer science community introduced the name reinforcement learning [3]. Reinforcement learning has mainly emphasized discrete state and action spaces, although it also addresses continuous problems. In the 1990s, several models related to dynamic management were developed using adaptive dynamic programming. Much of the work on approximate dynamic programming dates from the 2000s [4].
A variety of approximate dynamic programming (ADP) models and algorithms have been introduced in recent years. Approximate dynamic programming can solve a large class of Markov decision problems. Its basic idea is that, instead of computing the optimal value of every state, the method approximates the values of the feasible states, and in this way it mitigates the curse of dimensionality. This paper proposes an ADP technique that solves the unit commitment problem with the least-squares policy iteration (LSPI) algorithm. Before LSPI is developed, LSTDQ is introduced, because it is used in the policy evaluation step of policy iteration [5][6][7][8]. LSTDQ is effective at finding the value function of a fixed policy. It uses data efficiently and converges faster than other methods, but it can cause a problem [9]: combined with policy iteration, it can oscillate between two bad policies, even in an MDP with only four states. A further advantage of LSPI is that it has no parameters that require careful tuning.
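To make the curse of dimensionality described in this introduction concrete, the following minimal sketch counts the on/off configurations that an exhaustive dynamic programming recursion would have to consider. It assumes, purely for illustration, that each unit has only two states (on or off) and ignores extra state needed for minimum up/down times. The function names are hypothetical.

```python
def count_commitment_states(n_units: int) -> int:
    """Number of on/off configurations of n_units generating units."""
    return 2 ** n_units


def count_stage_transitions(n_units: int) -> int:
    """Configuration pairs compared between two consecutive stages."""
    states = count_commitment_states(n_units)
    return states * states


# Growth of the search space with the number of units.
for n in (5, 10, 20):
    print(n, count_commitment_states(n), count_stage_transitions(n))
```

Even under this simplification, the number of configuration pairs per stage is squared in the number of configurations, which is why exhaustive enumeration quickly becomes impractical.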

Approximate Policy Iteration
The policy iteration algorithm constructs policies from value functions. It has two parts: policy evaluation, which computes the value function of the current policy, and policy improvement, which derives a better policy from that value function. In large or continuous spaces, exact policy evaluation is difficult, so the value function is approximated. The algorithm is presented in figure 1.
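The two steps can be illustrated on a very small problem where exact evaluation is still possible. The sketch below is not the algorithm of figure 1. It is a tabular policy iteration on a hypothetical two-state, two-action model (action 0 keeps the state, action 1 flips it), with made-up rewards that are maximized and an assumed discount factor of 0.9.

```python
GAMMA = 0.9
N_STATES, N_ACTIONS = 2, 2
# Hypothetical toy model: action 0 keeps the state, action 1 flips it.
NEXT_STATE = [[0, 1], [1, 0]]
REWARD = [[0.0, -1.0], [2.0, 0.0]]


def evaluate_policy(policy, sweeps=500):
    """Policy evaluation: iterate V(s) = r(s, pi(s)) + gamma * V(s')."""
    values = [0.0] * N_STATES
    for _ in range(sweeps):
        values = [
            REWARD[s][policy[s]] + GAMMA * values[NEXT_STATE[s][policy[s]]]
            for s in range(N_STATES)
        ]
    return values


def improve_policy(values):
    """Policy improvement: choose the greedy action in every state."""
    new_policy = []
    for s in range(N_STATES):
        scores = [REWARD[s][a] + GAMMA * values[NEXT_STATE[s][a]]
                  for a in range(N_ACTIONS)]
        new_policy.append(scores.index(max(scores)))
    return new_policy


def policy_iteration(policy):
    """Alternate evaluation and improvement until the policy is stable."""
    while True:
        values = evaluate_policy(policy)
        new_policy = improve_policy(values)
        if new_policy == policy:
            return policy, values
        policy = new_policy


policy, values = policy_iteration([0, 0])
print(policy, values)
```

In this toy model the loop stabilizes after one improvement step. In large problems, the evaluation step is the part that has to be replaced by an approximation.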
3. Transition model: P(s, a, s'), also called the system model, is the probability of making a transition to state s' when action a is taken in state s.
4. Cost function: the cost incurred when a transition occurs.
5. Discount factor: γ ∈ [0,1] is the discount factor applied to future rewards.
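The components listed above can be held in a small data structure. The following sketch is only an illustration under stated assumptions: the class name `ToyMDP`, the state and action labels, and all probabilities and costs are hypothetical and do not come from a real hydro system.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyMDP:
    transition: dict  # transition[(s, a)] -> {s_next: P(s, a, s_next)}
    cost: dict        # cost[(s, a, s_next)] -> cost of that transition
    gamma: float      # discount factor

    def validate(self):
        """Check the discount factor and that each P(s, a, .) sums to 1."""
        if not 0.0 <= self.gamma <= 1.0:
            raise ValueError("gamma must lie in [0, 1]")
        for (s, a), row in self.transition.items():
            if abs(sum(row.values()) - 1.0) > 1e-9:
                raise ValueError(f"P({s}, {a}, .) does not sum to 1")

    def expected_cost(self, s, a):
        """Expected one-step cost of taking action a in state s."""
        return sum(p * self.cost[(s, a, s_next)]
                   for s_next, p in self.transition[(s, a)].items())

    def step(self, s, a, rng):
        """Sample a successor state and return it with the incurred cost."""
        row = self.transition[(s, a)]
        s_next = rng.choices(list(row), weights=list(row.values()))[0]
        return s_next, self.cost[(s, a, s_next)]


# Hypothetical single-unit example: the unit is "off" or "on".
mdp = ToyMDP(
    transition={
        ("off", "hold"): {"off": 1.0},
        ("off", "toggle"): {"on": 0.9, "off": 0.1},
        ("on", "hold"): {"on": 1.0},
        ("on", "toggle"): {"off": 1.0},
    },
    cost={
        ("off", "hold", "off"): 0.0,
        ("off", "toggle", "on"): 5.0,
        ("off", "toggle", "off"): 1.0,
        ("on", "hold", "on"): 2.0,
        ("on", "toggle", "off"): 0.5,
    },
    gamma=0.95,
)
mdp.validate()
s_next, c = mdp.step("off", "toggle", random.Random(0))
```

A sampled transition of this kind, together with the cost observed on it, is the raw material that sample-based methods work from when the transition model itself is not available.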

Least square temporal difference Q
The LSTDQ algorithm solves a linear system by inverting the matrix Ã. It uses basis functions of the state and action, as shown in figure 2. It also allows the distribution of the samples to be controlled. The inversion of Ã can run into a singularity problem. This problem is mitigated by adding a small multiple of the identity matrix, which does not affect the convergence properties of the algorithm.
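A minimal sketch of this computation is given below. It is not the procedure of figure 2. It assumes samples of the form (s, a, r, s_next), where r is the observed reward (or negated cost), and it solves the linear system directly instead of forming an explicit inverse. The function name `lstdq`, the `ridge` parameter, and the toy data are assumptions made for illustration.

```python
import numpy as np


def lstdq(samples, phi, policy, gamma, k, ridge=1e-6):
    """Estimate weights w such that Q^pi(s, a) ~ phi(s, a) @ w."""
    A = ridge * np.eye(k)  # small identity term guards against singularity
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))  # action of the evaluated policy
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)


# Hypothetical toy problem: 2 states, 2 actions, one-hot (tabular) basis.
GAMMA = 0.9


def phi(s, a):
    f = np.zeros(4)
    f[2 * s + a] = 1.0
    return f


def stay_policy(s):
    return 0  # fixed policy under evaluation: always take action 0


# One sample per state-action pair: (s, a, r, s_next).
samples = [(0, 0, 0.0, 0), (0, 1, -1.0, 1), (1, 0, 2.0, 1), (1, 1, 0.0, 0)]
w = lstdq(samples, phi, stay_policy, GAMMA, k=4)
print(w)
```

With a one-hot basis the weights are simply the state-action values of the fixed policy, which makes the result easy to check by hand. A smaller basis would give an approximation instead.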

Least square policy iteration algorithm
The basic principle of least-squares policy iteration is to use policy iteration and LSTDQ together. It is an off-policy algorithm, and it can use the same data in every iteration to generate the policies. Figure 3 shows the least-squares policy iteration algorithm. In LSPI, a linear architecture with k basis functions is used to approximate the state-action value function:
$$\hat{Q}(s, a; \omega) = \sum_{j=1}^{k} \phi_j(s, a)\,\omega_j = \phi(s, a)^{\top} \omega$$
Maximizing the approximate values over all actions in A gives the greedy policy π at any given state s:
$$\pi(s) = \arg\max_{a \in A} \hat{Q}(s, a; \omega) = \arg\max_{a \in A} \phi(s, a)^{\top} \omega$$
Here ω is the weight vector, φ is the vector of basis functions, and argmax denotes the action at which the function value is maximized.
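The greedy action selection described above can be sketched in a few lines. The basis function and the weight vector below are hypothetical values chosen only so that the result can be checked by hand.

```python
def q_value(phi, w, s, a):
    """Linear approximation: inner product of phi(s, a) and the weights w."""
    return sum(f * wj for f, wj in zip(phi(s, a), w))


def greedy_action(phi, w, s, actions):
    """Action in `actions` that maximizes the approximate value at state s."""
    return max(actions, key=lambda a: q_value(phi, w, s, a))


def example_phi(s, a):
    """Hypothetical basis: constant, state, action and their product."""
    return [1.0, float(s), float(a), float(s * a)]


example_w = [0.0, 1.0, -1.0, 2.0]
best_at_0 = greedy_action(example_phi, example_w, 0, [0, 1])
best_at_1 = greedy_action(example_phi, example_w, 1, [0, 1])
```

Because the policy is recomputed from the weights whenever it is needed, it never has to be stored explicitly.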
The policy π is represented by the basis functions φ and a set of parameters. The policy π is fed to LSTDQ together with a set of samples. LSTDQ determines the action π(s') for each s' of each sample (s, a, s') in the set. In every iteration, any source of samples D can be used for the call to LSTDQ: the initial set of samples can be reused in each iteration of LSPI, or D can be updated between iterations. The parameters ω of the approximation architecture represent both the policies and the value functions, so least-squares policy iteration can be viewed as an iteration over the parameters ω. The generic bound for approximate policy iteration therefore applies to LSPI. Because the policy is not represented separately, LSPI also removes the approximation error that the actor part introduces in an actor-critic architecture.
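The iteration over ω can be sketched as follows. This is a minimal illustration, not the algorithm of figure 3: it assumes samples of the form (s, a, r, s_next) with rewards that are maximized, reuses one fixed sample set in every iteration, and stops when the weights no longer change. The names `lstdq`, `lspi`, `ridge`, `tol` and the toy data are assumptions.

```python
import numpy as np


def lstdq(samples, phi, w_policy, actions, gamma, ridge=1e-6):
    """Evaluate the greedy policy induced by w_policy on the given samples."""
    k = len(w_policy)
    A = ridge * np.eye(k)  # small identity term guards against singularity
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        # Greedy action at s_next under the policy being evaluated.
        a_next = max(actions, key=lambda u: phi(s_next, u) @ w_policy)
        A += np.outer(f, f - gamma * phi(s_next, a_next))
        b += r * f
    return np.linalg.solve(A, b)


def lspi(samples, phi, k, actions, gamma, tol=1e-6, max_iter=50):
    """Repeat LSTDQ on the same samples until the weights stop changing."""
    w = np.zeros(k)
    for _ in range(max_iter):
        w_new = lstdq(samples, phi, w, actions, gamma)
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w


def greedy_policy(phi, w, states, actions):
    """Greedy action for every state under the weights w."""
    return [max(actions, key=lambda a: phi(s, a) @ w) for s in states]


# Hypothetical toy problem: 2 states, 2 actions, one-hot (tabular) basis.
def phi(s, a):
    f = np.zeros(4)
    f[2 * s + a] = 1.0
    return f


# One sample per state-action pair: (s, a, r, s_next).
samples = [(0, 0, 0.0, 0), (0, 1, -1.0, 1), (1, 0, 2.0, 1), (1, 1, 0.0, 0)]
w = lspi(samples, phi, k=4, actions=[0, 1], gamma=0.9)
final_policy = greedy_policy(phi, w, states=[0, 1], actions=[0, 1])
print(w, final_policy)
```

For a cost-minimization setting such as unit commitment, the same sketch applies if the costs are negated or the maximization is replaced by a minimization.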

Discussion
There is a rich literature on solving the unit commitment problem under different constraints. The algorithms and models differ even when they fall within the same category. The studies in [10][11][12] used augmented Lagrangian relaxation; sequential Lagrangian relaxation and MILP; Lagrangian relaxation with decomposition and genetic algorithms; decomposition techniques (block coordinate descent and the auxiliary problem principle); dynamic programming; and stochastic Lagrangian relaxation methods. In Lagrangian relaxation, the execution time varies linearly with the number of time stages. Dynamic programming combined with heuristic techniques is reported to be very efficient when hourly loads are imprecise. Least-squares policy iteration is a model-free algorithm that is used here mainly to mitigate the curse of dimensionality. It eliminates the approximation error of the actor and has no parameters that need tuning. Unlike other algorithms, LSPI can collect its samples once and reuse them in every iteration.

Conclusion
In this paper, least-squares policy iteration (LSPI) is presented as an algorithm for mitigating the curse of dimensionality in the unit commitment problem. LSPI learns an approximation of the state-action value function of a fixed policy. It first calls least-squares temporal difference Q (LSTDQ) and then uses the results of LSTDQ to carry out policy iteration. LSPI is recommended for the UC problem because it converges within a relatively small number of trials using randomly selected actions. Moreover, LSPI is a model-free algorithm, and there is no risk of overshooting because it does not take gradient steps. LSPI also makes better use of function approximation and converges faster.
Numerous studies have recommended LSPI, but further work is needed on implementing the algorithm for real-world unit commitment problems. As an extension of this work, the algorithm should be applied to a specific case study and validated. A comparison with the literature should also be carried out in the future, so that the outcomes are meaningful for that particular application.