Multi-agent deep reinforcement learning concept for mobile cyber-physical systems control

The high complexity of mobile cyber-physical system (MCPS) dynamics makes it difficult to apply classical methods to optimize MCPS agent control policies. This makes intelligent control methods increasingly relevant, in particular those based on artificial neural networks (ANN) and multi-agent deep reinforcement learning (MDRL). In practice, the application of MDRL to MCPS faces the following problems: 1) existing MDRL methods have low scalability; 2) inference of the ANNs used has high computational complexity; 3) MCPS trained using existing methods have low functional safety. To address these problems, we propose the concept of a new MDRL method based on the existing MADDPG method. Within this concept, it is proposed: 1) to increase the scalability of MDRL by using information not about all other MCPS agents, but only about the n nearest neighbors; 2) to reduce the computational complexity of ANN inference by using a sparse ANN structure; 3) to increase the functional safety of trained MCPS by using a training sample with a non-uniform distribution of initial states. The proposed concept is expected to help address the challenges of applying MDRL to MCPS; experimental studies are planned to confirm this.


Introduction
With the advancement of deep reinforcement learning methods, increasingly complex problems are attracting researchers' interest. One of the newest and most actively developing areas of deep reinforcement learning is multi-agent deep reinforcement learning (MDRL) for multi-agent systems (MAS). The interest in MAS is due to the higher efficiency of decentralized problem solving with MAS compared to similar centralized methods [1][2][3]. MAS are used in such areas [4] as manufacturing [5], transportation, smart homes, robotics [6][7][8][9][10], aviation, infrastructure, medicine [11], etc. An important problem in MAS development is that synthesizing the behavior policies of individual agents from the specified MAS-level objectives is ambiguous and non-trivial. At the same time, single-agent deep reinforcement learning (SDRL) has established itself as a powerful and versatile tool for solving intellectual problems at a level comparable to that of a human [12][13][14]. Taken together, these factors make the extension of SDRL to MAS, in the form of MDRL, a relevant research direction.
The two main challenges for MDRL are learning convergence and scalability [15]. The scalability problem also includes the problem of the number of agents changing during MAS operation. This problem is of particular importance for such a type of MAS as mobile cyber-physical systems (MCPS). Compared with other types of MAS, MCPS, such as groups of UAVs [16] or mobile robots, are characterized by operating in a non-deterministic environment with a relatively high probability of failure of individual agents, instability of the bandwidth and topology of communication channels [17], etc. MCPS also impose higher requirements on the reliability of control methods [17,18]. An error in the calculation process [20,21] can lead to damage to and failure of MCPS agents.
MCPS are also characterized by: complex operating dynamics; a large number of possible system states; and a high unit cost of agents' computing power. MDRL is an artificial intelligence method based on ANNs, whose characteristic features are: difficulty in interpreting learning outcomes; high computational complexity; and poor scalability.
The listed features of MCPS and MDRL have led to the emergence of two architectures of MCPS control systems using MDRL: 1) with the calculation of ANN inference on a computing system external to the agent; 2) with the calculation of ANN inference on the on-board computing systems of agents.
In the architecture with inference computed on an external system, agents are equipped with a microcontroller and a simple multi-core GPU. Inference is computed on an external high-performance server for all agents at once. With this architecture, communication between the agents and the base station becomes critical.
In the architecture with inference computed onboard, agents are equipped with high-performance graphics processing units (GPUs) such as the Nvidia GTX 1080 or Nvidia TX2. Computing inference on onboard equipment allows agents to operate autonomously when communication with the base station is unavailable. However, the high cost of the GPUs used significantly increases the cost of individual agents and of the MCPS as a whole.
Thus, the following issues arise when applying MDRL to MCPS: 1. Poor scalability of existing methods leads to a large demand for computational resources, long training times, and/or a high cost of the MDRL process as the number of MCPS agents grows.
2. High computational complexity of the ANN inference used to control MCPS agents requires an increase in the computational power of the agents, which leads to a significant increase in their cost.
3. Low functional safety of trained MCPS leads to a significant limitation of their areas of use.
To solve the above problems, the article proposes the concept of a new MDRL method based on the MADDPG method. The article is structured as follows. Section 2 reviews the existing MDRL methods and problems, substantiates the use of the MADDPG method as a prototype, and discusses effective solutions applied in areas adjacent to MDRL. Section 3 analyzes the problem areas of the MADDPG method and describes the proposed concept for its improvement.

Literature review
According to the object of approximation, MDRL methods can be divided into value-based and policy-gradient methods. According to the training conditions, MDRL methods are divided into collaborative and independent.
In value-based methods, the object of approximation is the value function; in gradient methods, it is the decision-making policy, with or without a value function. Value-based methods [22][23][24][25][26][27][28][29] are represented mainly by methods based on the DQN (deep Q-network) learning algorithm [12] and its modification DRQN (deep recurrent Q-network) [30]. The main disadvantages of value-based methods are: relatively low training efficiency [31,32]; and a focus on discrete action spaces. A gradient method with independent learning is the FTW method [33], an adaptation of a single-agent deep reinforcement learning method based on the actor-learner IMPALA architecture [34]. It was shown in [35] that adapted single-agent gradient methods have weak convergence in a multi-agent environment. This is because the reward received by the i-th agent depends not only on its own action a_i but also on the actions of the other agents {a_j | j ≠ i}; therefore, the probability of obtaining the correct gradient direction decreases exponentially with the number of agents. To solve this problem, [35] proposed the gradient joint-learning method MADDPG (Multi-Agent Deep Deterministic Policy Gradient) and showed that it is more efficient than independent learning methods based on the DQN, DDPG, and TRPO algorithms. Gradient joint-learning methods have the following disadvantages: the need for the number of agents at the learning stage to equal the number of agents at the functioning stage; and the complexity of scaling due to the nonlinearly increasing computational complexity of the learning process as the number of agents grows.
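As a reference point for the concept developed below, the centralized-critic actor update that distinguishes MADDPG from independent methods can be written (following the notation of [35], where o_i is the i-th agent's observation, μ_i its deterministic policy with parameters θ_i, x the joint state, and D the replay buffer):

```latex
\nabla_{\theta_i} J(\mu_i) =
  \mathbb{E}_{x, a \sim \mathcal{D}}\!\left[
    \nabla_{\theta_i} \mu_i(a_i \mid o_i)\,
    \nabla_{a_i} Q_i^{\mu}(x, a_1, \dots, a_N)
    \Big|_{a_i = \mu_i(o_i)}
  \right]
```

Because the centralized critic Q_i^μ takes the actions of all N agents as input, each agent's gradient is computed against the true joint behavior, which is what restores convergence relative to independent learners.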
Value-based MDRL methods are applicable only to tasks with a discrete action set. Gradient-based MDRL joint-learning methods are applicable to continuous-action tasks and have good convergence, but poor scalability. Gradient MDRL methods with independent learning of agents scale well, but struggle to cope with the non-stationarity of the environment [15].
The listed features of gradient MDRL methods with joint and independent training of agents motivate the development of hybrid gradient MDRL methods that simultaneously have high convergence and scalability. In this paper, we propose such a hybrid method based on the gradient joint-learning method MADDPG and the concept of parameter sharing [23]. Parameter sharing consists in using the same ANN to process the signals of different agents.
Despite the practical results obtained with ANNs, the state of an ANN during learning remains uninterpretable due to the large number of elements and the complex structure. Understanding how ANNs learn remains one of the central problems in machine learning [36]. This problem complicates the choice of the optimal ANN configuration and structure. In practice, the ANN configuration is usually determined empirically: some initial value is selected based on expert assessments, then training is performed; if specific symptoms appear, the current configuration is deemed unsuitable and adjusted as necessary [37]. There are methods for the automated synthesis of not only the configuration but also the architecture of the ANN [38], but these methods require large computational costs at the training stage.
ANN structure optimization methods have become widespread in supervised learning. Studying their applicability and effectiveness in MDRL is an urgent task. A review of various methods for optimizing the ANN structure is presented in [39], which also shows the difficulties of directly comparing the existing methods with each other. In this article, for application in the concept, we consider the SET method for optimizing the ANN structure proposed in [40]. The essence of the method is to use an evolutionary algorithm to dynamically change the ANN structure during training, instead of using a fixed, predefined structure. To do this, a sparse ANN structure is used instead of a fully connected one. Due to the sparseness, the ANN structure can change (evolve), requires storing fewer weights in memory, and performs fewer operations. During training, the ANN structure is periodically modified: some of the connections with weights close to zero are removed, and new connections are created at random in their place. It is shown in [40] that, despite the simplified structure, performance remains at the level of a fully connected ANN, and in some cases training convergence is even better.
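The prune-and-regrow step at the core of SET can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the implementation from [40]; the function name, the pruning fraction `zeta`, and the re-initialization scale are illustrative choices.

```python
import numpy as np

def set_rewire(weights, mask, zeta=0.3, rng=None):
    """One SET rewiring step on a sparse weight matrix.

    Prunes the fraction `zeta` of active connections whose weights are
    closest to zero, then grows the same number of new connections at
    random inactive positions, keeping the total connection count fixed.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    active = np.flatnonzero(mask.ravel())
    n_prune = int(zeta * active.size)

    # Prune: drop the active weights with the smallest magnitudes.
    mags = np.abs(weights.ravel()[active])
    pruned = active[np.argsort(mags)[:n_prune]]
    new_mask = mask.ravel().copy()
    new_mask[pruned] = 0

    # Grow: enable the same number of random inactive connections.
    inactive = np.flatnonzero(new_mask == 0)
    grown = rng.choice(inactive, size=n_prune, replace=False)
    new_mask[grown] = 1

    # Zero masked weights; give newly grown links small random values.
    new_w = weights.ravel() * new_mask
    new_w[grown] = rng.normal(0.0, 0.1, size=n_prune)
    return new_w.reshape(weights.shape), new_mask.reshape(mask.shape)
```

Applied periodically between training epochs, this keeps the number of connections constant while letting the topology evolve toward the connections that training actually uses.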
In contrast to pattern recognition, a field of deep learning that uses a large number of different architectures [41], deep reinforcement learning methods mainly use feedforward [42] and recurrent [43] ANNs. In this paper we consider only feedforward ANNs.
The need to comply with functional safety requirements has led to the emergence of a class of "safe" reinforcement learning methods. This area of research was reviewed in [44]. Safe reinforcement learning can be defined as a decision-policy optimization process that maximizes the average reward in tasks where it is important to ensure reasonable system performance and/or adherence to safety constraints during learning and/or operation. The functional safety of MDRL results considered in this work can be viewed as a kind of safe learning aimed at observing safety constraints during the operation of the trained system. While [44] surveys a fairly large number of safe single-agent reinforcement learning (SRL) methods, safe MDRL remains a poorly researched area [25]. In this paper, we propose solutions that, according to the classification in [44], belong to the class of risk-directed exploration of the state space.

Materials and methods
A schematic diagram of the MDRL process model for a cooperative MCPS with a shared reward is shown in Figure 1. The MDRL process comprises two conditional, continuously alternating stages, indicated by dashed lines: the functioning stage and the agent control policy optimization stage. Rectangles denote the entities that transform information; lines denote the transmitted information. Each i-th MCPS agent has its own control policy π_i that transforms the observed state s of the MCPS operating environment into the action a_i taken by the i-th agent. In MDRL, each policy π_i is approximated by an actor ANN. Let us denote the set of all agent control policies as the MCPS control policy π. Within the functioning stage (Fig. 1), MCPS agents observe the state s of the environment and take the joint action a = {a_i | i = 1, …, N}, where N is the number of agents in the MCPS. The environment moves to the next state, and the agents receive a reward r determined by the reward function, which encodes the purpose of the MCPS in MDRL. The experience gained at the functioning stage, in the form of tuples (s, a, r, s′), where s′ is the next state of the environment, is written into the training sample, which is then used at the agent control policy optimization stage.
The implementation of the agent control policy optimization stage is determined by the MDRL method used. In the MADDPG method, an additional critic ANN is used to assess the effectiveness of the agents' actions (Fig. 1). At each policy optimization step, the learning algorithm extracts a minibatch from the training sample and uses it to optimize first the weights of the critic ANN and then the weights of the group of actor ANNs.
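The training sample described above behaves like a standard experience replay buffer holding (s, a, r, s′) tuples. A minimal sketch, with the class name and capacity chosen for illustration rather than taken from the MADDPG specification:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay for (state, joint_action, reward,
    next_state) tuples collected at the functioning stage; the learning
    algorithm samples uniform minibatches at the optimization stage."""

    def __init__(self, capacity=100_000):
        # deque with maxlen silently evicts the oldest experience
        self.buffer = deque(maxlen=capacity)

    def add(self, state, joint_action, reward, next_state):
        self.buffer.append((state, joint_action, reward, next_state))

    def sample(self, batch_size):
        # uniform random minibatch, returned as per-field tuples
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states
```

The bounded capacity matters in practice: older experience gathered under outdated policies is gradually discarded as the agents improve.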
The computational complexity of the MDRL process increases with scaling due to the growing complexity of the ANNs used (Fig. 2). Information about the state of all MCPS agents is fed to the inputs of the actor and critic ANNs; accordingly, as the number of MCPS agents increases, so do the numbers of inputs and connections of the actor and critic ANNs. Within the proposed concept, we hypothesize that for efficient functioning agents do not need information about all other MCPS agents, but only about their n nearest neighbors. Based on this hypothesis, we propose to modify the MDRL method as follows: 1. Perform centralized training not of all MCPS agents, but only of a group of n agents, n < N, where N is the total number of agents.
2. At the functioning stage, supply information only about the n nearest neighbors to the input of the actor ANN.
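A minimal sketch of how an agent's observation could be assembled from its n nearest neighbors under this proposal; the use of Euclidean distance and the function name are illustrative assumptions, not part of the concept's specification:

```python
import numpy as np

def local_observation(positions, states, i, n_neighbors):
    """Build agent i's observation from its own state plus the states
    of its n nearest neighbors (by Euclidean distance), so the actor
    ANN input size stays fixed as the total number of agents grows."""
    d = np.linalg.norm(positions - positions[i], axis=1)
    d[i] = np.inf                         # exclude the agent itself
    nearest = np.argsort(d)[:n_neighbors]
    return np.concatenate([states[i]] + [states[j] for j in nearest])
```

Because the observation length depends only on n and not on N, the same actor ANN can be reused without retraining when agents join or leave the MCPS.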
The success of sparse ANNs in reducing the computational complexity of inference in visual information processing on mobile devices motivates studying the applicability of these solutions to MDRL. Their essence is the use of sparse ANNs combined with periodic optimization of their structure. Based on this, we propose to modify the MDRL process as follows: 1. Use a sparse structure for the actor ANN.
2. At the agent control policy optimization stage, optimize not only the weights but also the structure of the actor ANN.
The low functional safety of MDRL methods is due to the following factors. 1. ANN optimization is carried out on data sets drawn from the training sample. The training sample is generated by placing the agents in some initial state and letting the system function from there. In existing methods, the initial state is chosen with a uniform probability distribution over all possible states. An illustration of this approach for an MCPS with one agent is shown in Figure 3, a. The agent's state, marked with a cross, is illustrated by its position on a two-dimensional plane. The state corresponding to the purpose of the MCPS is indicated by a white circle; a dangerous state, which it is undesirable to reach, is indicated by a black circle. The density of the crosses reflects the probability density of the agent starting an episode at a given point. In Figure 3, a, this density is uniform; in Figure 3, b, it is higher near the dangerous state. When the initial state is generated with a uniform distribution, the probability of the agent entering a dangerous state during training is relatively small; consequently, the experience of functioning in states close to dangerous ones is also relatively small. To solve this problem, we propose to generate the initial state with a non-uniform probability density that is higher in states close to dangerous ones (Fig. 3, b). This is expected to enrich the MCPS's "experience" of operating in near-dangerous states and thereby increase the functional safety of trained MCPS at the functioning stage.
2. In the MADDPG method, training is considered complete once the average reward settles at some constant value. Instead, we propose to use, as the training completion criterion, a decrease in the probability of dangerous situations below a certain threshold.
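The proposed non-uniform generation of initial states can be sketched, for the single-agent planar example of Figure 3, as a mixture of a uniform distribution over the workspace and a Gaussian centered on the dangerous state. The mixture weight `p_near` and the spread `sigma` are illustrative assumptions, not values fixed by the concept:

```python
import numpy as np

def sample_initial_state(danger_pos, bounds, p_near=0.5, sigma=0.3,
                         rng=None):
    """Draw an episode's initial agent position.

    With probability p_near, sample from a Gaussian centred on the
    dangerous state (clipped to the workspace); otherwise sample
    uniformly. This skews the training sample toward near-dangerous
    states, as in Figure 3, b.
    """
    if rng is None:
        rng = np.random.default_rng()
    lo, hi = bounds
    if rng.random() < p_near:
        pos = rng.normal(danger_pos, sigma)
        return np.clip(pos, lo, hi)
    return rng.uniform(lo, hi, size=len(danger_pos))
```

The same mixture idea extends to multi-agent MCPS by drawing each agent's initial position from the biased distribution independently.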

Conclusion
The use of MDRL methods in MCPS runs into a number of MCPS-specific problems: MDRL methods scale poorly; ANN inference has high computational complexity; and trained MCPS have low functional safety.
Based on an analysis of publications on MDRL and adjacent areas, in this work we proposed a new MDRL concept comprising the following ideas: 1. To increase the scalability of MDRL, it is proposed to supply to the actor ANN input information not about all other MCPS agents, but only about the n nearest neighbors.
2. To reduce the computational complexity of ANN inference, it is proposed to use sparse ANN structures and to optimize them at the agent control policy optimization stage.
3. To improve functional safety, it is proposed to use a training sample whose initial-state probability density is non-uniform and higher in states close to dangerous ones.
It is expected that the proposed solutions will address the listed problems of applying MDRL to MCPS.