Multi-Defender Strategic Filtering Against Multi-Agent Cyber Epidemics on a Multi-Environment Model for Smart Grid Protection

The growing cyber space accompanying developments in cyber network technologies in smart grid (SG) systems has made it necessary to question the reliability of networks and to take precautions against possible cyber threats. For this reason, defensive strategies and approaches against cyber attacks must be improved to sustain a secure information flow over the network connections used in electricity generation, transmission, distribution, and consumption. This paper proposes a multi-agent multi-environment deep reinforcement learning (MM-DRL) based defender response against cyber epidemics consisting of coordinated cyber-attacks (multi-CAs) within the same time frame, to sustain security for SG networks. In this regard, the PMU-connected 123-bus system is formulated as a Markov game, and the MM-DRL approach is implemented for the sub-environments of a typical SG system. The multi-CAs game aims to coordinate PMU signals across intersections to improve the network efficiency of an SG. DRL has recently been applied to data control and has demonstrated promising performance when each data signal is regarded as an agent. Conversely, multi-CAs are self-renewing, emerging causative agents of electricity theft, network disturbances, and data manipulation in SG systems, characterized by wide diversity and rapid evolution. The game results show that the presented request response algorithm is able to minimize system attack damage and maintain protection duties when compared to a benchmark without request response. In addition, the performance of the MM-DRL approach is examined against other developed methods.


Introduction
Ensuring safe and efficient energy generation, transmission, and distribution plays an important role in the continuity of the development of countries. To this end, existing electricity grids are transforming into modern electricity grids based on smart systems and renewable resources. In this context, new improvements in electrical networks based on renewable energies offer advantages in operational reliability and sustainability [1]. From this perspective, the effective operation of SG systems depends on delivering the required energy production, controlling both supply and demand, and contributing to the transmission and distribution layers of the electricity grid. To solve these problems, the SG has been proposed, combining self-monitoring, self-healing, pervasive control, and adaptive and island-mode mechanisms to allow the energy transition from generation point to consumption. To achieve these goals, communication layers have been developed and incorporated into grid-connected systems. In effect, SG systems involve massive use of Internet of Things technology, where smart devices are placed along the energy path from generation to the end user [2]. An SG system includes multiple devices such as substations, power plants, and advanced metering infrastructure, as well as complex network connections communicating with sensor readings, distributed power generation, and control units [3]. It allows the devices in the SG system to be automatically monitored, protected, and optimized. To develop an effective protection system against the threats arising from the cyber layers of smart networks, the root causes of security and privacy problems and various detection methods are investigated [4].
Protecting SG systems against line outages, false data injection attacks (FDIA), denial of service (DoS), distributed denial of service (DDoS), and spoofing attacks requires detailed study and system protection. Ding et al. [5] proposed a new probabilistic model to solve this problem. The study focused on data integrity attacks, including replay attacks, DoS attacks, and FDIAs. In [6], an integrated data-driven defense plan aiming to reduce the impact of the aforementioned attacks and to ensure the security of the transmission phase is introduced for the secure transmission of data.
In addition to cyber security, developing methods based on artificial intelligence makes it possible to reduce energy consumption and optimize energy use in buildings [7]. Deep learning is a growing research topic; while addressing some of the limits of advanced control techniques such as predictive control, it has the potential to improve building performance [8]. As an example of deep learning applications on SG systems, in [9] a hybrid convolutional neural network-long short-term memory approach with particle swarm optimization parameter tuning is proposed to detect attacks and indicate their type. In addition, RL methods can provide strong performance in decision making and system control because of their self-learning ability and lack of need for an environment model [10]-[12]. To validate the learning ability and performance of DRL in finding critical attack paths, scenarios with varying difficulty levels are tested [13]. An RL algorithm typically makes much stronger assumptions about a designed environment, usually a Markov decision process (MDP), and the goal in this setting is the maximization of future reward relative to that environment. An online intrusion detection problem is formulated as a partially observable MDP in [14]. To solve similar data attack problems, effective detection algorithms based on model-free RL frameworks have been developed. Deep Q-Learning (QL) algorithms are among the RL methods used to optimize system operations and analyze attack scenarios. In addition, data flow operations are limited by the power demand of the mains, which reflects the interests of each bus and communication network. Moreover, to detect outliers in the SG network, reduce overfitting, and address unbalanced data problems, an adaptive synthesis method is proposed in [15]. Robust big data analytics techniques have been proposed to detect electricity theft in SGs. Recent studies compiling cyberattacks against Industrial Control Systems have already been conducted; however, the scope and details of the attacks are limited [16]. The effects of an attack on the network and the entire system are evaluated, rather than focusing only on the success of the attacker or on system control. Based on abnormal electricity consumption, a high-throughput detection model built on the DRL method is proposed, extracting features from large amounts of data [18]. In [19,20], machine learning algorithms are used to detect the intruders behind electricity theft cyberattacks. Li, Yan, and Naili [13] suggested that a DRL-based penetration testing framework be used to identify the most suitable attack paths to destabilize the system and to identify critical vulnerabilities in SGs. In addition, RL-based deep learning approaches are proposed to address a wide variety of complicated problems [21], including cyber attack detection, and to sustain the flexibility, robustness, and scalability of the system.
A policy algorithm based on Nash Equilibrium (NE) with best response, which maximizes the long-term payoffs of the MM-DRL game and considers not only the multi-environment setting but also the interaction between agents, is proposed in this paper. The main contributions of this study can be summarized as follows: 1. An improved multi-agent multi-environment Markov game for the SG network is proposed. 2. The proposed method gives a high detection rate, which improves the detection efficiency and reduces the operational cost of the agents. 3. This study provides a methodology for the analysis of bus, line, and switch attacks, including attack trends and key points learned, with suggested actions to improve the security capability of the proposed framework. The remainder of this paper is structured as follows: Section 2 presents the system model, problem formulation, and proposed algorithm. Section 3 gives an analysis of the Markov game based MM-DRL approach. Section 4 concludes the paper and discusses future work.

System Model
The intrusion protection algorithm is proposed in accordance with a prototype of the 123-bus system serving as the SG network. To overcome the aforementioned difficulties, this paper proposes a model-free MM-DRL-based approach that detects attacks without prior knowledge of the dynamic model and the random parameters of the data flow between PMU units. The problem is first reformulated as a Markov game, which is also a framework for multi-CAs. The proposed approach is summarized in Algorithm 1. In the formulated Markov game, the number of agents equals the total number of variables in the problem at each time slot, and in this paper the agents correspond to the attacker and the defender. The system initiates the game in the condition that each bus node and component operates normally in a fully active SG network. Both rivals use active attack/defense schemes and reach their targeted points by learning and constructing specific action sequences from their observations. No first-move advantage is given to either rival in this Markov game. Thus, each player's strategy is determined by the rewards received in individual games, and the players' actions are taken in response to the actions of the opponent.
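As an illustration of this formulation, the attacker-defender interaction can be sketched as a minimal two-player environment. The node set, rewards, and transition rules below are simplified assumptions for exposition, not the paper's exact 123-bus game.

```python
class GridMarkovGame:
    """Sketch of a two-player Markov game over grid nodes; all nodes start active."""

    def __init__(self, n_nodes=5):
        self.n_nodes = n_nodes
        self.reset()

    def reset(self):
        # 1 = node operating normally, 0 = compromised
        self.state = [1] * self.n_nodes
        self.t = 0
        return tuple(self.state)

    def step(self, attack_node, defend_node):
        """Simultaneous moves: attacker targets one node, defender guards one."""
        self.t += 1
        if attack_node == defend_node:
            reward_def, reward_att = 1.0, -1.0      # attack blocked
        else:
            self.state[attack_node] = 0             # node compromised
            reward_def, reward_att = -1.0, 1.0
        done = all(s == 0 for s in self.state)      # attacker wins if all nodes fall
        return tuple(self.state), reward_def, reward_att, done
```

In this sketch, as in the game above, neither player has a first-move advantage: both actions are submitted simultaneously at each step.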
The game is completed when the defender manages to protect the targeted areas and can accurately detect attacker activity in the attack areas. Based on the defender's observations, such as transmission line interruptions and data losses and manipulations inferred from the data drawn from the PMU units, the defender adjusts an action strategy taking into account both these observations and the action sequence learned from the attacker. In the post-attack phase, the defender detects the attacker based on a strategy optimized over multiple observations and the detection of successive errors in the system.
Furthermore, the main attack types in this study are explained as follows. Remote tripping command injection: after an intrusion, the attack starts by blocking out the system and sending deceptive commands to selected nodes. Short-circuit fault: a faulty junction that may occur at some points of the lines in the network. Maintenance techniques for power lines involve controlling and fixing any malfunctions or disturbances that may have occurred during system operation. Data injection: a type of data attack orchestrated when an intruder alters or modifies the measurements from PMUs, misleading the computational capability of the control center.
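For the data injection case, a classical residual check conveys the underlying idea of flagging manipulated PMU measurements. This is an illustrative baseline, not the MM-DRL detector proposed here, and the threshold value is an assumption.

```python
def flag_injected(measurements, expected, threshold=3.0):
    """Flag indices whose absolute residual exceeds an assumed threshold.

    `measurements` are received PMU readings; `expected` are the values
    predicted by the control center's state estimate (both illustrative).
    """
    flags = []
    for i, (z, z_hat) in enumerate(zip(measurements, expected)):
        if abs(z - z_hat) > threshold:
            flags.append(i)     # measurement i is suspected of injection
    return flags
```

A carefully crafted FDIA keeps residuals small, which is precisely why the learning-based game formulation in this paper goes beyond such a static check.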

MDP Formulations
The MDP presented in this study is a 9-tuple that includes, among its elements, the discrete state space S, the discrete action space A, the discount factor γ, the belief space B, and the agents' reward functions R: S × A → ℝ [22]. The Markov game consists of two players, an attacker and a detector, the latter being responsible for detecting the presence and traces of the attacker in the system. The history h_t gives a way to follow the true action path with the belief space B. For instance, the defender holds a belief about the state s_t, represented by the belief state b_t(s_t) = P[s_t | h_t] ∈ B. Players can perform a limited number of actions. Perfect recall is the ability to condition one's actions on the full history, specifying a behavioral strategy in which the probability of performing an action depends on the history h_t = {s_0, a_0, o_0, s_1, a_1, …, a_{t-1}, s_t, o_t} of the states and action profiles at time step t. The belief state is defined over the binary triples (000, 001, …, 111), so the model's belief space is denoted B = [000; 111]. The transition probability function for t > 1 consists of the probabilities of the recurrent state transitions 001 → 1 and 000 → 0 for (h, a) ∈ {(0,0,0), (1,1,1)}. The attacker initiates actions at four times: the first to initiate the attack, the second for a phase in which the defender intervened unsuccessfully, the third to continue the attack, and the fourth to stop the intrusion. The defender can activate the stop action L ≥ 1 times in each game episode. Each stop by the defender can be interpreted as a defensive action against a possible attack, and L is an important parameter for the defender, since the probability of intrusion prevention increases with each stop action taken by defender agents. The number of stops remaining for either player is denoted l_t ∈ {1, …, L}. For instance, the process remains in its state with probability 1 − p when attacker agents choose the continue action while stops remain (l_t > 0).
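The defender's belief b_t(s_t) = P[s_t | h_t] over the binary state triples can be maintained with a one-step Bayes update. The observation model below is a stand-in assumption, since the formulation above does not specify one.

```python
from itertools import product

# The eight binary state triples '000' .. '111' used by the belief space B.
STATES = [''.join(bits) for bits in product('01', repeat=3)]

def obs_likelihood(state, observation):
    """Assumed sensor model: an 'alarm' is more likely the more bits are set."""
    compromised = state.count('1')
    p_alarm = 0.1 + 0.25 * compromised
    return p_alarm if observation == 'alarm' else 1.0 - p_alarm

def belief_update(belief, observation):
    """One Bayes step: weight each state's prior by the observation likelihood."""
    posterior = [b * obs_likelihood(s, observation) for s, b in zip(STATES, belief)]
    z = sum(posterior)
    return [p / z for p in posterior]   # renormalize so the belief sums to 1
```

Starting from a uniform belief, an alarm observation shifts probability mass toward states with more compromised components, which is the mechanism the defender's history-based strategy relies on.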

Initiate
For condition 1: defender agents receive a reward for stopping an attack (r_st > 0); that is, the agents obtain the reward R(1, (0, l)) = r_st / l for l ∈ {1, …, L}, which takes lower values in l when a stop action truly affects an ongoing cyber leak. The cost of the defender taking a defensive action (c_d < 0) and its loss while a cyber leak occurs (c_l < 0) also affect the reward function.
For condition 2: R(0, (1, l)) = 0 for l ∈ {1, …, L} indicates that the reward is zero when the game is over and the attack against the system has ended.
For condition 3: R(0, (1, l)) = 0 for l ∈ {1, …, L} means that no cost is incurred when the system is not under attack and the defender does not have to defend it.
For condition 4: if no attack is ongoing, each wrong stop incurs a cost that decreases with the number of stops remaining, and the agent receives the reward R(0, (0, l)) = c_d / l for l ∈ {1, …, L}.
Finally, for condition 5: R(h, (1,1)) = c_l for h in the attack histories indicates that the defender always takes a loss while under attack. The main goal of both players is to maximize the discounted sum of future rewards. π_att and π_def denote the offensive and defensive strategies, respectively. Initially, the attacker and the defender obtain the game values as follows:

J_att(s, π_att, π_def) = Σ_{t=0}^{∞} γ^t E[ r_att,t | π_att, π_def, s_0 = s ] (2)

J_def(s, π_att, π_def) = Σ_{t=0}^{∞} γ^t E[ r_def,t | π_att, π_def, s_0 = s ] (3)

where Π_att(s) and Π_def(s) denote the sets of attacker and defender policies available at state s. In the proposed algorithm with NE, the policy of an agent should be the best response to its rival's policy and the steps taken. When the NE procedure is examined in other game genres, it is seen to have similar characteristics. The two players go through different game stages according to their impressions and the rewards they receive, and play repeatedly until they seize the system. When the players reach the NE point, they have simultaneously achieved the optimal policy. A widely used solution principle to solve the game and reach the optimal strategy for each player is the closed-loop NE [23]. NE in a multi-stage multiplayer dynamic game is achieved by a decision-making process in stages. The QL algorithm is a very effective model-free algorithm, suitable for determining the optimal policy for both rivals at the lowest system-wide computation cost. A NE is a solution concept involving a steady-state condition. Based on the observations obtained from the players, the aim of each rival is to choose its strategy so as to maximize its payoff [24].
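The five reward conditions above can be collected into one defender reward function. The constants r_stop, c_stop, and c_leak below are placeholders for the reward and cost symbols, chosen only for illustration.

```python
def defender_reward(under_attack, defender_stops, stops_left,
                    r_stop=5.0, c_stop=-1.0, c_leak=-5.0):
    """Sketch of the five reward conditions; constants are assumed values."""
    if under_attack and defender_stops:
        return r_stop / max(stops_left, 1)   # condition 1: effective stop, scaled by l
    if not under_attack and not defender_stops:
        return 0.0                           # conditions 2/3: no attack, no action
    if not under_attack and defender_stops:
        return c_stop / max(stops_left, 1)   # condition 4: cost of a wasted stop
    return c_leak                            # condition 5: loss while under attack
```

The 1/l scaling mirrors the r_st / l and c_d / l forms above: rewards and penalties shrink as more stop actions remain available.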
The optimal attacker strategy π*_att is the best response to the strategy that maximizes J_att, and the optimal defender strategy π*_def is the best response to the strategy that minimizes J_def. So, if both rivals reach optimal policies, each uses its best response strategy against the opponent. This is a game in which the policies of the defender and the attacker are interlinked, and with no player changing its strategy, (π*_att, π*_def) is a NE [25]. The main objective of a player during an MDP is to develop an optimal policy π to maximize the expected discounted reward sum:

V(s, π) = E[ Σ_{t=0}^{∞} γ^t r_t | π, s_0 = s ]

where a is the action chosen for each agent based on the current state s, the next state s′, and the policy π. For any s ∈ S and a ∈ A, the optimal policy π* used in this study is expressed via the Bellman equation:

V(s, π*) = max_a [ R(s, a) + γ Σ_{s′} T(s, a, s′) V(s′, π*) ]

where V(s, π*) is defined as the optimal value for any state s. The agent tries to reach the most appropriate policy (π*) from environmental observations when the state transition function and the reward function are known. QL is an efficient detection model used in [26], and the Q function is expressed as:

Q*(s, a) = R(s, a) + γ V(s′, π*) (8)

where Q*(s, a) denotes the total discounted reward obtained by taking action a in state s.
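The Bellman and Q-function relations above translate directly into the tabular Q-learning update; the learning rate and discount below are illustrative values.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """One tabular step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions) if actions else 0.0
    td_target = r + gamma * best_next        # sampled estimate of Q*(s, a)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]
```

With `Q = defaultdict(float)`, unseen state-action pairs start at 0, so the first update after a reward of 1.0 moves Q(s, a) to alpha × 1.0.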

Decentralized QL Convergence
When applying reinforcement learning to cybersecurity, we take into account both physical faults and attacks that occur in ordinary situations. The Q value for the attacker side is defined, and updated, as:

Q_att(s, a, d) ← (1 − α) Q_att(s, a, d) + α [ r_att + γ V*_att(s′) ]

where V*_att(s′) is the expected long-term reward of the attacker under the policies most suitable for state s′, and Q_def(s, a, d) is the expected long-term reward for the defender taking action d against the attacker's action a. The defender updates its Q analogously:

Q_def(s, a, d) ← (1 − α) Q_def(s, a, d) + α [ r_def + γ V*_def(s′) ]

In state s, the expected future reward of the attacker adopting an action against its rival is represented as Q*_att(s, a_att). NE is a game-theoretic concept involving several players, determining the solution of a noncooperative game in which each player's policy depends on the opponent's policy [27]. The corresponding smoothed best response assigns each action a probability that increases with its Q value, so that every action retains positive probability.
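A minimal sketch of the decentralized two-sided update follows, assuming each player keeps a Q-table over joint (attack, defend) actions and evaluates the next state against the opponent's current mixed policy. The data structure and hyperparameters are assumptions, not the paper's exact formulation.

```python
from collections import defaultdict

def joint_q_update(Q, s, a, d, r, s_next, my_actions, opp_policy,
                   alpha=0.1, gamma=0.9, is_attacker=True):
    """Update Q[s][(a, d)] toward r + gamma * (best own action at s_next,
    in expectation over the opponent's mixed policy `opp_policy`)."""
    def value(own_a):
        # Expected joint-action value at s_next under the opponent's mix.
        return sum(p * Q[s_next][(own_a, opp_a) if is_attacker else (opp_a, own_a)]
                   for opp_a, p in opp_policy.items())
    best = max(value(own_a) for own_a in my_actions)
    Q[s][(a, d)] += alpha * (r + gamma * best - Q[s][(a, d)])
    return Q[s][(a, d)]
```

Running one such update per player per step yields the two coupled Q-tables whose fixed point corresponds to the steady-state NE condition described above.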

When player i acts according to the smoothed best response σ_i, any action a_i is performed with some positive probability σ_i(a_i) > 0. Thus each rival can update its belief according to Eq. (15),
where λ ∈ (0,1) is defined as a step size that is not specific to any action. NE in the game can be achieved by finding pairs of strategies that are best responses to each other, an approach also known as Individual QL, and the best responses of the defender and the attacker are obtained from the corresponding best-response equations. The stochastic approximation error term forms a martingale difference sequence with respect to the history of the iterations; it has finite variance because the iterations are bounded. This stochastic approximation term is denoted M_t.
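The smoothed best response, the step-size belief update of Eq. (15), and the ε-greedy exploration schedule described in this section can be sketched as follows; the temperature tau, step size lam, and decay constants are assumed values.

```python
import math

def smoothed_best_response(q_values, tau=1.0):
    """Softmax over action values: every action keeps positive probability."""
    m = max(q_values.values())                      # shift for numerical stability
    exps = {a: math.exp((q - m) / tau) for a, q in q_values.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def belief_step(belief, target, lam=0.1):
    """b <- b + lam * (target - b) per action; the step size lam is in (0, 1)."""
    return {a: belief[a] + lam * (target[a] - belief[a]) for a in belief}

def epsilon_rate(episode, eps0=1.0, decay=0.99, eps_min=0.05):
    """Exploration rate that decays per episode toward an assumed floor."""
    return max(eps_min, eps0 * decay ** episode)
```

Because the softmax never assigns zero probability, every action stays explorable, which is the property σ_i(a_i) > 0 that the belief update relies on.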
Q*_{i,t+1}(s) = Q*_{i,t}(s) + α_t(s) ( v_i(s) − M_t − Q*_{i,t}(s) )

where v_i(s) is player i's smoothed best-response value at state s. To assess the convergence of the set of strategy pairs to a NE, the approximate availability metric δ_t is used [28]; the best response strategy of player i in this game is calculated with the help of dynamic programming, and NE is achieved when δ_t reaches 0. This calculation addresses the problem of partial observability. Based on the QL and PSO-LSTM layers [9], a local agent is also defined, where the hidden state of the LSTM is represented as h. An ε-greedy scheme is adopted as the exploration policy; the exploration rate of episode k decays from the initial rate ε_0 with the per-episode decay factor υ. In this study, the main aim is to find the best fit for the proposed algorithm. Therefore, a loss function that drives the local reward toward the central reward, which is taken as the total player performance, is defined.

Results

The winner of the game (a × d)
As can be seen from Fig. 1, the value function of the defender and attacker for the communication network, distribution, and generation units presents the tracking of the attacker's trace. Attacking agents are positioned to leak data from data paths. The results of the actions are shown in Fig. 3, which gives the number of steps (from the beginning of a game phase) until each attack path is detected. Initial reconnaissance observations are handled in the first stage (due to the initial attacks) for each agent. Some important parameters of the algorithm, such as the learning rate of 0.0001, are presented in Table 1. However, it should be noted that the proposed system and approach controls the movements and achievements of both players. Therefore, if the goal is to strengthen the attacker's side, a stronger algorithm is needed to empower the defender of the system. Future research topics include exploring the ambiguity and constraints of power distribution patterns for each player in relation to the sub-environments. Multiple layers complicate interactions between agents and their neighbours, making it difficult to sustain global interactions between agents trying to infiltrate different layers. The algorithm proposed in this case study has been extended in complexity beyond the domains of the example agents described in [29], especially in terms of the dependence of approximation success on the size of the sub-environments. Another remarkable aspect of the algorithm is its simplicity, as it does not require complex parameter estimation and is optimized with the help of the RL-based multi-agent multi-environment Markov game model. An attacker who has captured a point within a multi-agent system can be identified as an inside agent; identifying optimal responses to this agent is a key factor in SG security. As a result, the value function seen in Fig. 2 is an important step towards resolving vulnerabilities in the cybersecurity layers, making it easier to detect attacking agents. Building on a deep learning-based FDIA detection algorithm that can identify and classify multi-class attacks, the target of the second study is to design and develop an attack defence system. This system uses a structured Phasor Measurement Unit environment enriched with a suitable long-term dataset. To further develop the FDIA detection system towards improved performance and efficiency, different approaches and techniques are considered, and the RL-based Markov game defines three different SG layers in the periphery and evaluates the status of attackers and defenders. In this study, we propose an adaptive security game based on a partially observable Markov game (POMG) combined with RL to intelligently address intrusions within the sub-environments. We present a novel framework for detecting and defending against cyberattacks targeting PMU measurements. The novelty of this study lies in the reformulation of the Markov game specifically to indicate the attacked area. The cyber-physical attack detection studies in the literature mostly consider one layer of SG systems, which does not reflect the real problems and structure of the whole protection scheme. In this case study, unlike other studies, three different layers are handled discretely and with a holistic approach. The attackers manipulate data during SG operation to obtain the information necessary to restrict the network, and the operator responds to physical adversities such as multiple attacks and regular line outages, which emphasizes the role of FDIAs as cyber-attacks in the physical layers of the SG system, as in [30]. The framework effectively works as a fault cause detection mechanism. This case study includes simulation-based studies, such as securing the communication channels between PMUs, PDCs, and the control center. In addition to the recent studies in the literature, the comprehensive structure of this article instantly detects faulty data in the PMU, PDC, and bus layers, most of which stems from physical disruptions and interruptions, with the help of environmental observation of the traces of the attacker's movements. As discussed in [31,32], recent works in this area focus on similar parameters, such as the reward in multi-agent RL algorithms. Each cooperative and competitive action is determined based on the multi-agent reward functions and the rivals' sub-environment observations, taking into account the accuracy of the decision and the size of the action space. Attacker agents have a wider range of actions than defenders, especially given the large number of target points and the data manipulations required to capture those points. A detailed comparison of the proposed approach with other FDIA detection studies, based on performance metrics, is presented in Table 2. The proposed approach has superior results for attack detection, and it differs through a novel algorithm based on the Markov game.
The studies in the table show greater than 90% accuracy, but other metrics also play key roles in determining the best option for researchers and decision-makers in this field. In addition, there are rather important concerns when selecting better models. The first concern is that each study focuses on specific IEEE bus systems, such as the 14-, 118-, and 123-bus systems, or on a critical part of SG systems. According to the performance results of these studies, the study in [34] has the closest result to the proposed approach. Ensemble learning techniques are examined there for addressing DDoS attacks [34]. This can be seen by comparing the studies based on DDoS attacks [34] with the cyber attacks combined with physical attacks such as remote tripping and short-circuit faults considered here. In the other study, the proposed model is tested on systems with 14 and 118 buses. For a better comparison, the models should be tested on the same systems and attack vectors. However, there is a second important concern for conducting an effective and valid comparison. Most of the studies in the literature focus on some specific performance evaluation metrics. To compare proposed models for FDIA detection, the same metrics should be used.

Conclusion
Cybersecurity protection components are important for detecting failures in system states and for detecting situations outside normal operating conditions. With the results obtained by integrating the robust performance of MM-DRL, this algorithm was added to the multi-environment system, and a faulty data detection model was developed that increases the security of the cyber and physical layers of SG systems. As a Markov game, defensive and offensive strategies are constantly developed by optimizing the rewards RL agents receive in each situation. One of the key aspects of the proposed approach is that both the defender and the aggressor can make progress and develop their strategies without any outside interference. This autonomous learning process allows them to adapt effectively and respond instantly to changing situations and attack scenarios.
In order to ensure that the proposed attacker-defender game strategy achieves optimal performance, a comprehensive comparison is made against comprehensive detection in small-scale scenarios. In addition, large-scale situations call for significant performance improvements over existing approaches. To achieve these goals and ensure the effectiveness of the proposed strategy, the studies on the construction of the system are shown over game stages measured in ms, in which the defender mainly determines the position of the attacker, as evaluated in the results section.

Fig. 1 .
Fig. 1. Probability of the stop action under the NE strategies as a function of time (ms).

Fig. 3 .
Fig. 3. Overall rewards of the defender and attacker per second during one episode.
This study is supported by the Scientific Research Projects Commission of Eskişehir Technical University under the grant of 22DRP192 and the Scientific and Technological Research Council of Turkey (TÜBİTAK) under the grant of 122E414.

Best Response Policy Algorithm based on NE and Centralized QL Convergence
The learner (aggressor) manages the learning procedure, while the opponent (defensive player) acts with an optimized strategy. In addition, the policy of the defender is optimized in line with the progressive actions of the attacker that result in intrusion.

Table 1 .
Experimental setup of Markov game algorithm.

Table 2 .
Relative performance analysis of cyber-physical attack detection studies.