Improving reliability of state estimation programming and computing suite based on analyzing a fault tree

. Reliable information on the current state parameters obtained as a result of processing the measurements from systems of the SCADA and WAMS data acquisition and processing through methods of state estimation (SE) is a condition that enables to successfully manage an energy power system (EPS). SCADA and WAMS systems themselves, as any technical systems, are subject to failures and faults that lead to distortion and loss of information. The SE procedure enables to find erroneous measurements, therefore, it is a barrier for the distorted information to penetrate into control problems. At the same time, the programming and computing suite (PCS) implementing the SE functions may itself provide a wrong decision due to imperfection of the software algorithms and errors. In this study, we propose to use a fault tree to analyze consequences of failures and faults in SCADA and WAMS and in the very SE procedure. Based on the analysis of the obtained measurement information and on the SE results, we determine the state estimation PCS fault tolerance level featuring its reliability.


Introduction
At a supervisory control of a energy power system (EPS), the procedure of the state estimation (SE) is a basic program tool that provides the tasks of operational and anti-emergency control with reliable and quality data on the EPS current state.This information is based on telemeasurements (TMs), telesignals (TSs), and phasor measurements (PMs).The SE procedure enables to detect wrong measurements, and plays a role of a barrier for the distorted (by various reasons) information to penetrate into the EPS control subsystem.At the same time, the SE software itself is subject to technical failures, and the supervisor, making decisions, should be confident in correctness of an SE-obtained decision.In this report, we propose to use a fault tree to analyze the consequences of technical failures and faults in the data acquisition and data processing systems (SCADA and WAMS) and in the SE software.Based on analyzing the measurement information and on the SE results, we determine the reliability level for the programming and computing suite (PCS) that estimates the state, and propose some measures to increase reliability.

Energy Power System (EPS) State Estimation (SE) Programming and Computing Suite (PCS) computing environment
The EPS SE PCS is intended to obtain the EPS current state model from telemetery and telesignals arriving from SCADA and phasor measurements from WAMS.

Supervisory Control and Data Acquisition (SCADA) fault tolerance
A SCADA system includes: remote telemetry units (RTUs) installed at EPS substations to take telesignals on the switching equipment state and measurements of the state parameters, communication channels, database (DB), systems of on-line display of the state parameters, as well as the software (EMS-application suite) to process the measurement results and to form control commands for dispatching management objects.Figure 1 presents the structure of a SCADA installed at the control center of a regional network company.exchange requests between each other.For each server, provided is its own system of data transmission (DTS) and of data storage (DSS).Herewith, DSS 1 and DSS 2 constantly synchronize their data.Once 30 minutes, the data are transmitted to the historic server, at which continuous reservation is provided.
A response absence from Principal Server 1 is considered as a fault, the system switches to Standby Server 2. This occurs at minimal delays not noticeable for the user.
There are two types of failures possible: 1) hardware failures, and 2) software failures.After a failure, there is software restart at the Principal Server (in case, it was a program failure), or switch to the doubling fitments that is in the server hot reserve.Client requests go to the Principal Server, and, in case of its fault, they are redirected to the Standby Server with no request latency change.
In case of the DSS1 failure, the Principal Server switches to DSS2 with minimal delays.In case of the DTS1 fault, the incoming data stream is directed to DTS2.
Like practice shows, failures seldom arise in such systems.Software failures occur once a month, on average, hardware failures occur once a year.

Wide-Area Measurement System (WAMS) Data Acquisition (DA) Automatic System (AS) fault tolerance [1]
To solve the tasks in acquiring and storing the WAMS information, in 2009-2011, a WAMS Data Acquisition Automatic System (WAMS DA AS), whose structure is shown in Figure 2, was created and put into industrial operation.

Fig. 2. Structure of WAMS Data Acquisition Automatic System [1]
The lowest hardware, PMU, transmits phasor measurements to the system under Protocol C37.118-2008/2011 into phasor data concentrators (PDCs) for further use in calculation tasks.PMD are relayed onto a higher level of dispatching control, into the super-PDC corresponding to the control hierarchy.This architecture is simple, reliable, and perfectly suitable to solve problems in the absence of restrictions for the computing and telecommunication infrastructure.The measurements are kept in the database of own design [1].
The system servers are united in a cluster operating synchronously: they interact with each other and exchange the information with data sources and clients.The databases included in the cluster are synchronized among themselves.The fault tolerance means within the system allow to create clusters of 2 and more servers.The storage is scaled over almost unlimited number of servers (up to 65535).Failure of any two servers will not lead to information loss.On the one hand, the DA AS architecture is hierarchical, because the servers are at all the levels of the dispatcher control; on the other hand, used is the cloudy technology, when the places of data storage and their traffic routes are not anchored rigidly.This increases the DA AS fault tolerance.

Block diagram of SE PCS for the reliability analysis
For a PCS to function necessary are measurements, EPS network, and software including the following algorithms: to form the current (operational) network based on the basic scheme from telesignals; to analyze the observability of the network; to identify and detect gross errors in PMD and telemetry; • to filter random telemetry errors, i.e., to receive their estimates and to finally calculate non-measured parameters.
For the SE PCS to operate under the WAMS conditions, used is a traditional algorithm of the linear state estimation through the state vector in rectangular coordinates [2].Also, solving linear SE is successfully realized via the test equation method developed at the Melentiev Energy Systems Institute of the Siberian Branch of the Russian Academy of Sciences [3,4] that has a number of advantages over the traditional nonlinear approach [5], the main of them being a possibility to a-priori detect bad data (BD).
In [6], to analyse the software reliability, there are the factors leading to software faults: errors in the program, use of non-optimal and imperfect algorithms (for example, heuristics use), restriction for real-time functioning (the system state changes faster, than the computing cycle lasts).An interaction of several factors and hardware problems in the computing system may also lead to software faults.
The software reliability degree features the probability of its fault-free operation over a certain time interval.In [6], the following is proposed as reliability indicators: the mean number of correctly solved tasks over a certain time interval Δt1, the mean number of errors for that interval, the probability of solving the set number of tasks for the time interval Δt2, the probability of emergence of the set number (probably, limit, being a fault fact) errors for that interval, etc.Such indicators may be used for any software.Conceiving the SE PCS object domain well, we will try to develop other reliability indicators that account for the peculiarities of the solved tasks and feature the fault tolerance of this E3S Web of Conferences 25, 02006 (2017) very software.
Let us start with presenting the SE PCS, a program tool, as a technical system that is to operate safely and qualitatively.SE PCS will operate correctly, as long as all three elements are operable: measurements, the network, and SE algorithms.For the initial analysis, we make the block diagram (Fig. 3).

Fig. 3. SE PCS block diagram
From Figure 3, one can see that the bulk of the PCS components is reserved: Measurements: when no PMs, then the SCADA telemetry is used; if there are no both PM frame and SCADA snapshot, PCS can operate with archival snapshot; besides, the incoming measurements are recorded in the real-time DB (RT DB) that is also reserved; Network: the data on the diagram are stored at the DB server of constant information; at computer centers of major power facilities, a standby server and a standby DB are provided; Algorithms: algorithms for the observability analysis, for the measurement validation, and for the state estimation are reserved by alternate blocks.The SE PCS produces the fastest solution, when the PMU LSE algorithm functions.Therefore, the Observability Analysis (OA) program should reply, whether the scheme plan is PMU-observable.If the answer is negative and LSE is impossible to run, OA is performed by a set of SCADA measurements (here is the example of the OA algorithm redundancy) for non-linear SE (here is the example of the SE redundancy).For the algorithm of the a-priori validation by control equations, there is a redundancy in the form of a-posteriori validation by the state estimation remainders.
However, when analysing the reliability of complex systems, it is necessary to find, which of the elements are critical, whose serious faults affect the system operability to a greater extent, in general.Typical criticality indicators [6] are the fault probability, the severity of consequences, the element tolerance to malicious activities, the risk value due to a fault, the possibility of fault localization, the controllability of the element state during the operation, reserving, etc. Ranging the elements by the criticality degree is possible at different levels of structuring the system objects.Critical elements may be visually provided by the Fault Tree technology.

Fault tree technology
For the first time, the term "fault tree" in Russian literature was mentioned in Yu.Guk's book [7].Known since 1960s, the Fault Tree Analysis technology applied by expert systems in military aviation, then in nuclear power, and in some other industries [8], appeared a convenient means to analyse the operation capacity (fault tolerance) of any technical system or its separate complex nods.
The fault tree is presented in the form of a hierarchical structure: top level -tree root -is the addressed technical system; second level is the system indicators featuring this system; third level -system elements -is the details of system indicators; fourth level -tree leaves -is the events leading to a fault of the system operability (technological problems); lowest level is the measures to suppress the fault causes.
Figure 4 presents the SE PCS fault tree to analyse the reliability of its operation.
At the top level, there is PSC itself.The system indicators (measurements, network, algorithms) is the second level.Those indicators contain basic elements (measurements types, databases, procedures etc.).Basic elements are exposed to these or those failures or technical faults which are on the forth level of Fault Tree.At the tree lowest level, there are counter-measures written in italics.The set of counter-measures enables to calculate the indicators for the EPS SE PCS algorithm operation efficiency and fault tolerance.
The figure 4 shows the block analyser, the program determining the most vulnerable PCS spots in terms of fault tolerance.This determination is based on the statistic block stored by the calculated indicators for a certain period of time (both blocks in bold).

Fig. 4. EPS SE PCS Fault Tree
As compared with the known term "decision tree" [9], in a fault tree, the total of possible options to solve a problem decreases up to the number of options obviously threatening the operability of the addressed technical system.In fact, the fault tree is a diagnostic method offering a way out of a specific problem situation.

Fault probabilities for the Fault Tree elements
The analysing program is run several times per day after a certain time, in which N runs of the SE PCS are executed.The probability for the PCS fault tolerance is calculated like The PCS operation fault is non-entry of measurements, or the absence of the network, or the software crash, i.e., where meas Q is the fault probability of the suite measuring part, network Q is the fault of all components that are responsible for the assembly of the current network, lg a Q is the fault of the suite computing part, i.e., non-entry of the calculation results.In turn, means the total absence of measurements at the PCS input.
Equivalent scheme formation is switching of the scheme elements, which is imposing the telesignals on the basic network (scheme), combining the adjacent nods into one nod at the switched-on bus-tie switch, creating the common equivalent line for several parallel lines.Failure of forming the network at the PCS run readiness is possible at the errors downput in the algorithms for equalizing the network elements.The procedure provides an emergency shutdown, indicator of non-operability of forming the network: Procedure for the observability analysis (OA) initially was in checking the correspondence of the number and structure of measurements in the SCADAcreated snapshots to the network graph.To improve the network observability quality, a special algorithm for telemetry allocation enabled to indicate the spots poorly equipped with sensors on the network, provided recommendations on reserving sensors and measuring channels [10].At present, along with telemetry, PMs are also used in EPS; therefore, now, developed are the algorithms for the WAMS sensor allocation accounting for possible shutdowns of individual connections, failures of separate sensors, loss of separate measurements [11].
OA enables to detect the observable and unobservable fragments in the network.It is known that, by the tension vector measurements in a node, and by the current vector measurements in the offline, one can obtain a PM calculated value in a node on other end of this line.Such sensor location on the network (area), E3S Web of Conferences 25, 02006 (2017) when all the absent nodal measurements may be calculated by the available PMs, is called "nonobservability depth equal to 0" [12].If the available real and calculated PMs do not provide a completely PMobservable network(area), we consider the occurrence of the network(area), "unobservability depth" equal to 1 and more (which implies a insufficiency for the linear SE operation) as an OA procedure failure for the linear SE (from the WAMS data).
If there is no possibility to perform the linear SE, the procedure of nonlinear SE is run from the SCADA and WAMS data.An OA procedure failure for the nonlinear SE is practically impossible.
The procedure of the a-priori validation based on the test equation method is performed before the SE procedure run, and shows: the number of reliable and non-reliable measurements that arrived at the PCS input, the number of critical measurements, whose exclusion leads to presence of non-observable parameters whose errors cannot be detected; and the groups of doubtful measurements, where it is impossible to detect, which among them are erroneous.The measurements with the detected gross errors (bad data) are replaced with specified ones during the algorithm operation; thus, their dispersion values increase, showing a trust reduction concerning such measurements.It is much worse, if, among the measurements, there are critical ones that are not a part of test equations, and the doubtful ones, whose quality, at the given test equation set, is impossible to check.Therefore, we will consider a high percent of the doubtful and critical measurements present in the incoming snapshot a failure of the a-priori validation procedure: Herewith, the PCS operation does not stop, because, further, the algorithms for the robust state estimation and for the a-posteriori validation are started.A correspondence of the SCADA snapshot time tags to those of WAMS is a strict requirement for the validation procedure.
The linear SE (LSE) procedure is solved noniteratively, an indispensable condition for its start is the observability of the entire equivalent scheme through PMs.Therefore, the linear SE algorithm fault is: In our Fault Tree, the algorithms for non-linear SE and for the a-posteriori validation are presented as the counter-measures against the PCS operability faults (see Figure 4).Therefore, we do not address their fault probability, but we surely account for the quality of their operation.

Case study
It is needed to calculate the SE PCS fault tree basic element fault probability values: 1.The analyser is run 4 times per day, 1 time every 6 hours.
2. Software failures occur, on average, once a month, hardware failures occur ones a year The elements, whose operability may be improved through reservation, are presented with the i n indicator: In Table 1 One should note that the system element redundancy is not infinite, it should be guided by common sense: reservation of a system technical element is more preferable to implement through a functionally similar one, but made by another manufacturer to avoid possible recurrings of the discovered or (which is much more dangerous) undiscovered defects.Algorithm redundancy should be performed by producing software written in such programming languages that do not enable to introduce malicious distortions into the program code during a cyberattack when operating the SE PCS.

Conclusions
There are universal indicators including those created based on the waiting theory for software reliability that enable to reveal the level of its working capacity in terms of functionality and delivery of result.Application packages, standard algorithm libraries, operating systems belong to this software type.SE PCS is specialized software, and it is intended to provide quality results (based on a deep analysis of the input data) affecting the control of a technological process.It is extremely important to know, what the SE PCS behaviour under real conditions is.Therefore, it is extremely important to develop such indicators for software fault tolerance that would enable to protect weak spots of program blocks, and, thereby, to increase the PCS operability in any hostile environment.

Tolerance of the State Estimation Programming and Computing Suite Fault Tolerance of the State Estimation Programming and Computing Suite SE Algorithms by the TE Method SE Algorithms by the TE Method
Conferences 25,02006 (2017)

Table 1 .
, compared are the SE PCS fault tree basic element fault probability values and the PCS operability probability summary values.The calculation results corroborate that reserving individual elements demonstrates the PCS fault tolerance increase.E3S Web of Conferences 25, 02006 (2017) Fault probabilities for the Fault Tree elements