Log mining and knowledge-based models in data storage systems diagnostics

Modern data storage systems have a sophisticated hardware and software architecture that includes multiple storage processors, storage fabrics, network equipment and storage media, and they contain information that can be damaged or lost because of hardware or software faults. The approach to storage software diagnostics presented in this paper combines log mining algorithms for fault detection, based on natural language processing (NLP) text classification methods, with a diagnostic model used for the task of fault source detection. Currently existing approaches to computational system diagnostics either ignore system and event log data, using only numeric monitoring parameters, target only certain log types, or use logs to build chains of structured events. The main advantage of using NLP text classification for logs is that no information about log message structure, log message source or log purpose is required, provided there is enough data for training the classifier model. The developed diagnostic procedure has an accuracy score comparable with existing methods and can target all faults presented in the training set without prior analysis of the log structure.


Introduction
Diagnostics of data storage systems is a sophisticated and important task, since such equipment is now widely used for storing sensitive data, including data related to environment monitoring, maps and satellite photos of the Earth's surface, ecological monitoring, etc. Damage or loss of such data can be both expensive and harmful for education and research activities. Modern data storage systems consist of different subsystems and components, such as storage processors, storage fabrics, network equipment and storage media. Fault detection in such systems means not only detecting problems in separate components, but also detecting problems in component interaction. Although diagnostic approaches, data-driven or model-based, that use various sources of monitoring data are fairly widespread [1][2][3], there is a reliable alternative source of diagnostic information that is not so actively used: software logs. Logs generated by storage system components contain a lot of specific information about system behaviour and component interaction [4].
There is a number of recent studies in the field of computing system diagnostics on possible means of extracting diagnostic information from software logs [5][6][7], but they are usually based on transforming log plain text into chains of structured events. This approach has several major drawbacks:
- For each log used in the diagnostic procedure, the log message structure should be painstakingly analysed in order to extract useful message templates.
- For each message template, a separate parsing procedure should be developed.
- To find a relation between a chain of messages and a fault event, the diagnostic procedure developer should have extensive prior knowledge about the fault root cause and the log message purpose and source.
Log template parsing algorithms [8][9][10][11] partially solve the first problem, but the other two are hard to tackle. An alternative approach suggests using text classification algorithms for error labelling of log data [12][13], but the presented algorithms have some serious drawbacks:
- only logs in a certain format can be processed;
- log analysis is performed not at run time but as part of a fault investigation, so the whole log text is used for classification purposes;
- the log analysing procedure takes only a single log message for parsing and fault detection [13].

Methods
This work presents a new approach to storage system technical diagnostics based on the fusion of model-based and data-driven approaches. The implementation of this approach uses the storage system software event and system logs as a source of diagnostic data, together with expert knowledge of the storage system hardware and software architecture, to diagnose faults and prevent data loss and performance degradation.

Ontological diagnostic model
The approach presented in this paper uses ontological modelling as a method for the formal definition of expert knowledge about the storage system. The diagnostic model contains:
- the storage system health model, including dependencies between the health states of storage system subsystems and components;
- the hierarchical structure of storage system components;
- relations between storage system parameters and states;
- other classes and properties, used e.g. for log preprocessing.
The ontological model consists of ontological classes for the storage components, states, parameters, etc.; object properties for defining the relations between classes; and data properties for describing parameter values.
Because of the fuzzy nature of the relations between monitoring data and the health states of sophisticated hardware and software systems such as data storage systems, this paper presents a method for defining implicit links between parameter and failure classes of the ontological model (figure 1). An implicit link between a parameter value vector {Pi} and possible component failure types {Yk} consists of the ontological object properties «described_by» and «solves_with» and an ontological implicit relation class, which contains the external programming means for state diagnosis.

Fig.1. Implicit relation
In the case of storage processor logs as a source of diagnostic parameters, such an implicit relation class contains a text classification model [14][15] pre-trained on the training set.
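The implicit relation pattern described above can be sketched as a small data structure. This is a minimal illustrative sketch, not the authors' implementation: the class names, the threshold rule standing in for a pre-trained classifier, and the parameter names are all invented for the example.

```python
# Hypothetical sketch of an "implicit relation" between ontology classes:
# the relation binds a parameter vector {Pi} to failure types {Yk} via an
# external callable, mirroring the described_by / solves_with pattern.
# A trivial threshold rule stands in for a pre-trained classifier.

class OntClass:
    """Minimal stand-in for an ontological class (component, state, ...)."""
    def __init__(self, name):
        self.name = name

class ImplicitRelation:
    """Links parameter classes {Pi} to failure classes {Yk} through an
    external diagnostic procedure ("external programming means")."""
    def __init__(self, described_by, solves_with, classifier):
        self.described_by = described_by   # parameter classes {Pi}
        self.solves_with = solves_with     # failure classes {Yk}
        self.classifier = classifier       # external diagnostic callable

    def diagnose(self, values):
        return self.classifier(values)

# Toy external procedure: a threshold rule over normalised parameter values.
def threshold_rule(values):
    return "disk_failure" if max(values) > 0.9 else "healthy"

relation = ImplicitRelation(
    described_by=[OntClass("smart_attr_5"), OntClass("io_latency")],
    solves_with=[OntClass("disk_failure"), OntClass("healthy")],
    classifier=threshold_rule,
)

print(relation.diagnose([0.2, 0.95]))  # -> disk_failure
```

In the paper's setting the callable would be the pre-trained text classification model rather than a threshold rule; the structure of the link is the same.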

Implementing a diagnostic procedure based on the ontological model
The presented diagnostic procedure can be easily adapted to any storage system with a similar structure by redefining the rules for topology and diagnostic data extraction. Redevelopment of the model structure is not required. The current implementation of the diagnostic procedure uses system and event logs located on the storage processors as a data source for the diagnostic parameters.
In the first step, the ontology model is created from expert data in an external ontology editor, such as the Stanford Protégé [16] software.
In the second step, the diagnostic service generates a simplified graph form of the model. The service creates vertices and edges of a directed graph from the classes, data and object properties of the ontological model, and their instances from the current storage system topology-providing service, according to the ontological model class and property definitions.
After that, the diagnostic service stores the generated model in a graph database [17][18] and queries it in the diagnostic parameter processing loop.
For the simplified graph representation of the ontological model, the N-Quads format is used, where every two vertices are connected with an edge that can have a weight object containing some kind of additional information, such as the default value of a diagnostic parameter.
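The serialisation step above can be illustrated with a short sketch. This is not the authors' exact schema: the predicate names and the use of the fourth quad position as a weight/context object are assumptions made for the example.

```python
# Illustrative sketch: serialising edges of the simplified model graph
# into N-Quads-style lines. The fourth element is used here as a
# weight object carrying e.g. a parameter's default value.

def to_nquad(subject, predicate, obj, weight=None):
    """Render one edge of the simplified model graph as an N-Quads line."""
    parts = [f"<{subject}>", f"<{predicate}>", f"<{obj}>"]
    if weight is not None:
        parts.append(f'"{weight}"')
    return " ".join(parts) + " ."

# Hypothetical edges: a component's health state and a parameter link.
edges = [
    ("StorageProcessor_1", "has_state", "Healthy", None),
    ("StorageProcessor_1", "described_by", "cpu_load", "default=0.0"),
]

for s, p, o, w in edges:
    print(to_nquad(s, p, o, w))
```

Lines in this form can be bulk-loaded into a quad-capable graph store and queried in the diagnostic loop.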

Storage system event logs as a source of diagnostic data
An enterprise-level unified storage system usually contains a set of storage processors used to control access from the external network to the actual disk drives. The storage processor software consists of a number of services, each of which generates one or more logs used to store information about service events, including descriptions of the faults and errors that occurred at service runtime. The huge size of the generated data usually makes log analysis impossible without some sort of preprocessing.
The approach presented in this paper applies natural language processing methods to log mining in order to label log fragments as related to different system faults.

Storage system log types
Since the storage processors in the analysed storage system are Linux-based, software logs are mostly located in the "/var/log/" system folder. All software logs can be classified according to the software purpose and the type of logging data.
OS logs usually contain messages about system activity, such as authentication, system daemon management, system error events, etc.: e.g. the authorization, daemon and debug logs on Ubuntu; the warn, messages and lastlog logs on OpenSUSE and SUSE Linux Enterprise Server. Table 1 contains some of the most common OS log types. Some applications, such as web servers or clustering software, also create their own logs in the common log folder; each application can have one or more dedicated logs in a corresponding folder inside "/var/log/". In addition, there is a separate group of logs, called "non-human-readable logs", that can be processed only with the help of special software utilities.
The ontological diagnostic model may contain all log types that are human-readable and whose messages have timestamps, because these are the only restrictions of the presented approach based on NLP text classification.

Storage system log message format
Currently, there are many different log formats, most of which are more or less related to the Syslog format. Syslog is defined by several major specifications, such as the Syslog Protocol (RFC 5424), BSD syslog (RFC 3164) and the Common Logfile Format (RFC 6872). In addition, RFC 5427 describes textual conventions for Syslog message classification, such as category types and severity levels of log messages.
According to the Syslog specification, a message consists of three major parts: header, structured data area and message area. Even if a log is not exactly Syslog-compatible, it usually contains all or some of the following common fields: -TimeStamp. This field contains the time at which the log message event occurred. There is no common format for the timestamp, so it can be like «Feb
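Extraction of such header fields can be sketched with a simple pattern. This is a minimal sketch assuming BSD-syslog-style (RFC 3164) lines; the hostname, tag and message text below are invented examples, and real logs vary enough that the pattern is illustrative, not exhaustive.

```python
import re

# Sketch of header-field extraction for a BSD-syslog-style line:
# "<timestamp> <host> <tag>[<pid>]: <free-text message>".
SYSLOG_RE = re.compile(
    r"^(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s"
    r"(?P<host>\S+)\s"
    r"(?P<tag>[^:\[\s]+)(?:\[(?P<pid>\d+)\])?:\s"
    r"(?P<message>.*)$"
)

def parse_syslog_line(line):
    """Return header fields and free-text message, or None if no match."""
    m = SYSLOG_RE.match(line)
    return m.groupdict() if m else None

rec = parse_syslog_line(
    "Feb  3 14:02:11 sp-a kernel[123]: I/O error on device sdb"
)
print(rec)
```

For the presented approach only the timestamp needs to be recovered reliably; the rest of the line can be passed to the text classifier as-is.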

Results and Discussion
The text classification process of the presented approach consists of the following steps:
- Text preprocessing. The text preprocessor removes non-alphabetical characters, splits tokens using a whitespace delimiter and applies the Porter stemming algorithm.
- Word embedding [19]. The word embedding algorithm transforms each token into a set of numerical values based on term frequency and inverse document frequency ranks.
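The two steps above can be sketched in a self-contained way. Note the simplifications: a crude suffix stripper stands in for the full Porter stemmer, and the log lines are invented examples; a real implementation would use a proper Porter stemmer and the actual corpus.

```python
import math
import re
from collections import Counter

def crude_stem(token):
    """Toy suffix stripper standing in for the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Drop non-alphabetical characters, split on whitespace, stem."""
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    return [crude_stem(t) for t in text.lower().split()]

def tfidf(docs):
    """Map each tokenized document to a {term: tf * idf} vector."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return out

docs = [preprocess("Disk error: rebuilding failed on device sdb"),
        preprocess("Link up on port eth0")]
vectors = tfidf(docs)
```

Terms occurring in every document (such as "on" here) get zero weight, which is exactly the behaviour that makes TF-IDF useful for separating fault-specific vocabulary from boilerplate.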
- Additional parameter calculation. Additional parameters are calculated automatically from text properties. All additional parameters are presented in table 2.
- Text classification. Text classification is performed with the Random Forest [20] machine learning classification algorithm. During evaluation, the described dataset was randomly split 10 times into training and testing subsets. The average fault detection rate on the testing subsets is 0.74, which is comparable with currently existing approaches. Separate evaluation of the algorithm with and without the additional parameters shows that without them it is around 20% less accurate.
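The evaluation protocol of repeated random splits can be sketched as follows. Only the protocol is illustrated: a trivial majority-class predictor stands in for the Random Forest classifier, and the tiny label set is invented, so the resulting score is meaningless beyond showing the mechanics.

```python
import random

# Sketch of the evaluation protocol: repeat a random train/test split
# `runs` times and average the detection rate across runs.
def evaluate(samples, labels, runs=10, test_frac=0.3, seed=42):
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    scores = []
    for _ in range(runs):
        rng.shuffle(idx)
        cut = int(len(idx) * (1 - test_frac))
        train, test = idx[:cut], idx[cut:]
        # "Training" the stand-in model: pick the most frequent label.
        majority = max(set(labels[i] for i in train),
                       key=lambda c: sum(labels[i] == c for i in train))
        correct = sum(labels[i] == majority for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)

labels = ["fault"] * 7 + ["ok"] * 3
score = evaluate(list(range(10)), labels)
print(round(score, 2))
```

In the paper's setting the stand-in model is replaced by a Random Forest trained on the TF-IDF vectors plus the additional parameters, and the averaged detection rate is the 0.74 figure reported above.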
Although the presented approach has the apparent drawback that it requires a decent amount of training data in order to add any new fault definition to the diagnostic model, its advantage is that it does not require any log structure analysis. The developed diagnostic procedure implementation is embedded in the storage system software and located on one of the storage processors (fig. 3).

Conclusions
The developed approach, based both on diagnostic modelling and on log data mining, has some important advantages over traditional model-based and data-driven approaches:
- External procedures (pre-trained models in the case of log processing) allow defining implicit relations between parameters and states of the storage system that cannot be defined using ontological object and data properties.
- The model creation process is much simpler and more straightforward.
- Model adaptation requires only topology provider and data source configuration.
- The diagnostic procedure can rely both on machine learning algorithms and on expert knowledge about the storage system structure and faults.
The proposed log mining algorithm can be applied to any plain-text log type, without prior knowledge of the log purpose or structure, provided its messages have timestamps and are correlated with faults that occurred in the storage system.
A possible future improvement of the text classification quality is the introduction of more complex word embedding models, such as BERT embeddings [].