A generic metadata management model for heterogeneous sources in a data warehouse

For more than three decades, data warehouses have been considered the only business intelligence storage system for enterprises. However, with the advent of big data, they have been modernized to support the variety and dynamics of data by adopting the data lake as a centralized repository for heterogeneous sources. Indeed, the data lake is characterized by its flexibility and performance when storing and analyzing data. However, the absence of a schema on the data during ingestion increases the risk of the data lake turning into a data swamp, so metadata management is essential to exploit the data lake. In this paper, we present a conceptual metadata management model for the data lake. Our solution is based on a functional architecture of the data lake as well as on a set of features ensuring the genericity of the metadata model. Furthermore, we present a set of transformation rules allowing us to translate our conceptual model into an OWL ontology.


Introduction
The main role of a decision-making system is to help decision-makers make strategic decisions effectively within companies. To achieve this goal, over the past 30 years these systems have continued to evolve their architectures to adapt to the evolution of data. Indeed, the volume, variety and speed of data production pose a real challenge to traditional storage systems, especially data warehouses. A data warehouse is a structured, non-volatile, historical storage system that organizes data by subject for analytical purposes [1]. However, the data warehouse lacks the capacity to handle semi-structured and unstructured data, which are typical features of big data [2]. As a result, the concept of the data lake has been introduced into the business intelligence architecture as a big data storage solution, capable of serving as a heterogeneous data source for the data warehouse. The data lake applies the "schema on read" strategy, in which the data is stored in a centralized repository in raw format and does not undergo any structuring, allowing flexibility in data processing as well as a high speed of data ingestion [3]. However, the lack of schemas for the data can easily make the data lake incomprehensible and inaccessible [4]. For this reason, metadata management must be applied to efficiently extract and analyze data. Metadata management can be implemented in the data lake either as a model or as an implementation [5]. Both need a good design strategy to discover, govern and exploit the data stored in the data lake. Indeed, such a system relies mainly on 1) a good architecture of the data lake to capture the metadata efficiently, 2) a classification of metadata so that it is well organized, and 3) a set of functionalities that the system must ensure to manage the traceability, confidentiality, quality and aggregation of data.

* Corresponding author: o.oukhouya@uiz.ac.ma
These metadata features help structure and contextualize the data stored in the data lake. In addition, they allow the metadata model to remain generic across different use cases. Indeed, across the various metadata management model works [5, 6], there are so far eight key features used to design a good metadata management system, namely: semantic enrichment, data polymorphism, data versioning, usage tracking, categorization, similarity links, metadata properties, and multiple granularity levels. For our work, these features are not sufficient when the data lake is used as a single source for the data warehouse. Indeed, the dynamics, volume and heterogeneity of big data lead to the evolution of the schema of the sources, which affects the schema of the data warehouse. In addition, the data structure must be changed and adapted in accordance with changes in the analytical requirements of decision makers. Therefore, the "schema evolution" functionality is important in a metadata management system.
The objective of this article is to enable the data lake to be efficiently exploited by the data warehouse by developing a complete conceptual metadata model, integrating the nine features and covering all areas of the data lake architecture. This article is divided into five sections. Section 1 gives a brief introduction. Section 2 presents the analysis and comparison of the works examined; it is organized into three subsections: categories of data lake architectures, typologies of metadata management, and metadata management models. Section 3 presents our contribution to the architecture of the data lake and the conceptual model of metadata management. Section 4 proposes the translation rules of the conceptual model into an OWL ontology. Finally, we summarize our work with a brief conclusion and future perspectives.

Related work
The use of the data lake as a heterogeneous big data source takes place in the context of the modernization of data warehouses. This modernization is based on a hybrid architecture between these two technologies, allowing the data warehouse to gain scalability, increase speed and throughput, reduce costs, and unify BI and AI analyses. For the data warehouse to gain these benefits, the data lake must use metadata to solve the problems of contextualizing and structuring data. To do this, the design of the metadata management model requires 1) a data lake architecture to extract the metadata effectively and 2) a metadata classification to meet user requirements. In this section, we review, on the one hand, data lake architectures and, on the other hand, metadata classifications and metadata management models. In addition, these works will be compared to extract the positive and negative points of each approach. This section is organized as follows: 2.1 data lake architectures, 2.2 metadata classifications, and 2.3 metadata management models.

Architecture of the data lake
We have identified two visions of functional architecture. The first is a multi-zone architecture distinguished by four zones: a raw data zone, a process zone, an access zone and a governance zone [11]. The second vision is built around the notion of data ponds, with five ponds: raw data, analog data, application data, textual data and archived data [12]. Having briefly presented the different data lake architectures in the literature, we are most interested in multi-zone architectures (technical and functional) [8, 11, 12], because they are better suited to the definition of the data lake [13]. The architecture of [12] classifies the raw data of the transient data pond into three categories: an analog data pond for semi-structured data, an application data pond for structured data, and a textual pond for unstructured data. These ponds are equipped with processing, presentation and analysis mechanisms. Once the data is no longer used, it is saved in the fifth, archived data pond. This architecture has the advantage of facilitating and accelerating analyses, but it suffers from the loss of raw data, because the raw data is deleted once transferred from the raw pond to the other ponds. Moreover, this absence of raw data contradicts the very concept of the data lake, which ingests and stores raw data for future analysis requests by the user. As for the architecture of [8], the data is ingested in its raw format into the transient loading zone under basic quality control. Subsequently, this data undergoes successive processing across several areas: the raw data area integrates and structures the data; the trusted zone cleans and normalizes the data; the discovery sandbox zone contains data dedicated to analysis; the consumption zone allows experts to explore the data; and the sixth zone is dedicated to governing the data lake. This architecture fits the context of the data lake well. However, the transit of data through the six zones creates problems related to the traceability of the data.
Concerning the latest architecture [11], the data is stored in its native format in the raw data area, either in batch and/or in real time. Afterwards, in the process area, the raw data is processed according to the analytical needs of the users. This area provides processing such as aggregation, selection, join and projection. Once processed, the data is transferred to the access zone, which is responsible for exposing the data and assigning access rights. The last area is governance, which applies to all the other areas. Its mission is to manage metadata and ensure the security, quality and life cycle of the data. This architecture is advantageous compared to [8], because the raw data is stored in a dedicated area, and compared to [12], because lineage is easier to control, since the data passes through only three layers. However, with this architecture, access is only possible for data stored in the process area. Based on this comparison, we note that each of these architectures has limits and that a new architecture is needed to solve the following problems: data lineage issues caused by the number of zones, deletion of the raw zone data, and data access performed from only a single area.

Typology of metadata
We have distinguished three typologies used to classify metadata in the data lake.
1) The first typology was drawn from the context of data warehouses and is described by the authors of [14] through three categories:
• Technical metadata: groups together the type, format and structure of the data.
• Operational metadata: manages information on data processing and lineage.
• Business metadata: includes governance processes, business names and descriptions.
2) The second typology was used for the first time in the context of the data lake by [20], and later completed in the work [11]. Metadata is classified into two categories:
• Intra-metadata: identify and classify each dataset through seven subcategories: data characteristics, definition, quality, security, navigation, access and lineage.
• Inter-metadata: describe and classify the relationships between datasets through four types: dataset containment, provenance, logical cluster, and content similarity.
3) The third typology adds a third category to the previous classification, as well as a change in the use of the dataset, which is replaced by the notion of object [5]. Their description is as follows:
• Intra-object metadata: refers to metadata relating to a specific object. Four types are distinguished: properties, summaries and previews, version and representation, and semantics of metadata.
• Inter-object metadata: describes the relationships between objects through three types of metadata: object groupings, similarity links, and parenthood relationships.
• Global metadata: contextualizes all the data stored in the lake in order to optimize their analysis. This category distinguishes three types, including resources.
After examining these three classifications, we notice that each has limitations. For example, the first classification does not cover the relationships between datasets, which is an important aspect of analysis in the data lake. In the third classification, when grouping objects, attention must be paid to the functions chosen, as this impacts the categorization of sources and data types. However, we find the extensibility of the second classification more interesting than the other two, and, for us, the use of datasets is more expressive for the data lake than that of objects.
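To make the second typology concrete, the following minimal Python sketch models intra-metadata as a record attached to a single dataset and inter-metadata as typed links between datasets. All class, field and dataset names here are our own illustrative assumptions, not part of any published model.

```python
from dataclasses import dataclass, field

@dataclass
class IntraMetadata:
    """Metadata attached to one dataset (the 7 subcategories of [11])."""
    characteristics: dict = field(default_factory=dict)  # format, size, type...
    definition: str = ""
    quality: dict = field(default_factory=dict)
    security: dict = field(default_factory=dict)
    navigation: dict = field(default_factory=dict)
    access: dict = field(default_factory=dict)
    lineage: list = field(default_factory=list)

@dataclass
class InterMetadata:
    """A typed relationship between two datasets (the 4 inter-metadata types)."""
    source: str
    target: str
    kind: str  # "containment" | "provenance" | "logical_cluster" | "content_similarity"

# Hypothetical catalog: two datasets and one provenance link between them.
catalog = {
    "sales_raw": IntraMetadata(definition="Raw sales events",
                               characteristics={"format": "json"}),
    "sales_clean": IntraMetadata(definition="Cleaned sales",
                                 characteristics={"format": "parquet"}),
}
links = [InterMetadata("sales_raw", "sales_clean", kind="provenance")]
```

The point of the split is visible in the sketch: intra-metadata can be queried per dataset, while inter-metadata forms a graph over the catalog that navigation and lineage queries can traverse.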

Managing metadata in the data lake
The first work [15] proposes a graph-based metadata model to unify and integrate heterogeneous data sources into a data lake. The particularity of this model is that it gives a partial structure to unstructured data before integrating it into the data lake. The metadata functionalities proposed by this model are mainly the generation of data semantics, the definition of similarity relationships between different data, and the possibility of keeping several representations of the same data. In [16], the authors propose a conceptual metadata schema described by a UML class diagram and validated by an implementation of the metadata management system on two DBMSs, a relational one and a NoSQL graph one. This schema is based on an intra-metadata (for each dataset) and inter-metadata (relationships between datasets) classification that covers all processing and exploration in a data lake. This classification contains most of the major functionalities of a metadata management system; however, the authors excluded the description of the data context. Another work [5] describes a metadata management model covering the whole functional architecture of a data lake. Based on a graph model, this system uses the notion of object instead of dataset and a classification into three types: intra-object, inter-object and global. In addition, it allows contextualizing the data in order to facilitate its analysis. More recently, the approach of [17] consists of a base model capable of integrating any fundamental entity and relationship to support any metadata management use case. In addition, this model uses three extensions, each of which handles metadata according to: 1) the different fields of the data lake, 2) the finest level of granularity, and 3) the category of the metadata. The authors argued that their model is more generic compared to the work [5].
This is due to the absence of granularity in the latter model. For this reason, the authors presented an evolved version of their model named GOLDMEDAL [6]. This model is based on the generalization of three concepts seen in [7], namely: version and representation, transformations and updates, and finally similarity links, generalized by the global concept of link. Table 1 below summarizes this study by comparing the phases covered by the metadata, the type of classification used, and the eight metadata functionalities, namely: semantic enrichment (SE), data polymorphism (DP), data versioning (DV), usage tracking (UT), categorization (CG), similarity links (SL), metadata properties (MP), and multiple granularity levels (GM). We have also summarized this study by the implementation model of the metadata management system.
From this table, we can see that the model [6] is the most generic compared to the other works, with all the functionalities, unlike the model [15], which is the least expressive for metadata management, with four functionalities. The other models have seven features: [17] does not support data versioning, and [5, 16] do not cover granularity. However, the support for data evolution in [5, 6, 16] covers only the transformation of the data from one version to another. Moreover, the evolution of the structure of the data during ingestion is not supported by these works. These two concepts have often been confused in the literature because they are closely related, but each has its own approach. Indeed, schema evolution is applied only if there is a transformation between two versions of data, allowing only one current version of the data schema, whereas data versioning transfers the data from the existing schema to a new schema in order to apply the necessary modifications to it. Therefore, after clarifying the two concepts, we see that the models examined above are not generic enough to handle a use case involving the evolution of the data schema. This theme has been widely investigated in the context of data warehouses, mainly because the data schema must evolve over time to meet the new requests of decision makers and adapt to the heterogeneity of data sources, unlike the data lake, for which research is scarce. Indeed, the work [18] presents the Constance tool for the management of metadata in the data lake. The architecture of this system is based on three layers: ingestion, maintenance and query. The first layer ingests the raw data from heterogeneous sources for native storage in the data lake. This layer also includes extractors capable of detecting and adapting to new data sources.
The second layer enables metadata discovery and matching, and the final layer provides quality control and user interaction through a query rewriting engine. Another work [19], set in the context of using the data lake as a heterogeneous source for the data warehouse, presents a metadata management model ensuring semantic enrichment, categorization of data, metadata properties, data granularity, and schema evolution. The work relies on five algorithms to adapt the structure to the changes detected at the level of data sources, metadata properties, datasets, data attributes, and, finally, relationships. It should be noted that we limited ourselves to studies addressing schema evolution (SEv) in the context of metadata management in the data lake. Table 2 below compares all of the works seen in this subsection with respect to the functionalities supported by the model or metadata system, using the features as a basis for comparison. The table confirms that, so far, no model supports all nine features and that, despite the importance of integrating schema evolution into metadata management, it is neglected by most of the works. In the next section, we propose our conceptual metadata management model, making it possible to achieve genericity in the data lake with the nine functionalities presented previously.
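The distinction drawn above between schema evolution and data versioning can be illustrated with a small Python sketch (our own illustration; the function names and example attributes are assumptions): schema evolution extends the single current schema when an ingested record carries an unseen attribute, while data versioning keeps every schema version and applies changes to a copy.

```python
def evolve_schema(current_schema: dict, record: dict) -> dict:
    """Schema evolution: extend the one current schema in place
    with any attribute observed at ingestion time."""
    for attr, value in record.items():
        current_schema.setdefault(attr, type(value).__name__)
    return current_schema

def new_version(versions: list, schema: dict, changes: dict) -> list:
    """Data versioning: never mutate an existing schema; append a
    modified copy so that all versions coexist."""
    versions.append({**schema, **changes})
    return versions

# Schema evolution: a new "currency" attribute appears in an ingested record.
schema = {"id": "int", "amount": "float"}
evolve_schema(schema, {"id": 1, "amount": 9.5, "currency": "EUR"})
# There is still exactly one current schema, now including "currency".

# Data versioning: the change produces a second schema alongside the first.
versions = [dict(schema)]
new_version(versions, schema, {"amount": "decimal"})
```

After running the sketch, `schema` is the single evolved schema, whereas `versions` holds both the original and the modified schema side by side, which is precisely why the two features must be tracked separately in a metadata model.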

The Proposed Metadata Management Model: Architecture and Design
As seen earlier, the design of a generic metadata management model is based on a reliable data lake architecture and a metadata classification that meets user requirements. We also observed that a metadata management model is said to be generic when it covers a set of features capable of handling any use case in the data lake.
In this section, we present our contribution with respect to 1) the design of a data lake architecture addressing all the limits seen in subsection 2.1, and 2) a generic metadata management model covering the set of functionalities seen in subsection 2.3.

Functional architecture of the data lake
The study conducted in the previous section revealed that the current architectures suffer from the following limitations: the absence of a raw zone to store and preserve the raw data; the loss of traceability of the data caused by an excessive number of data lake zones; and data access performed from only a single data lake zone. To overcome these limits, we propose in Figure 1 our functional architecture of the data lake, which contains five zones, namely:
• Raw data management zone: it allows integrating and ingesting data into the data lake in native format.
• Data processing zone: in batch or streaming, this zone is used to process and transform data using operations such as join, aggregation, selection and projection.
• Data trust zone: this zone is used to store highly cleaned data. It can be used as a source to feed a data warehouse.
• Data access zone: this zone allows the user to access all zones of the data lake, either to feed the data warehouse from the trust zone or to apply analytical processes.
• Governance zone: it applies to all the other zones. It is responsible for ensuring data security and confidentiality, data quality, metadata management and master data management.
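The five zones and the data movements between them can be sketched as follows (a minimal illustration of the architecture described above; the zone identifiers and the `can_move` helper are our own naming, not part of the model):

```python
from enum import Enum

class Zone(Enum):
    RAW = "raw data management"
    PROCESSING = "data processing"
    TRUST = "data trust"
    ACCESS = "data access"
    GOVERNANCE = "governance"

# Data flows forward: raw -> processing -> trust (which feeds the warehouse).
FLOWS = {
    Zone.RAW: {Zone.PROCESSING},
    Zone.PROCESSING: {Zone.TRUST},
    Zone.TRUST: set(),  # downstream consumer is the data warehouse
}

# Unlike [11], the access zone can read data from every storage zone.
ACCESS_READS = {Zone.RAW, Zone.PROCESSING, Zone.TRUST}

# Governance is transversal: it supervises all the other zones.
GOVERNED = {z for z in Zone if z is not Zone.GOVERNANCE}

def can_move(src: Zone, dst: Zone) -> bool:
    """Check whether data may be transferred from one zone to another."""
    return dst in FLOWS.get(src, set())
```

Restricting `FLOWS` to three storage layers is what keeps the lineage short, while `ACCESS_READS` covering all three zones removes the single-access-point limitation noted for [11].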

UML-to-OWL transformation rules
For each UML class transformed into OWL, declaration axioms are defined for the class, and each attribute becomes a data property with its domain and range [24, 25]. A binary association is transformed into OWL by declaring object property axioms, the object property domain and range for the association ends, and an inverse object property expressing the relationship between the two classes [26]. According to [27], if the association end names are not defined, the class name in lowercase letters is used.
It should be noted that UML aggregations and compositions can only be transformed into OWL as binary associations. This approach loses the specific semantics of composition and aggregation, which cannot be expressed in OWL.
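The two rules above can be sketched in pure Python, emitting the declaration axioms as RDF triples (strings with prefixed names). The entity names used here (`Dataset`, `Source`, `hasSource`) and the `_inv` suffix for the inverse property are illustrative assumptions, not taken from the paper's model:

```python
def uml_class_to_owl(cls: str, attributes: dict) -> list:
    """Rule 1: a UML class becomes an owl:Class; each attribute becomes an
    owl:DatatypeProperty whose domain is the class and whose range is the
    attribute's XSD datatype."""
    triples = [(cls, "rdf:type", "owl:Class")]
    for attr, xsd_type in attributes.items():
        triples += [
            (attr, "rdf:type", "owl:DatatypeProperty"),
            (attr, "rdfs:domain", cls),
            (attr, "rdfs:range", xsd_type),
        ]
    return triples

def association_to_owl(src: str, dst: str, name: str = None) -> list:
    """Rule 2: a binary association becomes an owl:ObjectProperty with domain
    and range taken from the association ends, plus an inverse object
    property. If no end name is given, the target class name in lowercase
    is used [27]; the "_inv" suffix for the inverse is our own convention."""
    prop = name or dst.lower()
    inverse = f"{prop}_inv"
    return [
        (prop, "rdf:type", "owl:ObjectProperty"),
        (prop, "rdfs:domain", src),
        (prop, "rdfs:range", dst),
        (inverse, "rdf:type", "owl:ObjectProperty"),
        (inverse, "owl:inverseOf", prop),
    ]

# A class with one attribute, plus an association between two classes.
axioms = uml_class_to_owl("Dataset", {"name": "xsd:string"})
axioms += association_to_owl("Dataset", "Source", name="hasSource")
```

As the second rule shows, an aggregation or composition run through `association_to_owl` yields exactly the same axioms as a plain association, which is the semantic loss noted above.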

Conclusion
In this article, as part of the use of the data lake as a heterogeneous source for data warehouses, a conceptual metadata management model was presented to address the issues associated with the transformation of the data lake into a data swamp. We first proposed a data lake architecture overcoming the limitations of the architectures proposed in the literature. Then, based on the inter- and intra-metadata classification, we proposed our conceptual metadata management model covering all the functionalities expected of a metadata system. Moreover, we added the new functionality "schema evolution", which is essential for schema change, whether in the context of the data lake or of the data warehouse. Finally, we proposed a set of rules to transform the elements of the conceptual model into an OWL ontology. In our future work, based on the conceptual model and transformation rules presented in this article, we will build a semantic data lake, using the RDF graph as an implementation model for metadata management. This will allow us to build a semantic layer between the data lake and the data warehouse, making it possible not only to structure and contextualize the data in the data lake, but also to improve the performance of OLAP analyses and queries.