The problem of determining the threshold for statistical analysis by the POT method: Application to wave data on the Moroccan Atlantic coast

In a study of extreme waves by the Peak Over Threshold (POT) method, determining the data censoring threshold is an essential step. A wrong choice of threshold can lead to an erroneous design wave height and, consequently, to poorly designed maritime structures such as breakwaters for deep-sea ports. In this study, we analyzed the influence of the threshold on the hundred-year return period wave height, which is generally used for the design of maritime structures. The sensitivity study confirmed that the exponential model is the best probability distribution for describing wave data at two points on the Moroccan Atlantic coast for the period from 1958 to 2019. It also confirmed that a wrong choice of the statistical distribution and a wrong choice of the threshold lead to significant errors in the estimated design wave height.
* Corresponding author: albakalihosny@gmail.com
© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/). E3S Web of Conferences 314, 04002 (2021) https://doi.org/10.1051/e3sconf/202131404002


Introduction
The determination of the design wave height is a crucial step in the design of maritime structures. The significant wave height generally considered is the 100-year return period wave, which is determined by statistical analysis of the wave data available over periods of a few decades. The first wave recordings were made in the 1960s [1]; the available records are therefore too short for an accurate calculation of waves with a low probability of occurrence. Extreme wave heights are determined by extrapolation from statistical distributions fitted to the wave data.
Two methods are used to select the data for statistical analysis:
• The first approach is the Block Maxima method [2]. The data are the maximum values over regular time intervals, generally one year. In this particular case, the method is referred to as the Annual Maxima method (MA).
• The second approach is the Peak Over Threshold (POT) method [3].
Both methods assume independent and identically distributed (iid) random variables. The POT method requires iid values above a chosen threshold.
The POT method has the advantage of considering all the significant waves [2]; nevertheless, the determination of the censoring threshold is the key step for obtaining the most relevant extreme wave data and, consequently, the best estimate of the design wave height.
The threshold is set so that the average number of events per year over the entire data period is equal to or slightly less than Na, the number of extreme storms per year in the study area [4]. The authors of [5] propose determining a first threshold u1 that yields between 5 and 10 events per year, similar to the number of significant storms in extratropical zones. The second threshold u2 is the abscissa at which the mean residual life plot becomes linear [8]. The threshold u2 should not be too high, so that enough data remain for the statistical analysis. The number Na should remain approximately between 2 and 5 [5].
The existing methods do not allow a single choice of threshold regardless of the size of the data and the study area. A local sensitivity study of extreme waves for the case of the Moroccan Atlantic coast is therefore required to propose the safest approach.
In this paper, we present the impact of the threshold variation on the predicted 100-year return period significant wave height. We carry out a sensitivity study at two points on the northern Atlantic coast of Morocco. The wave data are taken from the SIMAR-44 database for the period from 1958 to 2019 at two points, SIMAR network N° 1050036 and N° 1042030, located off the coast of the cities of Mohammedia and Safi.

Method for selecting independent waves
The authors of [5] and [6] recommend verifying the following criteria when selecting the extreme storm values; these criteria are fixed for each site according to the local meteorological and maritime conditions:
• The storm remains above the chosen threshold U for a duration greater than or equal to NH, with the possibility that it subsides below the threshold for a duration of nH (nH is around 3 hours).
• A time lag of 1 to 3 days separates two extreme values.
According to Caires and Sterl [7], the time interval NH between two independent storms is 48 hours. To ensure the independence of the storms, we adopt a minimum interval of 48 hours between two successive extreme values.
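The 48-hour independence criterion can be illustrated with a short Python sketch. It assumes a time series of significant wave heights with timestamps expressed in hours, and it uses a simple greedy rule (keep the largest exceedance first, then discard any exceedance closer than 48 hours to a retained peak). The function names and the greedy rule are illustrative assumptions, not necessarily the procedure applied by the authors.

```python
from typing import List, Tuple


def select_independent_peaks(
    times_h: List[float],
    hs: List[float],
    threshold: float,
    min_gap_h: float = 48.0,
) -> List[Tuple[float, float]]:
    """Return (time, Hs) peaks above `threshold` that are at least
    `min_gap_h` hours apart (greedy, largest exceedance first)."""
    # Keep only observations strictly above the threshold, largest first.
    candidates = sorted(
        ((t, h) for t, h in zip(times_h, hs) if h > threshold),
        key=lambda p: p[1],
        reverse=True,
    )
    peaks: List[Tuple[float, float]] = []
    for t, h in candidates:
        # Retain a candidate only if it is far enough from every kept peak.
        if all(abs(t - tp) >= min_gap_h for tp, _ in peaks):
            peaks.append((t, h))
    # Return the retained peaks in chronological order.
    return sorted(peaks)


def events_per_year(n_peaks: int, record_years: float) -> float:
    """Average number of retained peaks per year of record."""
    return n_peaks / record_years
```

Applied to a candidate threshold, `events_per_year(len(peaks), 62.0)` gives the average yearly number of events that can be compared with the recommended range for Na.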

Determination of the threshold
The mean excess function (MEF) gives the expected excess over the threshold u, given that an exceedance occurs [8]. It is defined by (1):
$$e(u) = E[X - u \mid X > u] \qquad (1)$$
A linear trend of the mean excess function from a threshold uT onward indicates a stabilization of the parameters σu and ξ of the Generalized Pareto Distribution fitted to the data; uT is therefore a possible value of the threshold [8].
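The empirical counterpart of the mean excess function is the sample mean of the excesses x - u over all observations x above u. The following Python sketch computes it over a grid of candidate thresholds so that the curve can be plotted and inspected for linearity; the function names are illustrative assumptions.

```python
from typing import List, Tuple


def mean_excess(values: List[float], u: float) -> float:
    """Sample mean of (x - u) over the observations x strictly above u."""
    excesses = [x - u for x in values if x > u]
    if not excesses:
        raise ValueError("no observation exceeds the threshold")
    return sum(excesses) / len(excesses)


def mean_excess_curve(
    values: List[float], thresholds: List[float]
) -> List[Tuple[float, float]]:
    """(u, e(u)) pairs for every threshold that still has exceedances."""
    top = max(values)
    return [(u, mean_excess(values, u)) for u in thresholds if u < top]
```

The highest thresholds leave very few exceedances, so the right-hand end of the curve is noisy and should be read with caution.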

Software used
We used the HYFRAN-PLUS software, designed for the analysis of hydraulic data. This software fits several statistical models to the data and compares the fits both graphically and with goodness-of-fit tests [9].

Calculation of empirical non-exceeding probabilities
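The general formula for determining the empirical probabilities of non-exceedance is (2):
$$F(x_{(i)}) = \frac{i - \alpha}{N + 1 - 2\alpha} \qquad (2)$$
where i is the rank of the observation in ascending order and N is the sample size. We use the compromise plotting position proposed by Cunnane [10], with α = 0.4.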
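As an illustration, the Python sketch below assigns empirical non-exceedance probabilities to a sample, assuming the usual general plotting-position form (i - α)/(N + 1 - 2α) with ranks i taken in ascending order. The function name is an illustrative assumption.

```python
from typing import List, Tuple


def plotting_positions(
    sample: List[float], alpha: float = 0.4
) -> List[Tuple[float, float]]:
    """Pair each sorted value with its empirical non-exceedance
    probability (i - alpha) / (N + 1 - 2 * alpha), i = 1..N."""
    ordered = sorted(sample)
    n = len(ordered)
    denom = n + 1 - 2 * alpha
    return [(x, (i - alpha) / denom) for i, x in enumerate(ordered, start=1)]
```

With α = 0.4 the probabilities are symmetric about 0.5 and never reach 0 or 1, which keeps the smallest and largest observations plottable on probability paper.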

Method for selecting the most suitable distribution model
The model selection is made by a multi-distribution analysis. We studied the theoretical distributions: Gumbel, GEV, Weibull, Gamma, Inverse Gamma, Lognormal, and Exponential.
An initial shortlist of the most suitable distributions is made by comparing the graphical fits to the data [11]. The final selection of the best theoretical model is based on the goodness-of-fit criteria: the χ2 test [12], AIC [13], and BIC [14].
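For reference, AIC and BIC are both computed from the maximized log-likelihood: AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of fitted parameters and n the sample size. The Python sketch below evaluates them for an exponential model fitted to threshold excesses by maximum likelihood. The function names are illustrative assumptions, and this is not the HYFRAN-PLUS implementation.

```python
import math
from typing import List, Tuple


def exponential_mle_loglik(excesses: List[float]) -> Tuple[float, float]:
    """Return (sigma, log-likelihood) for an exponential fit to positive
    excesses; the maximum likelihood scale sigma is the sample mean."""
    n = len(excesses)
    if n == 0 or min(excesses) <= 0:
        raise ValueError("excesses must be a non-empty list of positive values")
    sigma = sum(excesses) / n
    loglik = -n * math.log(sigma) - sum(excesses) / sigma
    return sigma, loglik


def aic(loglik: float, k: int) -> float:
    """Akaike information criterion (lower is better)."""
    return 2 * k - 2 * loglik


def bic(loglik: float, k: int, n: int) -> float:
    """Bayesian information criterion (lower is better)."""
    return k * math.log(n) - 2 * loglik
```

Both criteria penalize additional parameters, so a one-parameter exponential model is preferred over a three-parameter GEV unless the latter improves the likelihood enough to offset the penalty.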

Synthesis of the results on the coast of Mohammedia city - Point SIMAR network N° 1050036
Determination of censorship thresholds for wave data
Figure 1 presents the mean residual life plot of the extreme wave data. In addition to the graphically determined thresholds, and for comparison purposes, we added a value u' = 5.00. The total number of waves above each threshold and the average number per year are presented in Table 1.
Table 2 presents the results of the goodness-of-fit tests for the distributions that fit best graphically. Based on the graphical comparison and the goodness-of-fit tests, we conclude that the adequate model is the exponential distribution, with parameters estimated by the maximum likelihood method. Figure 3 presents the hundred-year return period wave height as a function of several threshold values. The retained wave height values are shown in solid black.
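For an exponential model of the excesses, the T-year return level follows from the standard POT relation x_T = u + σ ln(λT), where σ is the exponential scale (its maximum likelihood estimate is the mean excess) and λ is the average number of peaks per year above u. The Python sketch below is an illustration under those assumptions; the function name and arguments are hypothetical, and it does not reproduce the HYFRAN-PLUS computation.

```python
import math
from typing import List


def exponential_return_level(
    excesses: List[float],
    threshold: float,
    peaks_per_year: float,
    return_period_years: float = 100.0,
) -> float:
    """Return level x_T = u + sigma * ln(lambda * T) for exponentially
    distributed excesses over the threshold u."""
    if not excesses:
        raise ValueError("at least one excess is required")
    if peaks_per_year * return_period_years <= 1.0:
        raise ValueError("lambda * T must exceed 1 for extrapolation")
    # Maximum likelihood estimate of the exponential scale.
    sigma = sum(excesses) / len(excesses)
    return threshold + sigma * math.log(peaks_per_year * return_period_years)
```

Repeating this calculation for each candidate threshold, with the excesses and the yearly rate recomputed each time, gives the kind of threshold sensitivity curve discussed in this section.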

Determination of censorship thresholds for wave data
Figure 4 presents the mean excess function of the extreme wave data. The total number of storms above each threshold and the average number per year are presented in Table 3.
The results of the goodness-of-fit tests for the models with the best graphical fit are given in Table 4. Based on the graphical comparison, we conclude that the adequate model is the Pearson III distribution, with parameters estimated by the method of moments.
Figure 6 presents the hundred-year return period wave height for the statistical distributions with the best graphical fit.

Results analysis
The results of the sensitivity study highlight the following points:
• The GEV model, with parameters estimated by the weighted moments method, shows the greatest sensitivity to variation of the threshold.
• The asymptotic trend of the forecasting model remains insensitive to the change of threshold. Indeed, the exponential model remains an excellent graphical fit to the selected data for the different threshold values.
• Extreme threshold values (u < 4.5 m and u > 5.5 m) generate significant disparity in the forecast extreme waves.
• Threshold values in the interval [4.5 m; 5.5 m] generate insignificant disparity in the forecast extreme waves. Hence, the stability interval of the hundred-year return period wave height indicates that the threshold must lie approximately between 4.5 m and 5.5 m.
• The threshold stability interval corresponds to a number of events per year (Na) between 1.2 and 3.6 for the Mohammedia point and between 1.2 and 4 for the Safi point. These results confirm the importance of keeping Na between 2 and 5, as recommended by Mazas and Hamm [5].
• A threshold that is too high implies an erroneous forecasting model for extreme waves; this observation agrees with Coles [8].
The parent distribution is a local characteristic that varies from one zone to another [15]. In the study area between Mohammedia and Safi, the model that best fits the data is the exponential model. This threshold sensitivity study showed that an overestimation or underestimation of the censoring threshold leads to an erroneous forecasting model.
It should be noted that, for the data at the point off Mohammedia's coast, the threshold U3 = 5.55 appears to mark a transition between the two models, exponential and GEV; Figure 7 presents the two fits. The goodness-of-fit tests favor the GEV model. However, the graphical comparison indicates that this theoretical distribution overestimates the extreme values.

Conclusion & Discussion
A statistical study of wave data over a 62-year period assumes that the data are stationary and that there is no long-term variation due to climate change. This hypothesis is not valid according to studies of wave climate evolution during the 20th century [16]. Therefore, this study could be extended to examine new models for the trend of extreme values under climate change.
Examination of the available data for the 62 years does not reveal a worsening of extreme events in the study area. Hence, the hypothesis of identically distributed data is a priori a safe assumption. Extending this analysis to the entire Moroccan Atlantic coast will allow this coast to be divided into homogeneous zones. Each zone can then be described with an adequate theoretical model for forecasting extreme values.
The regional analysis for the determination of the censorship threshold made it possible to define the interval of the most suitable values. This analysis could be extended to other points to confirm the results and to determine the censorship threshold intervals along the entire Moroccan Atlantic coast.