Application of the partitioning around medoids approach to the computation of regional flood and landslide quantiles

Floods and landslides cause serious damage to the functioning of society, resulting in heavy loss of human life, material losses and other environmental impacts. In this paper, the partitioning around medoids approach is applied to the assessment of flood quantiles over 145 sites using 11 basin characteristics. The partitioning algorithm classifies the study region into 6 clusters, which are then shown to be homogeneous by applying the heterogeneity measure test. The study provides regional flood quantile estimates for ungauged sites, derived from L-moments with good accuracy limits, for recurrence intervals of 50, 100, 200 and 500 years. Like floods, landslides may be caused by rainfall, especially over long periods, which both increases the weight of slopes and can lubricate planes of weakness within rock or sediment. It is shown that landslides are also concentrated in some of the clustered zones, depending on the geological conditions of the clusters. Thus, regional flood quantiles in conjunction with geology and topography form landslide activity quantiles.


Introduction
The problem of preventing hazardous natural and anthropogenic processes and mitigating the severity of their effects is relevant for most of the world's territories, including Russia and India. Expected economic and social risks are particularly high in the North Caucasus owing to its high population density and high level of volcanic, seismic, landslide, glacial, mudflow, flood and other hazards. At the same time, specific factors in the evolution of the natural environment can provoke catastrophic natural and natural-anthropogenic processes that have not been studied well enough for reliable prognosis and are therefore often unexpected and destructive. Examples include a powerful landslide near the village of Mizur in 2002 and surface failures near the village of Sadon, largely a response to intensive and irrational mining activity in the past. In 2002, the Kolka glacier unexpectedly collapsed along the northern slope of the Kazbek volcano in the Karmadon Gorge, killing more than 100 people and causing an ecological catastrophe [1]. Later, on May 17, 2014, a large amount of rock and ice collapsed from the opposite slope of Kazbek in the area of the Devdorak glacier, in the territory of Georgia. High rainfall preceding these catastrophic events is assumed to be one of the main factors. Water saturation is critical in such processes as liquefaction [2,3]. In August 2018, more than 400 people died as a result of landslides and downpours in the south and west of India. Flooding and landslides led to significant disruption of rail and road traffic in the south. In India, the current monsoon season is one of the most active in the last 100 years.
The most dangerous hydrological hazard is the riverine flood, which is caused by the overflow of water from the river channel onto normally dry land. The impact of floods is severe, as they endanger human life and, in particular, engineering structures near rivers. Flood frequency analysis determines the relationship between the magnitude and frequency of floods, which is essential in constructing dams, culverts and bridges [4]. At-site flood estimates can be determined easily, since the data are readily available at the gauge [5]. Predicting flood quantile magnitudes for ungauged sites, where very little information is available, is however one of the biggest challenges faced by water resource engineers [6][7][8][9]. Regional flood frequency analysis (RFFA) provides flood information at ungauged sites, which is useful in floodplain management. The main steps in RFFA are to identify homogeneous regions, to choose the appropriate frequency distribution for the region, and finally to determine the flood quantiles at the target site [10][11][12]. Annual maximum flood (AMF) and peaks over threshold (POT) series are widely used in the prediction of extreme events. Many studies have developed regionalization approaches designed to assess extreme events at ungauged sites by extracting information from a group of similar gauged watersheds [13][14]. The region of influence (ROI) approach applies Euclidean distance to group sites based on the information extracted from the gauged sites with respect to the target site considered [15].
Clustering techniques are widely used in regionalization: partitioning and hierarchical clustering algorithms place similar sites in one cluster and dissimilar sites in others [16]. Fuzzy clustering has been combined with a self-organizing feature map to delineate watersheds for estimating flood quantiles at ungauged sites [17]. A new regional flood frequency analysis approach, based on anarchy connected with each data set for detecting outliers, has been developed in recent years [18]. Canonical correlation analysis concentrates mainly on the linear relationship between variables when applied to ungauged streams [19]. The Partitioning Around Medoids (PAM) clustering algorithm operates on a distance matrix and uses medoids as the cluster centers, which is very useful when data points cannot be assigned to any of the clusters [20].
Many studies have concentrated on delineating homogeneous regions and on estimating regional quantiles using non-stationarity principles [21][22][23]. This study groups the sites based on medoids rather than the traditional k-means method, which uses means as cluster centroids and suffers from major drawbacks in dealing with noise and outliers. In addition, the medoid-based approach minimizes the sum of dissimilarities between pairs of data points rather than the sum of squared Euclidean distances, as k-means does. The objectives of the study are (1) to classify the region into groups by applying the partitioning around medoids algorithm; (2) to determine the homogeneity of the clusters formed; (3) to identify the frequency distribution that best fits the data; (4) to estimate the flood quantiles for the homogeneous sites and to derive the distribution parameters at the 0.90 level of significance; and (5) to evaluate the performance of the approach using error statistics.

Study area
The partitioning around medoids approach for the assessment of flood quantiles is applied to 145 sites in the Indiana watershed, which lies in the east-central United States (Figure 1). Annual maximum peak values for all streamflow sites with more than 10 years of data were obtained from the United States Geological Survey (USGS). The physiographic attributes are drainage area, main channel slope, main channel length, average basin elevation, storage (percentage of the contributing drainage area covered by water bodies), forest (percentage of the drainage area covered by forest), soil runoff coefficient and site location. The meteorological attributes are mean annual precipitation and I24,2 (the 24-hour rainfall with a return period of 2 years). The attribute data were extracted from [24]. Cluster analysis is applied to the attributes of the 145 sites, and the study region is divided into 6 clusters in total.

Methodology
A discordancy measure is used to eliminate conflicting sites from the group: any site whose discordancy value exceeds the critical value of 3 is removed [25]. The shape of the frequency distribution is obtained using L-moments, from which the location, scale (dispersion), skewness and kurtosis of the distribution are also derived.
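The L-moment ratios used throughout the analysis can be computed directly from an annual maximum series. The following is a minimal sketch, assuming the standard unbiased probability-weighted-moment estimators; the function name `sample_lmoment_ratios` is illustrative and not taken from the study, and a record of at least 4 values is assumed.

```python
import numpy as np

def sample_lmoment_ratios(x):
    """Return (mean, L-CV, L-skewness, L-kurtosis) of a sample.

    Uses unbiased probability-weighted moments b0..b3 of the sorted data.
    Assumes at least 4 observations and a non-zero mean.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    j = np.arange(1, n + 1)  # rank of each ordered observation
    b0 = x.mean()
    b1 = np.sum((j - 1) / (n - 1) * x) / n
    b2 = np.sum((j - 1) * (j - 2) / ((n - 1) * (n - 2)) * x) / n
    b3 = np.sum((j - 1) * (j - 2) * (j - 3)
                / ((n - 1) * (n - 2) * (n - 3)) * x) / n
    l1 = b0                                   # location (mean)
    l2 = 2 * b1 - b0                          # scale
    l3 = 6 * b2 - 6 * b1 + b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
    return l1, l2 / l1, l3 / l2, l4 / l2

# Illustrative (made-up) annual maximum peaks for one site
mean, l_cv, l_skew, l_kurt = sample_lmoment_ratios([120, 95, 210, 160, 330, 140, 180])
```

Working with ratios rather than raw L-moments makes sites of different size comparable, which is what the discordancy and homogeneity tests rely on.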
Cluster analysis partitions the data on the principle that similar data points occupy one cluster and dissimilar data points form the others. In this study, PAM (Partitioning Around Medoids) is used to cluster the Indiana watershed based on the attributes listed in the previous section. PAM is a partitioning-based clustering method built on medoids, that is, representative data points that serve as the cluster centers [26][27][28]. A sequence of objects is first selected as medoids, forming the set T. Let Q be the set of all objects and R the set of unselected objects. Every object i is assigned two numbers: Ki, the dissimilarity between the object and its nearest object in T, and Li, the dissimilarity between the object and its second-nearest object in T. These two numbers must be updated throughout the SWITCH process. A homogeneity measure is then applied to the clusters formed in order to identify the homogeneous groups. It compares the variation observed between sites with that expected of a homogeneous region, using the L-moment ratios of all the sites.
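The medoid selection and exchange steps described above can be sketched as follows. This is a simplified illustration that works on a precomputed dissimilarity matrix and recomputes the total cost by brute force at each trial exchange, rather than maintaining the nearest and second-nearest dissimilarities (Ki, Li) incrementally; the function name `pam` and its arguments are assumptions, not the implementation used in the study.

```python
import numpy as np

def pam(dist, k, max_iter=100):
    """Partition objects around k medoids from a square dissimilarity matrix.

    Returns (medoid indices, cluster label per object, total cost).
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]

    # Initial selection: most central object first, then greedy additions.
    medoids = [int(np.argmin(dist.sum(axis=1)))]
    while len(medoids) < k:
        nearest = dist[:, medoids].min(axis=1)
        candidates = [c for c in range(n) if c not in medoids]
        # Total dissimilarity if candidate c were added as a medoid
        costs = [np.minimum(nearest, dist[:, c]).sum() for c in candidates]
        medoids.append(candidates[int(np.argmin(costs))])

    # Exchange phase: apply the best medoid/non-medoid swap while cost drops.
    cost = dist[:, medoids].min(axis=1).sum()
    for _ in range(max_iter):
        best_cost, best_pos, best_obj = cost, None, None
        for pos in range(k):
            for c in range(n):
                if c in medoids:
                    continue
                trial = medoids.copy()
                trial[pos] = c
                trial_cost = dist[:, trial].min(axis=1).sum()
                if trial_cost < best_cost - 1e-12:
                    best_cost, best_pos, best_obj = trial_cost, pos, c
        if best_pos is None:  # no improving swap left
            break
        medoids[best_pos] = best_obj
        cost = best_cost

    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels, float(cost)

# Toy example: six one-dimensional "sites" forming two obvious groups
points = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
d = np.abs(points[:, None] - points[None, :])
medoids, labels, cost = pam(d, k=2)
```

Because only dissimilarities enter the cost, a single extreme site shifts a medoid far less than it would shift a k-means centroid.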
Once the clusters have been tested for homogeneity, the next stage is to select, for each cluster, the distribution that yields the most precise flood quantile estimates. This can be done with L-moment ratio diagrams, in which L-skewness and L-kurtosis values are compared. Five three-parameter distributions, namely the generalized logistic (GLO), generalized extreme value (GEV), generalized Pareto (GPA), generalized normal (GNO) and Pearson Type III (PE3) distributions, are plotted on the diagram, and the distribution whose line lies closest to most of the data points is selected. To select the distribution more precisely than the graphical representation allows, a goodness-of-fit test is applied to the data points, given by the Zdist value in Equation (1):
$$Z^{dist} = \frac{K^{dist} - K^{R} + B^{R}}{\sigma^{R}} \qquad (1)$$
where dist denotes any one of the 5 distributions, Kdist is the L-kurtosis of the fitted distribution, KR is the regional average L-kurtosis weighted by the record length of the sites, BR is the bias of the sample L-kurtosis and σR is its standard deviation. A Zdist value close to 0 indicates a satisfactory fit. If |Zdist| ≤ 1.64, the distribution is said to lie within the critical level. Regional flood frequency relationships are needed to estimate the flood quantiles corresponding to recurrence intervals of 50, 100, 200 and 500 years. The quantiles for each site are obtained using the parameter M, the vector of non-exceedance probabilities 1 − 1/T: M(0.98) denotes a return period of 50 years, M(0.99) 100 years, M(0.995) 200 years and M(0.998) 500 years. Regional estimates are derived by treating one site at a time as ungauged and all other sites in the group as gauged. Each site's quantile function is obtained by multiplying the site mean by the regional growth curve. The growth curve qi for site i is given by Equation (2):
$$q_i(F) = \frac{Q_i(F)}{\mu_i} \qquad (2)$$
Here, Qi is the quantile function of site i and µi is the site mean, which also serves as the index flood. Thus, the flood estimate for each site is the product of the normalized regional quantile and the first L-moment, i.e. the mean, of the site. Clusters with the same frequency distribution are assumed to give identical results. The regional flood quantile estimate for each site i at a return period of T years is obtained from Equation (3):
$$\hat{Q}_i(T) = \mu_i \, \hat{q}(T) \qquad (3)$$
where q̂(T) is the regional growth curve evaluated at the T-year return period.
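The selection rule and the index-flood scaling described above reduce to a few lines of arithmetic. The sketch below assumes the standard Hosking-Wallis form of the Z statistic (distribution L-kurtosis minus bias-corrected regional L-kurtosis, divided by its standard deviation); all function names and the numerical inputs are hypothetical and serve only to illustrate the steps.

```python
def z_statistic(k_dist, k_regional, bias, sigma):
    """Goodness-of-fit Z for one candidate distribution (assumed standard form)."""
    return (k_dist - k_regional + bias) / sigma

def select_distribution(k_by_dist, k_regional, bias, sigma, critical=1.64):
    """Return (name, Z) of the acceptable candidate with the smallest |Z|, or None."""
    scores = {name: z_statistic(k, k_regional, bias, sigma)
              for name, k in k_by_dist.items()}
    accepted = {name: z for name, z in scores.items() if abs(z) <= critical}
    if not accepted:
        return None
    best = min(accepted, key=lambda name: abs(accepted[name]))
    return best, accepted[best]

def non_exceedance_probability(return_period_years):
    """Probability M associated with a T-year return period: 1 - 1/T."""
    return 1.0 - 1.0 / return_period_years

def site_quantiles(site_mean, growth_curve):
    """Scale a regional growth curve {T: q(T)} by the site mean (index flood)."""
    return {t: site_mean * q for t, q in growth_curve.items()}

# Hypothetical inputs, not values from the study
candidates = {"GEV": 0.19, "GLO": 0.24, "PE3": 0.15}
choice = select_distribution(candidates, k_regional=0.18, bias=0.005, sigma=0.02)
probabilities = [non_exceedance_probability(t) for t in (50, 100, 200, 500)]
floods = site_quantiles(100.0, {50: 2.0, 100: 2.5})
```

Taking the candidate with the smallest absolute Z mirrors the statement that values closer to 0 indicate a better fit.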

Results and discussion
In this study, 3 of the 145 sites had discordancy values exceeding the critical value of 3 under the discordancy measure discussed in the previous section. Site number 3364570 differs distinctly from the other sites, with a discordancy value of 7.47, far above the critical value. The discordancy values of site numbers 3342180 and 3335700 are only slightly greater than the critical value, lying at the border level. These 3 sites are therefore removed from the group, and the remaining 142 sites are considered for the cluster analysis. Table 1 lists the 3 discordant sites with their site number, record size, mean annual maximum flood, L-CV (coefficient of variation), L-skewness, L-kurtosis and discordancy measure.
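The screening step above can be illustrated with a short sketch of the discordancy measure, assuming the usual Hosking-Wallis definition based on each site's vector of L-CV, L-skewness and L-kurtosis. The function name `discordancy` and the synthetic data are assumptions for illustration only; no values from Table 1 are reproduced.

```python
import numpy as np

def discordancy(ratios):
    """Discordancy D_i for an (N, 3) array of [L-CV, L-skewness, L-kurtosis].

    D_i = (N / 3) * (u_i - u_mean)^T A^{-1} (u_i - u_mean),
    where A is the sum of outer products of the deviations.
    """
    u = np.asarray(ratios, dtype=float)
    n = u.shape[0]
    dev = u - u.mean(axis=0)
    a_inv = np.linalg.inv(dev.T @ dev)
    return n / 3.0 * np.einsum("ij,jk,ik->i", dev, a_inv, dev)

# Synthetic region: 20 similar sites plus one clearly unusual site
rng = np.random.default_rng(0)
sites = rng.normal(loc=[0.30, 0.20, 0.15], scale=0.01, size=(20, 3))
sites = np.vstack([sites, [0.60, 0.60, 0.50]])
d = discordancy(sites)
discordant = np.flatnonzero(d > 3.0)  # indices of sites above the critical value
```

By construction the D values of a region average exactly 1, so a site well above 3 stands out clearly from its group.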

Formation of Homogeneous Groups
The PAM clustering approach is applied to the 142 sites of the Indiana watershed to form 6 clusters, which are found to be "Definitely Homogeneous". The heterogeneity measure test was carried out to establish the homogeneity status of each cluster (Table 2). Fig. 1 shows the distribution of all the sites across the clusters into which they were partitioned and which were found to be homogeneous. This is the most difficult step in regional flood frequency analysis, as it requires detailed knowledge of all the site characteristics. The algorithm identifies k representative objects, the medoids, which serve as the cluster centers. Silhouette width plays a major role in forming the regions. If the silhouette width is positive, the site belongs to its assigned cluster. If it is negative, the site belongs to the neighboring cluster. If it is 0, the site lies on the border between the two clusters. Based on the widths, sites are adjusted so that the clusters form homogeneous regions.
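The silhouette rule described above can be made concrete with a small sketch. It assumes the usual definition s = (b − a) / max(a, b), where a is the mean dissimilarity of a site to the other members of its own cluster and b is the smallest mean dissimilarity to any other cluster; the function name `silhouette_widths` is illustrative, and at least two clusters are assumed.

```python
import numpy as np

def silhouette_widths(dist, labels):
    """Silhouette width of every object from a dissimilarity matrix and labels.

    Objects in singleton clusters are given a width of 0.
    Assumes at least two distinct cluster labels.
    """
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    widths = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        own[i] = False  # exclude the object itself
        if not own.any():
            continue
        a = dist[i, own].mean()
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        widths[i] = 0.0 if denom == 0 else (b - a) / denom
    return widths

# Toy example: the third site is deliberately placed in the wrong cluster
points = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
d = np.abs(points[:, None] - points[None, :])
good = silhouette_widths(d, [0, 0, 0, 1, 1, 1])  # all widths positive
bad = silhouette_widths(d, [0, 0, 1, 1, 1, 1])   # third width negative
```

A negative width flags a site that is, on average, closer to a neighboring cluster, which is the signal used to reassign sites.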

Flood frequency relationship
In Figure 2, most of the data points lie close to the GEV distribution line for clusters 1, 2, 3 and 6, whereas the sites lie closer to the GLO distribution line for cluster 4, and PE3 is the best distribution for cluster 5, where it attracts the most sites towards its line. L-moment ratio diagrams thus indicate the appropriate distribution pictorially, by comparing the L-skewness and L-kurtosis values of the sites in a cluster, rather than numerically. The regional flood frequency relationship is developed to estimate flood quantiles at the probabilities 0.98, 0.99, 0.995 and 0.998, corresponding to return periods of 50, 100, 200 and 500 years respectively. The product of the site mean and the regional growth curve gives the quantile function for each site. Regional estimates are derived by treating one site at a time as ungauged; the sites are then combined and their values substituted in Equation (2) to form the regional growth curves. At-site and regional quantiles are compared using the GEV, GLO and PE3 distributions.
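The leave-one-out construction of a regional growth curve can be sketched for the GEV case. The example assumes record-length-weighted regional averages of the L-moment ratios and Hosking's approximation for the GEV shape parameter, with the regional mean scaled to 1; the function names and the site values are hypothetical, and the procedure actually used in the study may differ in detail.

```python
import math

def regional_average(values, record_lengths, leave_out=None):
    """Record-length-weighted mean of a site statistic, optionally skipping one site."""
    pairs = [(v, n) for i, (v, n) in enumerate(zip(values, record_lengths))
             if i != leave_out]
    return sum(v * n for v, n in pairs) / sum(n for _, n in pairs)

def gev_from_lmoments(l1, l2, t3):
    """GEV location xi, scale alpha, shape k from L-moments (Hosking's approximation)."""
    c = 2.0 / (3.0 + t3) - math.log(2.0) / math.log(3.0)
    k = 7.8590 * c + 2.9554 * c * c
    if abs(k) < 1e-8:  # Gumbel limit
        alpha = l2 / math.log(2.0)
        return l1 - 0.5772156649 * alpha, alpha, 0.0
    gam = math.gamma(1.0 + k)
    alpha = l2 * k / ((1.0 - 2.0 ** (-k)) * gam)
    xi = l1 - alpha * (1.0 - gam) / k
    return xi, alpha, k

def gev_quantile(f, xi, alpha, k):
    """GEV quantile at non-exceedance probability f."""
    if k == 0.0:
        return xi - alpha * math.log(-math.log(f))
    return xi + alpha / k * (1.0 - (-math.log(f)) ** k)

# Hypothetical cluster of four sites; site 0 is treated as ungauged
l_cv = [0.30, 0.34, 0.28, 0.32]
l_skew = [0.22, 0.27, 0.25, 0.24]
years = [25, 40, 33, 52]
t_reg = regional_average(l_cv, years, leave_out=0)
t3_reg = regional_average(l_skew, years, leave_out=0)
xi, alpha, k = gev_from_lmoments(1.0, t_reg, t3_reg)  # regional mean scaled to 1
growth = {t: gev_quantile(1.0 - 1.0 / t, xi, alpha, k) for t in (50, 100, 200, 500)}
```

Multiplying each entry of `growth` by the mean annual flood of the target site then gives its regional quantile estimates.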
The relationship between the at-site and regional quantiles using the GEV, GLO and PE3 distributions for the return periods T = 50, 100, 200 and 500 years shows that the data points lie close to the solid 1:1 line. The medoid-based approach classifies the streams in Indiana into 6 clusters, and the L-moment approach is applied for quantile derivation. According to the performance ratings of each metric, the study gives very good results: the NSE, d, r and R² values are all close to 1.
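The four performance metrics can be computed from paired at-site and regional quantiles as in the sketch below. It assumes the common definitions of the Nash-Sutcliffe efficiency (NSE), Willmott's index of agreement (d), the Pearson correlation coefficient (r) and R² taken as r squared; the function name and the sample numbers are illustrative, not results from the study.

```python
import numpy as np

def performance_metrics(observed, predicted):
    """Return NSE, index of agreement d, Pearson r and R^2 for paired values."""
    o = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    sse = np.sum((o - p) ** 2)
    nse = 1.0 - sse / np.sum((o - o.mean()) ** 2)
    d = 1.0 - sse / np.sum((np.abs(p - o.mean()) + np.abs(o - o.mean())) ** 2)
    r = np.corrcoef(o, p)[0, 1]
    return {"NSE": nse, "d": d, "r": r, "R2": r ** 2}

# Hypothetical at-site vs. regional quantiles for a handful of sites
at_site = [410.0, 820.0, 1530.0, 260.0, 990.0]
regional = [430.0, 790.0, 1490.0, 280.0, 1010.0]
scores = performance_metrics(at_site, regional)
```

All four metrics equal 1 for a perfect match, which is why values close to 1 are read as good agreement with the 1:1 line.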

Results and Discussion
Regional flood frequency analysis using the PAM clustering approach was applied to 145 sites in the study area, which were classified into 6 homogeneous clusters with all H values below 0. A goodness-of-fit measure was applied to the data sets to find the distribution best suited to estimating the flood quantiles. The Z statistic identified GEV as the most suitable distribution for clusters 1, 2, 3 and 6, with values close to zero, and GLO and PE3 as the best distributions for clusters 4 and 5 respectively.
The performance of the approach was validated by the metrics NSE, index of agreement, Pearson correlation coefficient and coefficient of determination, all of which take values close to 1. Based on these results, the medoid-based clustering approach proved to be a robust method compared with traditional k-means clustering, which suffers from drawbacks in dealing with noise and outliers. The regional flood quantiles for the ungauged sites were derived from L-moments with good precision bounds for recurrence intervals of 50, 100, 200 and 500 years. The computation of flood quantiles is thus very important for civil engineering structures built at or near water bodies; the rarity of a flood is expressed by its return period, which is closely tied to flood control.
Landslide processes depend strongly on hydrological factors. Data on historical landslides in the investigated territory can be found in [29], and information on current landslides is presented in the Indiana Standard Hazard Mitigation Plan (SHMP, Indiana Department of Homeland Security, Polis Center of Indiana University, 2019). The compiled map in Fig. 3 shows that landslides also tend to have a cluster structure. Most of them are concentrated at the intersection of clusters 5 and 6; the next largest concentration is in the region of clusters 2 and 3, which have Mississippian, Ordovician and Silurian bedrock geology. Other events may be considered background ones. Thus, landslides correlate with flood quantiles, depending on geology and topography. Applying modern complementary approaches, including the one considered here, in integration with the methods of system analysis leads to a multifactor methodology for estimating exogenous geological events and developing an effective mitigation system.