Detection of debris-flow events from seismic signals using Benford's law

Introduction
Debris flows are hazards that occur in channels (e.g., gullies, ravines, and valleys) with massive destructive power [1,2]. Despite significant efforts to mitigate debris flow hazards through warning systems, risk assessment, and structural measures [3][4][5], extensive damage and casualties still cannot be prevented due to the complex geological conditions and dynamic processes governing debris flows [6]. Early warning is a promising approach to reducing debris flow hazards, which is becoming more precise and reliable as data acquisition and transmission improve.
Debris flow warning systems can employ two types of sensors to collect data: those that measure indicator parameters and those that monitor the flow dynamics [7]. Traditional rainfall-driven systems belong to the former class and require locally tailored thresholds to trigger alarms [8]. However, obtaining rainfall data and maintaining monitoring devices is not straightforward for catchments with large elevation differences and multiple sediment supply areas [9]. In the case of Illgraben (Switzerland), where we conducted the present study, the locally convective short-duration storms are difficult to monitor but are sufficient to trigger debris flows [10]. Continuous seismic signals belong to the second sensor type and offer a relatively new opportunity to monitor debris flows with high temporal resolution. Coviello et al. [11] proposed an early warning system based on an automatic debris flow detector using seismic waveforms and a short-term average to long-term average ratio (STA/LTA) trigger algorithm. Chmiel et al. [12] combined the random forest algorithm and seismic waveform features to detect debris flow fronts and identify when they pass over check dams. In addition, seismic networks can contribute to detecting and locating large mass movements on a regional scale [13]. However, a seismic station senses all ground shaking within its bandwidth, so the recorded signal blends events of interest with ambient noise. Only a small fraction of seismic data corresponds to geomorphic processes, which makes labelling events a time-consuming task requiring expert knowledge. Present seismic-driven early warning methods require labelled data to train a detection model, and the transferability of these models to other study sites where little or no training data exist is not guaranteed.
* Corresponding author: qi.zhou@gfz-potsdam.de
Benford's law (BL) is a statistical phenomenon describing the probability distribution of the first digit in a dataset. It originated from the observation that the first few pages of logarithm tables were more worn than the others. Newcomb [14] stated that the probability of the first digit is such that the mantissae of their logarithms are equally probable. BL was rediscovered and tested on 20 different data domains by the physicist Frank Benford [15], after whom it is named. BL gives the probability of the first digit:
$$P_D = \log_{10}\left(1 + \frac{1}{D}\right) \quad (1)$$
where P_D is the theoretical probability of the first nonzero digit D, which ranges from 1 to 9. For example, 0.01, 100, and -1 all share the first digit 1, which occurs with a probability of 0.301.
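As a small illustration of Eq. (1), the following Python sketch computes the theoretical probabilities and extracts the first nonzero digit of a number. The function names `benford_probability` and `first_digit` are hypothetical helpers introduced here for illustration and are not part of the original study.

```python
import math


def benford_probability(d: int) -> float:
    """Theoretical probability of first nonzero digit d under Benford's law."""
    if not 1 <= d <= 9:
        raise ValueError("d must be in 1..9")
    return math.log10(1.0 + 1.0 / d)


def first_digit(x: float) -> int:
    """First nonzero digit of x, ignoring sign and scale (hypothetical helper)."""
    if x == 0:
        raise ValueError("zero has no first nonzero digit")
    # Scientific notation places the first significant digit first,
    # e.g. 0.01 -> '1.000000000000000e-02'.
    return int(f"{abs(x):.15e}"[0])


probabilities = {d: benford_probability(d) for d in range(1, 10)}

# 0.01, 100 and -1 all share the first digit 1.
digits = [first_digit(v) for v in (0.01, 100, -1)]
```

The nine probabilities sum to one, and digit 1 is about 6.6 times more likely than digit 9.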
A classical mathematical explanation of why the first digits of a dataset follow BL is that, after a sufficiently long computation in floating-point arithmetic, the occurring mantissas have a nearly logarithmic distribution [16]. For more information, we refer to [17][18]. Although BL is base- and scale-invariant [19], not all datasets follow it. Durtschi et al. [20] summarized guidelines on when data may conform to BL: (1) the data are unrestricted (human height and weight, e.g., are restricted) and not assigned (check numbers and invoice numbers, e.g., are assigned), and (2) the dataset is large and spans several orders of magnitude (e.g., an array from 1 to 10^7 with a sample size of 1.2×10^5). BL has typically been applied to detect data anomalies [21], and some researchers have applied it in finance and sociology [22][23]. Recent work has applied BL to geoscience data, for example to investigate the homogeneity and anomalies of natural hazard datasets [24][25]. It has also been demonstrated that earthquakes and marsquakes are detectable with BL using seismic waveforms [26][27].
In this work, we calculate the first digit distribution of seismic signals (generated by debris flows, floods, and other surface processes [28][29]) and validate the compliance with BL. Then, an event detector model based on seismic signals and BL is introduced with the example of debris flow.

Study site
The Illgraben catchment, located near the village of Leuk in southwestern Switzerland (Fig. 1), covers an area of about 9.5 km^2 and ranges in elevation from the Rhône River at 610 m to the Illhorn mountain at 2716 m [30]. In the upstream channel trunk, the limestones (north side) and quartzites (south side) are susceptible to erosion and weathering, which provides debris flow material [31]. The Illgraben covers less than 0.2% of the Rhône Valley, but it contributes more than 5% of the annual sediment to the Rhône basin [32]. The local annual rainfall is concentrated from May to October, and debris flows are mainly triggered by short-duration storms [10]. Three to five debris flows and several floods are observed at the Illgraben catchment each year. Because seismic energy generation is concentrated at the debris flow front [33], the proportion of noise in the recorded signal increases with the distance between the flow front and the seismic station. Here, only station IGB02, which is closest to the channel and far from the nearby residential area of Leuk, was selected as the data provider.

Event catalog
From 2013 to 2014, 24 debris flow events (true positive events, TPE) were collected, one of which may be a flood. Ten of these events were observed by the warning systems of the Swiss Federal Institute for Forest, Snow and Landscape Research (henceforth referred to as "WSL"). A debris flow example with a WSL label is shown in Fig. 2. Based on amplitude durations in the waveform and spectral features between 1 and 50 Hz, the remaining 14 debris flow events were manually labelled by us (henceforth referred to as "GFZ"). Unfortunately, no other data for these 14 events were available for cross-validation. The ratio of the time covered by TPE (about 160 hours in total) to the time covered by non-TPE is 1:58. To create a true negative event (TNE) catalog (all non-debris flow events), we selected 1200 TNE (50 times the number of TPE) with random start times and random durations (from 20 minutes to 6 hours) outside the TPE periods. Finally, a test dataset containing 1224 events was created. In addition, one rockfall event (rockfall1) was collected in Illgraben. Outside the Illgraben catchment, one landslide in Iceland [28], one rockfall (rockfall2) in Germany [29], and one bedload transport event in Taiwan (ROC) were collected to test for compliance with BL.
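The random TNE selection described above can be sketched with simple rejection sampling. This is a minimal sketch under stated assumptions and not the procedure actually used in the study: start times and durations are assumed to be drawn uniformly, TNE windows are allowed to overlap each other, all times are in seconds, and the function name `sample_tne_catalog` and its parameters are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def sample_tne_catalog(
    record_start: float,
    record_end: float,
    tpe_intervals: Sequence[Interval],
    n_events: int = 1200,
    min_duration: float = 20 * 60,    # 20 minutes
    max_duration: float = 6 * 3600,   # 6 hours
    seed: int = 0,
    max_attempts: int = 1_000_000,
) -> List[Interval]:
    """Draw random windows that do not overlap any TPE interval."""
    rng = random.Random(seed)
    catalog: List[Interval] = []
    for _ in range(max_attempts):
        if len(catalog) == n_events:
            return catalog
        duration = rng.uniform(min_duration, max_duration)
        start = rng.uniform(record_start, record_end - duration)
        end = start + duration
        # Assumption: a candidate is rejected if it touches any TPE period.
        if all(end <= s or start >= e for s, e in tpe_intervals):
            catalog.append((start, end))
    raise RuntimeError("could not draw enough windows outside the TPE periods")
```

Fixing the seed keeps the catalog reproducible, which matters when detector thresholds are later compared on the same test dataset.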

Validate natural hazard's conformity to BL
The vertical component of the raw waveform data, without instrument response deconvolution, demeaning, detrending, filtering, or tapering, is used to validate the conformity of events to BL. To this end, the first digits are counted within a 60 s sliding window. Raw amplitudes equal to zero are discarded in each window. To compare the observed first digit distribution with the theoretical BL distribution, we use the goodness of fit (Ф) [26] and introduce the Shannon entropy (H) [34] to assist the assessment and detection (Eqs. 2-3):
$$\Phi = \left(1 - \sqrt{\sum_{d=1}^{9} \frac{(p_d - P_d)^2}{P_d}}\right) \times 100 \quad (2)$$
$$H(D) = -\sum_{d=1}^{9} p_d \log_b p_d \quad (3)$$
where P_d and p_d are the theoretical and observed probabilities of the first digit d, and H(D) is the Shannon entropy over D={1, 2, …, 8, 9}. When b=2, the unit of H(D) is bits or shannons. H describes the relationship between information and uncertainty: the higher the occurrence probability, the less information an event carries. In our case, noise dominates most of the year, so H is expected to increase from a true negative event to a true positive event. The theoretical maximum of H is 3.17 bits, which is reached when the first digits (1-9) are uniformly distributed with a probability of 1/9 each.
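The two window metrics can be sketched in Python as follows. This is an illustrative sketch, not the study's code: it assumes the goodness of fit takes the form Ф = (1 − sqrt(Σ (p_d − P_d)^2 / P_d)) × 100 following [26], and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence

# Theoretical Benford probabilities P_d for d = 1..9.
BENFORD: List[float] = [math.log10(1.0 + 1.0 / d) for d in range(1, 10)]


def first_digit_distribution(samples: Sequence[float]) -> List[float]:
    """Observed probabilities p_d of the first digits; zero samples are discarded."""
    digits = [int(f"{abs(s):.15e}"[0]) for s in samples if s != 0]
    if not digits:
        raise ValueError("window contains no nonzero samples")
    counts = Counter(digits)
    return [counts.get(d, 0) / len(digits) for d in range(1, 10)]


def goodness_of_fit(observed: Sequence[float]) -> float:
    """Assumed form of the goodness of fit, in percent; 100 is a perfect fit."""
    misfit = sum((p - P) ** 2 / P for p, P in zip(observed, BENFORD))
    return (1.0 - math.sqrt(misfit)) * 100.0


def shannon_entropy(observed: Sequence[float], base: float = 2.0) -> float:
    """Shannon entropy of the first digit distribution (bits when base is 2)."""
    return -sum(p * math.log(p, base) for p in observed if p > 0)


uniform = [1.0 / 9.0] * 9
max_entropy = shannon_entropy(uniform)  # equals log2(9)
```

With this form, a window whose samples all start with the same digit gives an entropy of 0 and a negative goodness of fit, while a window that matches BL exactly gives a goodness of fit of 100.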

Detector with BL method
Two indicators are used to design a debris flow detector with BL: the amplitude range ar and the goodness of fit Ф of BL (Fig. 3a). The ar is more concentrated during the TNE (noise) (Fig. 3b), varying by only 1-2 orders of magnitude, while the amplitude can span several orders of magnitude during the TPE (event). ar is defined in Eq. (4), where Q denotes the amplitude values sorted from smallest to largest within the sliding window, and Q1 and Q100 are the averages of the smallest and largest one percent of these values, respectively. A sliding window (Fig. 3a) scans ar and Ф at 60 s intervals. If ar at time i is greater than the start threshold (5000 bits), a potential event is recorded until ar at time j falls below the off threshold (1500 bits). When the potential event duration (t_j − t_i) is greater than 20 and Ф is greater than a given threshold g, the potential event is marked as predicted positive. If these conditions are not met anywhere within the test event, the test event is marked as predicted negative. The 1224 events in the test dataset are extracted one by one for model testing. The tested ranges of ar and Ф are [3×10^3, 1×10^5] and [0, 80], with step sizes of 10^4 and 20, respectively. We combine the ar and Ф thresholds orthogonally and feed each combination into the model for testing.
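The triggering logic described above can be sketched as a small state machine. Several points are assumptions made only for this illustration: ar is taken as the mean of the largest one percent of amplitudes minus the mean of the smallest one percent, the duration threshold of 20 is counted in 60 s window steps, and the Ф condition is checked against the maximum Ф within the potential event. The function names are hypothetical.

```python
from typing import Optional, Sequence


def amplitude_range(window: Sequence[float]) -> float:
    """ASSUMED form of ar: mean of the top 1% minus mean of the bottom 1%."""
    if not window:
        raise ValueError("empty window")
    q = sorted(window)
    k = max(1, len(q) // 100)       # number of samples in one percent
    q1 = sum(q[:k]) / k             # average of the smallest one percent
    q100 = sum(q[-k:]) / k          # average of the largest one percent
    return q100 - q1


def is_predicted_positive(
    ar: Sequence[float],
    phi: Sequence[float],
    g: float,
    start: float = 5000.0,
    off: float = 1500.0,
    min_steps: int = 20,            # ASSUMED unit: 60 s window steps
) -> bool:
    """Return True if any potential event is long enough and fits BL well."""

    def qualifies(i: int, j: int) -> bool:
        # ASSUMPTION: the Phi test uses the maximum within the potential event.
        return (j - i) > min_steps and max(phi[i:j]) > g

    onset: Optional[int] = None
    for t, value in enumerate(ar):
        if onset is None:
            if value > start:       # potential event starts
                onset = t
        elif value < off:           # potential event ends
            if qualifies(onset, t):
                return True
            onset = None
    # A potential event may still be open at the end of the record.
    return onset is not None and qualifies(onset, len(ar))
```

Using a lower off threshold than start threshold gives the trigger hysteresis, so brief dips in ar during an event do not split it into several short potential events.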
We use a confusion matrix to assess the detector performance. The detector outputs are divided into four categories (Fig. 4). We employ the true positive ratio (TPR) and the false positive ratio (FPR) to evaluate the detector model performance.
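For reference, the two ratios follow from the confusion matrix counts in the standard way: TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The Python sketch below uses illustrative counts that only match the catalog size (24 TPE and 1200 TNE); they are hypothetical and are not reported results.

```python
from typing import Tuple


def tpr_fpr(tp: int, fn: int, fp: int, tn: int) -> Tuple[float, float]:
    """True positive ratio and false positive ratio from confusion matrix counts."""
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one positive and one negative event")
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical counts for a catalog of 24 TPE and 1200 TNE.
tpr, fpr = tpr_fpr(tp=18, fn=6, fp=12, tn=1188)
```

Because the TNE outnumber the TPE by a factor of 50, even a small FPR can correspond to a number of false alarms comparable to the number of detected events.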

Preliminary results and concluding remarks
Our preliminary results show that the first digit distribution of seismic signals generated by debris flows, landslides, and bedload transport follows BL, while our two rockfall cases do not conform to BL (Fig. 5). Moreover, a single flood case does not allow rigorous conclusions to be drawn. When Ф is less than 0 and the Shannon entropy H is close to 0, the event cannot be considered a high-energy mass movement such as a debris flow or landslide. The detector model has an optimal TPR of 0.75 and FPR of 0.01, and it could be improved by further parameter sensitivity analysis in the future. The seismic signals from long-duration and high-energy mass movements (e.g., debris flows and landslides) conform to BL (e.g., when the event starts in Fig. 2). This phenomenon provides a new potential approach for rapid and relatively accurate filtering of events from seismic signals. In future work, we will explore why BL appears in seismic signals generated by some processes but not by others. Understanding conformity with BL will provide important insights into the performance of seismic mass movement monitoring. Furthermore, the debris flow detector is not yet efficient and reliable enough. It will be necessary to improve our detector model and compare it with other approaches, such as the short-term average to long-term average ratio (STA/LTA) and the random forest model.