Tracing the value of data for flood loss modelling

Flood loss modelling is associated with considerable uncertainty. If the prediction uncertainty of flood loss models is large, the reliability of model outcomes is questionable, which challenges their practical usefulness. A key problem in flood loss estimation is the transfer of models to geographical regions and to flood events that may differ from the ones used for model development. Variations in local characteristics and continuous system changes require regional adjustments and continuous updating with current evidence. However, acquiring data on damage-influencing factors is usually very costly. It is therefore relevant to assess the value of additional data in terms of model performance improvement. We use empirical flood loss data on direct damage to residential buildings available from computer-aided telephone interviews that were compiled after major floods in Germany. This unique data base allows us to trace the changes in predictive model performance by incrementally extending the data base used to derive flood loss models. Two models are considered: a uni-variable stage-damage function and RF-FLEMO, a multi-variable probabilistic model approach using Random Forests. Additional data are useful to improve model predictive performance and increase model reliability; however, the gains also seem to depend on the model approach.


Introduction
Flood loss modelling is associated with considerable uncertainty, which is due to incomplete knowledge about the damaging process and to the inherent variability of the quantities involved [1]. Given the large uncertainty in flood loss model predictions, the reliability of model outcomes is questionable, which challenges the practical usefulness of model results, particularly when it affects the quality of decisions, for instance on investment in flood defences [2]. Therefore, it is highly important to complement model outcomes with quantitative information about prediction uncertainty [3]. Compared with traditional stage-damage functions, which simply relate flood loss to inundation depth, multi-variable flood loss models take additional factors into account, for instance building characteristics, precaution and contamination, and thus better explain the variability of observed flood loss data [4,5]. In spite of this, uncertainty ranges of flood loss predictions are still large, and thus probabilistic modelling approaches which take uncertainty into account and provide quantitative information about model prediction uncertainty are required [3].
A key problem in flood loss estimation is the transfer of models to geographical regions and to flood events that may differ from the ones used for model development [6]. Variations in local characteristics and continuous system changes require regional adjustments by updating the model with local evidence [3].
In this light, the demand for more and systematically collected data is an obvious conclusion. However, the acquisition of information on flood loss and influencing factors is laborious and costly. It is therefore relevant to assess the value of additional data in terms of model reliability improvement.
We use empirical flood loss data on direct damage to residential buildings available from computer-aided telephone interviews (CATI) that were compiled after major floods in Germany [7]. This unique data base allows us to trace the changes in predictive model performance and reliability within a split-sample validation test by incrementally extending the data base used to derive flood loss models. Further, it offers the possibility to gain insight into the benefit of incorporating local evidence into a flood loss model. To study the implications of additional data for model prediction uncertainty, the analysis is conducted for probabilistic flood loss modelling approaches.

Empirical flood loss data
Empirical data on direct flood damage to residential buildings and related damage-influencing variables are available from CATI campaigns that were carried out after the floods in 2002, 2005, 2006, 2010, 2011 and 2013 in Germany. A compilation of loss cases broken down by event and river basin is provided in the accompanying table. Within the CATI campaigns, a broad range of information was gathered covering damage-influencing aspects related to flood impacts, building characteristics, socio-economic status, precaution and early warning. From this extensive data set, 28 candidate variables were preselected to be used in a modelling context for predicting the relative loss ratio of residential buildings (rloss). These candidate variables were selected according to experience from previous analyses [7,8]. In this study, the selection of variables for flood loss modelling is narrowed further based on the out-of-bag feature importance as an indicator of the relevance of individual variables [8]. Accordingly, the variables water depth (wst), building value (bv), floor space of building (fsb), contamination indicator (con), return period of flood peak discharge (rp), inundation duration (d), precautionary measures indicator (pre), emergency measures indicator (em), age of interviewed person (age) and indicator of flood warning information (wi) are used to predict rloss. Further details about the variables are documented in Merz et al. 2013 [8].

Loss models
Two flood loss models are considered: (i) a uni-variable stage-damage function and (ii) the multi-variable RF-FLEMO, which is based on the machine learning technique of random forests [9].
The model structure of the stage-damage function (sdf) for the estimation of rloss is defined as a two-parameter (a, b) square root function of water depth (wst) as given in Equation 1:

\[ \mathit{rloss} = a + b\,\sqrt{\mathit{wst}} \tag{1} \]
To evaluate predictive uncertainty, the sdf model is cast in a Bayesian modelling framework using a Markov chain Monte Carlo (MCMC) approach for Gaussian linear regression with Gibbs sampling. Within this framework, the posterior distributions of the model parameters a and b are used to sample the predictive distribution of rloss, which describes model predictive uncertainty. Calculations are carried out using the R-package MCMCpack [10].
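The Bayesian treatment of the sdf can be illustrated with a minimal Gibbs sampler. This is a sketch in plain Python, not the MCMCpack implementation used in the study. It assumes a flat prior on (a, b) and the prior p(sigma2) proportional to 1/sigma2, neither of which is stated in the text. The function names gibbs_sdf and predictive_sample are hypothetical.

```python
import math
import random

def gibbs_sdf(wst, rloss, n_iter=2000, burn_in=500, seed=1):
    """Gibbs sampler for rloss = a + b*sqrt(wst) + eps, eps ~ N(0, sigma2).

    Assumed priors (not from the source): flat on (a, b), p(sigma2) ~ 1/sigma2.
    Returns a list of posterior draws (a, b, sigma2) after burn-in.
    """
    rng = random.Random(seed)
    x = [math.sqrt(w) for w in wst]
    n = len(x)
    # Sufficient statistics of the 2x2 normal equations X'X and X'y
    sx, sxx = sum(x), sum(v * v for v in x)
    sy, sxy = sum(rloss), sum(v * y for v, y in zip(x, rloss))
    det = n * sxx - sx * sx
    # Least-squares estimate = mean of the conditional posterior of (a, b)
    a_hat = (sxx * sy - sx * sxy) / det
    b_hat = (n * sxy - sx * sy) / det
    # Entries of (X'X)^-1 and its lower Cholesky factor
    v_aa, v_ab, v_bb = sxx / det, -sx / det, n / det
    l11 = math.sqrt(v_aa)
    l21 = v_ab / l11
    l22 = math.sqrt(v_bb - l21 * l21)
    sigma2 = 1.0  # arbitrary starting value, discarded with the burn-in
    draws = []
    for it in range(n_iter):
        # Step 1: (a, b) | sigma2, data ~ N((a_hat, b_hat), sigma2 * (X'X)^-1)
        s = math.sqrt(sigma2)
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        a = a_hat + s * l11 * z1
        b = b_hat + s * (l21 * z1 + l22 * z2)
        # Step 2: sigma2 | a, b, data ~ Inverse-Gamma(n/2, SSR/2)
        ssr = sum((y - a - b * v) ** 2 for v, y in zip(x, rloss))
        sigma2 = 1.0 / rng.gammavariate(n / 2.0, 2.0 / ssr)
        if it >= burn_in:
            draws.append((a, b, sigma2))
    return draws

def predictive_sample(draws, wst_new, rng):
    """One predictive rloss value per posterior draw for a new water depth."""
    root = math.sqrt(wst_new)
    return [a + b * root + rng.gauss(0.0, math.sqrt(s2)) for a, b, s2 in draws]
```

Quantiles of the list returned by predictive_sample then give a predictive interval for rloss at the chosen water depth. A Gaussian error term can produce values outside [0, 1], so the sketch ignores the bounded nature of a loss ratio.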
The multi-variable model RF-FLEMO is built using the machine learning technique of random forests (RF). An RF is an ensemble of regression trees (RT) derived by generating many bootstrap replicas of the data set and growing an RT on each replica. RTs are tree-building algorithms for predicting continuous dependent variables [11]. They recursively sub-divide the predictor data space into smaller regions in order to approximate a nonlinear regression structure. At each split, the data set is partitioned into two sub-spaces in such a way that the improvement in predictive accuracy is maximised. Bootstrapping captures the effects of data variability as one source of uncertainty in flood loss modelling [12]. The ensemble of candidate RTs composing the RF represents a variety of model structures and thus reflects model structure uncertainty. The sample of rloss predictions provided by RF-FLEMO represents the prediction uncertainty of the multi-variable modelling framework. RF-FLEMO model derivation and rloss predictions are carried out using the R-package randomForest [13].
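The bootstrap-and-grow procedure described above can be sketched in a few lines of plain Python. This is an illustrative toy, not the randomForest package and not RF-FLEMO itself. The function names, the depth limit, the minimum leaf size and the squared-error split criterion are assumptions. The default of one third of the predictors tried per split follows the usual convention for regression forests.

```python
import random

def _sse(ys):
    """Sum of squared deviations from the mean (node impurity)."""
    m = sum(ys) / len(ys)
    return sum((v - m) ** 2 for v in ys)

def grow_tree(X, y, rng, mtry, min_leaf=5, depth=0, max_depth=6):
    """Grow one regression tree; a leaf is a float, a node is a dict."""
    if depth >= max_depth or len(y) < 2 * min_leaf:
        return sum(y) / len(y)
    best = None
    # Try a random subset of predictors at each split
    for j in rng.sample(range(len(X[0])), mtry):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [yi for row, yi in zip(X, y) if row[j] <= thr]
            right = [yi for row, yi in zip(X, y) if row[j] > thr]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = _sse(left) + _sse(right)  # lower = better split
            if best is None or score < best[0]:
                best = (score, j, thr)
    if best is None:
        return sum(y) / len(y)
    _, j, thr = best
    li = [i for i, row in enumerate(X) if row[j] <= thr]
    ri = [i for i, row in enumerate(X) if row[j] > thr]
    return {
        "j": j,
        "thr": thr,
        "left": grow_tree([X[i] for i in li], [y[i] for i in li],
                          rng, mtry, min_leaf, depth + 1, max_depth),
        "right": grow_tree([X[i] for i in ri], [y[i] for i in ri],
                           rng, mtry, min_leaf, depth + 1, max_depth),
    }

def predict_tree(tree, row):
    """Follow the splits down to a leaf and return its mean."""
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["j"]] <= tree["thr"] else tree["right"]
    return tree

def grow_forest(X, y, n_trees=50, mtry=None, seed=1):
    """Grow each tree on its own bootstrap replica of the data."""
    rng = random.Random(seed)
    mtry = mtry or max(1, len(X[0]) // 3)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(y)) for _ in y]  # sample with replacement
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx],
                                rng, mtry))
    return forest

def predict_forest(forest, row):
    """One prediction per tree; the spread is the predictive sample."""
    return [predict_tree(tree, row) for tree in forest]
```

Returning the per-tree predictions instead of only their mean keeps the sample of rloss predictions from which an uncertainty range can be read.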

Analysis framework
The implications of additional data for model performance are investigated in a split-sample validation test framework. The available data are split into two sub-sets: one is used for model derivation, and the other for an independent evaluation of model predictive performance, also referred to as model validation [14]. Model predictive performance and predictive uncertainty are evaluated by means of a set of performance criteria for accuracy, reliability, sharpness and prediction skill.
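The text names the criteria categories but not the specific measures. As a hedged illustration, the sketch below scores a set of predictive samples with commonly used stand-ins. It uses mean bias error and mean absolute error of the predictive median for accuracy, the hit rate of a central predictive interval for reliability, and the mean interval width for sharpness. These particular measures, the 5 % and 95 % quantile bounds, and the function names are assumptions. Prediction skill is left out.

```python
def quantile(sample, q):
    """Linearly interpolated empirical quantile, 0 <= q <= 1."""
    s = sorted(sample)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def evaluate(pred_samples, observed, lower_q=0.05, upper_q=0.95):
    """Score predictive samples (one list per loss case) against observations."""
    n = len(observed)
    med = [quantile(s, 0.5) for s in pred_samples]
    low = [quantile(s, lower_q) for s in pred_samples]
    upp = [quantile(s, upper_q) for s in pred_samples]
    return {
        # Accuracy: systematic and absolute deviation of the median prediction
        "mbe": sum(m - o for m, o in zip(med, observed)) / n,
        "mae": sum(abs(m - o) for m, o in zip(med, observed)) / n,
        # Reliability: share of observations inside the predictive interval
        "hit_rate": sum(l <= o <= u for l, o, u in zip(low, observed, upp)) / n,
        # Sharpness: average width of the predictive interval
        "mean_width": sum(u - l for l, u in zip(low, upp)) / n,
    }
```

A reliable model has a hit rate close to the nominal coverage of the interval (0.9 for the bounds assumed here), and among reliable models a narrower interval is preferable.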

Split-sample validation experiments
Two validation experiments are designed. The first investigates the value of data within a gradual learning setting. In this case, the amount of data available for model derivation is incrementally increased, and model performance and predictive uncertainty are evaluated using an independent split-sample of the data. This validation sample is randomly drawn from the complete sample and is not used to derive the models.
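The gradual learning setting can be sketched as a simple loop. The validation sample is drawn once at random and held out, and the models are then re-derived on training subsets of increasing size. The function name, the fixed step size and the generic fit and score callables are assumptions for illustration, since the text does not specify how the increments are chosen.

```python
import random

def incremental_split_sample(data, fit, score, n_valid, step, seed=1):
    """Trace model performance as the training sample grows.

    data    : list of loss cases
    fit     : callable(train_cases) -> model
    score   : callable(model, valid_cases) -> performance value
    n_valid : size of the held-out validation sample
    step    : number of cases added per increment (hypothetical choice)
    """
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    # The validation sample is never used for model derivation
    valid, pool = shuffled[:n_valid], shuffled[n_valid:]
    results = []
    for n_train in range(step, len(pool) + 1, step):
        model = fit(pool[:n_train])
        results.append((n_train, score(model, valid)))
    return results

# Usage with a trivial "model" that predicts the mean observed loss ratio
cases = [(w / 10.0, 0.05 + 0.02 * w) for w in range(100)]  # (wst, rloss) pairs
mean_fit = lambda train: sum(r for _, r in train) / len(train)
mae = lambda model, valid: sum(abs(model - r) for _, r in valid) / len(valid)
trace = incremental_split_sample(cases, mean_fit, mae, n_valid=20, step=20)
```

Each entry of trace pairs a training sample size with the score on the same fixed validation sample, so changes in the score can be attributed to the added data.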
The second experiment examines the effect of using local evidence, i.e. region-specific observations, to update a basic model. In this context, basic sdf and RF-FLEMO flood loss models are derived using randomly selected loss cases from the complete data sample; regional data available from specific local data sets of different CATI campaigns are then gradually included in the derivation of the models. Accordingly, the regional

Conclusions
The value of data for the performance and reliability of flood loss predictions has been analysed within incremental split-sample and regional updating validation tests conducted for two probabilistic flood loss models.