Estimation of Right-censored SETAR-type Nonlinear Time-series Model

This paper focuses on estimating the Self-Exciting Threshold Autoregressive (SETAR) type time-series model under right-censored data. As is known, the SETAR model is used when the underlying function of the relationship between the time-series itself ($Y_t$) and its $p$ delays $(Y_{t-j})_{j=1}^{p}$ violates the linearity assumption and this function is formed by multiple behaviors called regimes. This paper addresses the right-censored dependent time-series problem, which has a serious negative effect on estimation performance. Right-censored time series cause biased coefficient estimates and low-quality predictions. The main contribution of this paper is solving the censorship problem for the SETAR model by three different techniques: kNN imputation, which represents the imputation techniques; Kaplan-Meier weights, which are applied through weighted least squares; and synthetic data transformation, which adds the effect of censorship to the modeling process by manipulating the dataset. These solutions are then combined with the SETAR-type model estimation process. To observe the behavior of the nonlinear estimators in practice, a simulation study and a real data example are carried out. The Covid-19 dataset collected in China is used as real data. Results show that although the three estimators perform satisfactorily, the SETAR model estimated with the kNN imputation technique dominates the other two estimators.


Introduction
In the real world, datasets examined using time-series analysis often involve issues, such as censorship and nonlinearity, that directly prevent accurate analysis unless appropriately solved. In practice, these problems in the datasets are generally ignored. For econometric or financial time-series data, it is very common that the variance and autocorrelation of the series may change, which makes it necessary to use nonlinear time-series analysis. Data obtained during financial crises or radical changes in the economic situation of a country can be used as examples of a nonlinear time-series. In addition, we know that due to different reasons, including technical issues, individual mistakes, device errors, and so on, it may not be possible to observe all data points. For instance, for a fixed period of a risk assessment study, borrowers may join the study at various stages, and they may or may not default before the study ends. In this case, the observation is censored from the right randomly (see [1]). In this example, we can see two problems, nonlinearity, and right-censorship, that need to be solved to facilitate accurate analysis of the data.
Regarding nonlinear time-series modeling, there are several types of models, including the bilinear models of Granger and Andersen [2], the threshold autoregressive (TAR) model and its special case, the Self-Exciting TAR (SETAR) model, proposed by Tong [3] and discussed further by Fan and Yao [4], the smooth transition autoregressive (STAR) models of Terasvirta and Anderson [5], and the Markov-switching model introduced by Hamilton [6]. Notice that these models are capable of reflecting the true behavior of the time-series by representing the different regimes in it. This paper considers the estimation of the SETAR model under right-censored time-series data. There are several important studies about estimating SETAR models without censorship; these include Petruccelli [7], Gooijer [8], Milheiro-Oliveira [9], Naik and Mohan [10], and Aydın and Mermi [11], among others. However, in the literature, SETAR models have not been considered under censored time-series. This paper aims to fill that gap.
There have been several important studies about right-censored time-series models, including Park et al. [12], Khardani et al. [13], and Aydın and Yılmaz [14]. These studies do not consider nonlinearity and instead focus on different aspects of modeling. However, they introduce useful solution techniques for the right-censorship problem, such as kNN imputation and Kaplan-Meier weights (see [14]), and synthetic data transformation [15], which is adapted to time-series modeling by [13].
The main purpose of this study is to introduce SETAR model estimation based on the aforementioned three censorship solution techniques: kNN imputation, Kaplan-Meier weights (KMW), and synthetic data transformation (SDT). Both the problems of nonlinearity and censorship can be solved appropriately using these techniques, but the advantages and disadvantages of each of the three estimation procedures are discussed to determine the best procedure for nonlinear and right-censored time-series modeling.
The organization of the paper is as follows: Section 2 introduces the methodology of the solution techniques for the right-censorship problem and their integration into the SETAR model. Section 3 presents the statistical properties of the introduced model estimation. Evaluation metrics are given in Section 4. A simulation study and a real data case study are carried out in Section 5. Conclusions are provided in Section 6.

Material and Methods
To estimate the SETAR model with right-censored time-series data, the censorship needs to be solved first because censorship solution techniques change the observations of the series.
Consider the time-series variable $\{Y_t\}_{t=1}^{n}$ with sample size $n$. In many cases, one may not be able to observe all $Y_t$ values completely. Instead, some $Y_t$ values will be accurately recorded and others will be incomplete or censored. $Y_t$ is censored from the right by the censoring variable $C_t$, which means that $Y_t$ is observed only partly and is recorded as $C_t$. Censoring of the response measurements occurs in different situations in business, finance, and economics. If censored observations are ignored, the parameter estimates of the fitted time-series models are usually unreliable.
To express the censoring mechanism specifically, let $Z_t$ be the incomplete (right-censored) time-series that is observed instead of $Y_t$ because of $C_t$. This case can be formulated as follows:

$$Z_t = \min(Y_t, C_t), \qquad \delta_t = I(Y_t \le C_t), \tag{1}$$

where $I(\cdot)$ is an indicator function and $\delta_t$ is a binary variable that carries the censoring information: $\delta_t = 1$ if $Y_t$ is fully observed and $\delta_t = 0$ if it is censored. From Eq. (1), the censored time-series is denoted by the pairs $\{(Z_t, \delta_t),\ t = 1, 2, \ldots, n\}$.
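As a minimal sketch of the censoring mechanism in Eq. (1), the snippet below builds the observed pair $(Z_t, \delta_t)$ from a fully observed series and a censoring series. The function name `right_censor` and the toy values are illustrative assumptions and do not come from the study.

```python
import numpy as np

def right_censor(y, c):
    """Return (z, delta) with z_t = min(y_t, c_t) and delta_t = I(y_t <= c_t)."""
    y = np.asarray(y, dtype=float)
    c = np.asarray(c, dtype=float)
    z = np.minimum(y, c)            # recorded value: y_t, or c_t when censored
    delta = (y <= c).astype(int)    # 1 = fully observed, 0 = right-censored
    return z, delta

# Toy example (hypothetical values): only the third point is censored.
y = np.array([1.2, 0.4, 2.5, 1.9])
c = np.array([2.0, 1.0, 1.5, 1.9])
z, delta = right_censor(y, c)
```

In this toy case the third observation is recorded as its censoring value 1.5, and its indicator is 0.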
As indicated in Section 1, the TAR model considers the different regimes at different times, meaning that the series exhibits threshold behavior (see [3,4]). Accordingly, the regimes in the series are determined by a threshold or transition variable $S_t$, which is compared with the threshold value $m$. Here, $S_t$ is one of the lags of the right-censored time-series $Z_t$, that is, $S_t = Z_{t-d}$, where $d$ is a lag parameter. Thus, the "self-exciting" definition is realized in the SETAR model, which is defined as follows:

$$Z_t = \begin{cases} \phi_{1,0} + \sum_{j=1}^{q_1} \phi_{1,j} Z_{t-j} + \varepsilon_t, & S_t \le m, \\ \phi_{2,0} + \sum_{i=1}^{q_2} \phi_{2,i} Z_{t-i} + \varepsilon_t, & S_t > m, \end{cases} \tag{2}$$

where the $\varepsilon_t$'s are random error terms that are independent and identically distributed with zero mean and constant variance $\sigma_\varepsilon^2$. $\phi_1 = (\phi_{1,j}),\ j = 0, 1, \ldots, q_1$ and $\phi_2 = (\phi_{2,i}),\ i = 0, 1, \ldots, q_2$ are the coefficients of the SETAR model to be estimated, where $q_1$ and $q_2$ are the autoregressive (AR) degrees of the lower and upper regimes, respectively. Notice that the threshold variable $S_t$ depends on the lag ($d$) and is compared with the threshold ($m$), so both should be chosen suitably (see [16]). In Eq. (2), the SETAR model is given with two regimes.
The main interest here is estimating the vector of autoregressive coefficients $\Phi(m) = (\phi_1^{\top}, \phi_2^{\top})^{\top}$ of model Eq. (2) for both regimes. For this purpose, Tong [3] proposed a maximum likelihood estimator (MLE) under the assumption of normally distributed error terms and used the Akaike information criterion (AIC) to choose the threshold constant ($m$) and optimal lag ($d$). However, MLE works well only when both regimes of Eq. (2) are first-order autoregressive models, which limits the advantages of the SETAR model (see [17]). On the other hand, [7] and [17] show that the conditional least squares (CLS) estimate of $\Phi(m)$ is consistent for SETAR models with different autoregressive degrees and numbers of regimes. Also, [11] applied this method successfully to different data examples. This study, therefore, focuses on the CLS approach for right-censored SETAR model estimation. Let us rewrite model Eq. (2) as follows:

$$Z_t = X_{1t}\,\phi_1\, I(Z_{t-d} \le m) + X_{2t}\,\phi_2\, I(Z_{t-d} > m) + \varepsilon_t, \tag{3}$$

where $X_{1t} = (1, Z_{t-1}, \ldots, Z_{t-q_1})$ and $X_{2t} = (1, Z_{t-1}, \ldots, Z_{t-q_2})$ are the rows of the covariate matrices formed by the lags of the dependent variable, $\phi_1 = (\phi_{(1,0)}, \ldots, \phi_{(1,q_1)})^{\top}$ and $\phi_2 = (\phi_{(2,0)}, \ldots, \phi_{(2,q_2)})^{\top}$ are the vectors of coefficients for the lower and upper regimes of Eq. (3), and the $\varepsilon_t$'s are stationary random error terms with a constant variance. Model Eq. (3) can be written compactly, together with its CLS estimator, as

$$Z_t = X_t(m)\,\Phi(m) + \varepsilon_t, \tag{4}$$

$$\hat\Phi(m) = \left(X(m)^{\top} X(m)\right)^{-1} X(m)^{\top} Z, \tag{5}$$

where $X_t(m) = \left(X_{1t}\, I(Z_{t-d} \le m),\ X_{2t}\, I(Z_{t-d} > m)\right)$, $X(m)$ is the matrix with rows $X_t(m)$, and $Z = (Z_1, \ldots, Z_n)^{\top}$. Notice that the estimation of Eq. (4) depends on the optimal values of the $m$ and $d$ parameters. Choosing $m$ and $d$ is achieved by calculating the residuals $\hat\varepsilon_t$. From Eq. (5), $\hat\varepsilon_t(m) = Z_t - X_t(m)\,\hat\Phi(m)$, and the variance of the model is obtained as $\hat\sigma_\varepsilon^2(m) = n^{-1} \sum_{t=1}^{n} \hat\varepsilon_t^2(m)$. In addition, determining the threshold variable $Z_{t-d}$ and the optimal lag $d$ is an important problem to be solved. Therefore, regarding the SETAR model, $m$ and $d$ are selected by minimizing $\hat\sigma_\varepsilon^2(m)$ through an appropriate grid search in the following minimization problem:

$$(\hat m, \hat d) = \arg\min_{m,\,d}\ \hat\sigma_\varepsilon^2(m), \tag{6}$$

where $\hat m$ and $\hat d$ are the chosen threshold and lag parameters, and the lag parameter should satisfy the condition $d < (q_1, q_2)$. Instead of optimizing Eq. (6), as shown by [18], AIC is used to choose the optimal $m$ and $d$ as below:

$$AIC(m, d) = n_1 \ln \hat\sigma_{\varepsilon_1}^2 + n_2 \ln \hat\sigma_{\varepsilon_2}^2 + 2(q_1 + 1) + 2(q_2 + 1), \tag{7}$$

where $n_1$ and $n_2$ are the sample sizes in the lower and upper regimes, respectively. Similarly, $\hat\sigma_{\varepsilon_1}^2$ and $\hat\sigma_{\varepsilon_2}^2$ are the residual variances of these regimes. It should also be noted that the selection of $(\hat m, \hat d)$ is realized in three steps that are discussed by [19]. Thus, after the selection of $(\hat m, \hat d)$ using Eq. (7), $\hat\Phi(m)$ is obtained using Eq. (5). However, because the time-series $Z_t$ is right-censored, Eq. (5) cannot be used directly. Therefore, the following three solution techniques are introduced: KMW, SDT, and kNN imputation.
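The CLS fit for a fixed pair $(m, d)$ and the AIC-based grid search described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the trimming of candidate thresholds to interior quantiles (`trim`), the grid size, the minimum regime size, and the toy data-generating process are all hypothetical choices, and the AIC is written in one common two-regime form rather than taken from the study.

```python
import numpy as np

def setar_design(z, q1, q2, d, m):
    """Response vector and regime-split design matrix X(m) of a two-regime SETAR."""
    z = np.asarray(z, dtype=float)
    p = max(q1, q2, d)                       # first p points are lost to lagging
    t = np.arange(p, len(z))
    def lag_block(q):                        # rows (1, z[t-1], ..., z[t-q])
        return np.column_stack([np.ones(len(t))] + [z[t - j] for j in range(1, q + 1)])
    low = (z[t - d] <= m).astype(float)      # indicator I(Z_{t-d} <= m)
    X = np.hstack([lag_block(q1) * low[:, None],
                   lag_block(q2) * (1.0 - low)[:, None]])
    return z[t], X, low

def fit_cls(z, q1, q2, d, m):
    """Conditional least squares for fixed (m, d): coefficients, residuals, regime flags."""
    resp, X, low = setar_design(z, q1, q2, d, m)
    phi, *_ = np.linalg.lstsq(X, resp, rcond=None)
    return phi, resp - X @ phi, low

def regime_aic(resid, low, q1, q2):
    """AIC summed over both regimes (assumed form, see the lead-in)."""
    r1, r2 = resid[low == 1], resid[low == 0]
    return (len(r1) * np.log(np.mean(r1 ** 2)) + len(r2) * np.log(np.mean(r2 ** 2))
            + 2 * (q1 + 1) + 2 * (q2 + 1))

def grid_search(z, q1, q2, d_values, trim=0.15, n_grid=30):
    """Try thresholds on interior quantiles of z for every candidate delay d."""
    best = None
    for d in d_values:
        for m in np.quantile(z, np.linspace(trim, 1 - trim, n_grid)):
            phi, resid, low = fit_cls(z, q1, q2, d, m)
            n1 = int(low.sum())
            n2 = len(low) - n1
            if n1 <= q1 + 2 or n2 <= q2 + 2:   # skip regimes with too few points
                continue
            score = regime_aic(resid, low, q1, q2)
            if best is None or score < best[0]:
                best = (score, m, d, phi)
    return best

# Toy two-regime series (hypothetical parameters, threshold at 0, delay 1).
rng = np.random.default_rng(1)
z = np.zeros(300)
for t in range(1, 300):
    a, b = (0.5, 0.6) if z[t - 1] <= 0.0 else (-0.5, 0.3)
    z[t] = a + b * z[t - 1] + rng.normal(0.0, 0.5)

score, m_hat, d_hat, phi_hat = grid_search(z, q1=1, q2=1, d_values=[1])
```

The returned `phi_hat` stacks the lower-regime coefficients first and the upper-regime coefficients second, mirroring $\Phi(m) = (\phi_1^{\top}, \phi_2^{\top})^{\top}$. The guard on regime size avoids a degenerate perfect fit in a nearly empty regime, which would otherwise drive the logarithm of the residual variance to minus infinity.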

Kaplan-Meier Weights
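This section adapts the SETAR model estimation given in Eq. (5) to the right-censored time-series $Z_t$. To handle the right-censored observations, KMW is used, as suggested and discussed by [18] and [20], respectively. In the case of SETAR model estimation, the weight matrix $W$ is added to Eq. (5). Here, the Kaplan-Meier weights are given by:

$$w_{(t)} = \frac{\delta_{(t)}}{n - t + 1} \prod_{j=1}^{t-1} \left(\frac{n - j}{n - j + 1}\right)^{\delta_{(j)}}, \tag{8}$$

where $\left(Z_{(t)}, \delta_{(t)}\right)_{t=1}^{n}$ are the observation pairs ordered by the values of $Z_t$, and $W$ is the diagonal matrix of the weights. The weighted estimator is

$$\hat\Phi_W(m) = \left(X(m)^{\top} W X(m)\right)^{-1} X(m)^{\top} W Z, \tag{9}$$

where $\hat\Phi_W(m)$ denotes the estimated coefficients based on KMW. Accordingly, a fitted model is obtained with Eq. (9) and Eq. (5):

$$\hat Z_t^{W} = X_t(m)\,\hat\Phi_W(m), \tag{10}$$

where $\hat Z_t^{W}$ denotes the fitted values obtained by KMW. Hence, the right-censored SETAR model is estimated by the modified estimator $\hat\Phi_W(m)$ based on KMW.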
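To make the weighting step concrete, the following sketch computes Kaplan-Meier weights in their standard product form and plugs them into a weighted least squares fit. The function names, the tie-breaking rule (uncensored points first), and the toy inputs are assumptions made for illustration; the design matrix here is a generic one rather than the SETAR matrix $X(m)$.

```python
import numpy as np

def kaplan_meier_weights(z, delta):
    """Kaplan-Meier weights for right-censored data, returned in the original order."""
    z = np.asarray(z, dtype=float)
    delta = np.asarray(delta, dtype=int)
    n = len(z)
    order = np.lexsort((1 - delta, z))       # sort by z; ties: uncensored first
    d_sorted = delta[order]
    ranks = np.arange(1, n + 1)
    factors = ((n - ranks) / (n - ranks + 1.0)) ** d_sorted
    # Product over all earlier ranks j < t (empty product = 1 for the first rank).
    prod_before = np.concatenate(([1.0], np.cumprod(factors)[:-1]))
    w_sorted = d_sorted / (n - ranks + 1.0) * prod_before
    w = np.empty(n)
    w[order] = w_sorted
    return w

def weighted_ls(X, z, w):
    """Solve (X' W X) beta = X' W z with W = diag(w) via a rescaled least squares."""
    sw = np.sqrt(np.asarray(w, dtype=float))
    coef, *_ = np.linalg.lstsq(X * sw[:, None], np.asarray(z, dtype=float) * sw, rcond=None)
    return coef

# Toy example: the second of four ordered points is censored.
w = kaplan_meier_weights([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 1])

# Generic design with an intercept and one covariate; the response is exactly linear.
X = np.column_stack([np.ones(4), np.arange(4.0)])
coef = weighted_ls(X, X @ np.array([1.0, 2.0]), w)
```

In the toy example the censored point receives weight zero and its mass is redistributed to the larger uncensored points, so the weights are 0.25, 0, 0.375, 0.375. Without censoring every weight equals $1/n$ and the fit reduces to ordinary least squares.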

Synthetic Data Transformation
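Another solution for a right-censored time-series is synthetic data transformation (SDT), which is an unbiased way to transform the right-censored $Z_t$ into a synthetic variable $Z_{t\hat G}$ whose expected value matches that of the uncensored series, $E(Z_{t\hat G}) \cong E(Y_t)$ (see [21]). Similar transformation techniques have been studied by [22,23], and [24]. The SDT procedure proposed by [15] can be shown as:

$$Z_{tG} = \frac{\delta_t Z_t}{1 - G(Z_t)}, \tag{11}$$

where $G$ is the distribution function of the censoring variable $C_t$. However, in real-world applications, because $G$ is generally unknown, [15] suggested replacing $G$ with its Kaplan-Meier estimator $\hat G$, which is given for an arbitrary data point $r$ by:

$$1 - \hat G(r) = \prod_{t=1}^{n} \left(\frac{n - t}{n - t + 1}\right)^{I\left[Z_{(t)} \le r,\ \delta_{(t)} = 0\right]}, \tag{12}$$

where $\left(Z_{(t)}, \delta_{(t)}\right)_{t=1}^{n}$ are the ordered observation pairs mentioned after Eq. (8). Note that, owing to this main property of the SDT, $\hat\Phi_{SDT}(m)$ is the modified estimator of $\Phi(m)$ based on the SDT approach. Accordingly, the estimator and fitted model can be obtained as follows:

$$\hat\Phi_{SDT}(m) = \left(X(m)^{\top} X(m)\right)^{-1} X(m)^{\top} Z_{\hat G}, \tag{13}$$

$$\hat Z_t^{SDT} = X_t(m)\,\hat\Phi_{SDT}(m), \tag{14}$$

where $Z_{\hat G}$ is the vector of synthetic observations and the $\hat Z_t^{SDT}$ values are the fitted values obtained with the SDT technique.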
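A short sketch of the transformation may help: censored points are set to zero, and uncensored points are inflated by the Kaplan-Meier estimate of the censoring survival function. The function name, the tie-breaking rule (uncensored points first), and the toy values are illustrative assumptions.

```python
import numpy as np

def synthetic_transform(z, delta):
    """Synthetic data: delta_t * z_t / (1 - G_hat(z_t)), with G_hat a Kaplan-Meier estimate."""
    z = np.asarray(z, dtype=float)
    delta = np.asarray(delta, dtype=int)
    n = len(z)
    order = np.lexsort((1 - delta, z))        # sort by z; ties: uncensored first
    d_sorted = delta[order]
    ranks = np.arange(1, n + 1)
    # Only censored points (delta = 0) contribute a factor to 1 - G_hat.
    factors = ((n - ranks) / (n - ranks + 1.0)) ** (1 - d_sorted)
    surv_sorted = np.cumprod(factors)         # 1 - G_hat evaluated at each ordered z
    surv = np.empty(n)
    surv[order] = surv_sorted
    # Censored points map to 0, so their denominator is replaced by 1 to avoid 0/0.
    safe = np.where(delta == 1, surv, 1.0)
    return np.where(delta == 1, z / safe, 0.0)

# Toy example: the second of four points is censored.
z_syn = synthetic_transform([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 1])
```

In the toy example the synthetic values are 1, 0, 4.5, and 6: the uncensored points above the censored one are scaled up by $1/(2/3)$, while the censored point is set to zero. Uncensored points are therefore modified as well, which is the price paid for matching the expected value.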

kNN Imputation
kNN imputation for a right-censored time-series is introduced by [14] to overcome right censorship. The main function of kNN imputation is to replace the right-censored observations with their kNN estimates. Notice that for any censored point, the imputed value is the average of the $k$ nearest uncensored neighbors. For a detailed discussion of kNN imputation, see [24] and [14]. Notice that the kNN imputation technique does not need any distributional assumption and, unlike KMW and SDT, does not touch the uncensored observations. These properties are the main differences between kNN imputation and the other two solution methods. In this paper, the distance between neighbors is measured with the commonly used Euclidean distance, which can be given by:

$$d(t, s) = \sqrt{\sum_{j=1}^{p} \left(Z_{t-j} - Z_{s-j}\right)^2}. \tag{15}$$

An algorithm provided for kNN imputation by [14] is given below in table 1. From the algorithm given in table 1, the estimator and fitted model based on kNN imputation are provided by:

$$\hat\Phi_k(m) = \left(X(m)^{\top} X(m)\right)^{-1} X(m)^{\top} Z^{k}, \tag{16}$$

$$\hat Z_t^{k} = X_t(m)\,\hat\Phi_k(m), \tag{17}$$

where $Z^{k}$ is the imputed series and the $\hat Z_t^{k}$'s are the fitted values of the SETAR model obtained with the kNN imputation technique.

Table 1. Algorithm for kNN imputation
Input: Right-censored time-series $Z_t$; censoring indicator $\delta_t$ associated with $Z_t$; number of nearest neighbors $k$; determined lags $Z_{t-1}, \ldots, Z_{t-p}$ used to calculate distances.
Output: Imputed dataset $Z_t^{k}$
1: Begin
2: for ($t = 1$ to $n$) do
3:   if ($\delta_t = 0$) do (if data point is censored)
4:     for ($j = 1$ to $n$) do
5:       Find the distances between $Z_{t-p_1}$ and $Z_{t-p_2}$ for each censored data point with Eq. (15)
6:     Sort the distances from small to large
7:     for ($j = 1$ to $k$) do
8:       Take the first uncensored $k$ values of $Z_t$ associated with the sorted distances
9:     Calculate the $t$-th imputed value $Z_t^{k}$ as the average of the nearest $k$ records of $Z_t$
10:    Replace the censored data point ($Z_t$, $\delta_t = 0$) with the imputed value $Z_t^{k}$
11: Return $Z_t^{k}$
12: End.
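The imputation steps can be sketched in a few lines. This is an illustrative reading of the algorithm rather than the implementation of [14]: the function name, the use of the recorded (possibly censored) values inside the lag vectors, and the decision to leave a censored point unchanged when it has no complete lag vector or no donor are assumptions.

```python
import numpy as np

def knn_impute(z, delta, k=3, p=2):
    """Replace censored points by the mean of the k nearest uncensored neighbours.

    Neighbours are ranked by the Euclidean distance between lag vectors
    (z[t-1], ..., z[t-p]).
    """
    z = np.asarray(z, dtype=float)
    delta = np.asarray(delta, dtype=int)
    n = len(z)
    out = z.copy()
    # Row i holds the lag vector of time point t = p + i.
    lagmat = np.column_stack([z[p - j:n - j] for j in range(1, p + 1)])
    donors = np.flatnonzero(delta[p:] == 1) + p      # uncensored points with full lags
    for t in np.flatnonzero(delta == 0):
        if t < p or len(donors) == 0:
            continue                                  # assumption: leave such points as is
        dist = np.sqrt(((lagmat[donors - p] - lagmat[t - p]) ** 2).sum(axis=1))
        nearest = donors[np.argsort(dist, kind="stable")[:k]]
        out[t] = z[nearest].mean()                    # average of the k nearest records
    return out

# Toy example: the last point is censored and recorded as 1.8.
z = np.array([1.0, 2.0, 1.0, 2.5, 1.0, 1.8])
delta = np.array([1, 1, 1, 1, 1, 0])
z_imp = knn_impute(z, delta, k=2, p=1)
```

In the toy example the censored point has lag value 1.0, its two nearest donors are the points that also follow a value of 1.0 (namely 2.0 and 2.5), and the imputed value is their average, 2.25. All uncensored observations are returned unchanged.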

Evaluation Metrics
This section presents the evaluation metrics for the three SETAR model estimates based on the three censorship solution techniques. The fits of these models are denoted by $\hat Z_t^{W}$, $\hat Z_t^{SDT}$, and $\hat Z_t^{k}$. For our purposes, three metrics commonly used in time-series analysis are preferred: root mean squared error (RMSE), mean squared error (MSE), and mean absolute percentage error (MAPE). These measures are given in terms of the joint notation $\{\hat Z_t\}_{t=1}^{n}$ for the three sets of fitted values:

$$RMSE = \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left(Z_t - \hat Z_t\right)^2}, \tag{18}$$

$$MSE = \frac{1}{n} \sum_{t=1}^{n} \left(Z_t - \hat Z_t\right)^2, \tag{19}$$

$$MAPE = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{Z_t - \hat Z_t}{Z_t} \right|. \tag{20}$$

Using these criteria, the three model estimates are compared and the effect of censorship on the estimated SETAR models is measured.
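For reference, the three metrics can be computed as in the sketch below. The function name is hypothetical, and two details are assumptions rather than facts taken from the study: the fitted values are compared against the series passed as `z`, and MAPE is reported in percent.

```python
import numpy as np

def evaluation_metrics(z, z_hat):
    """Return MSE, RMSE and MAPE (in percent) of fitted values z_hat against z."""
    z = np.asarray(z, dtype=float)
    z_hat = np.asarray(z_hat, dtype=float)
    err = z - z_hat
    mse = float(np.mean(err ** 2))
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "MAPE": float(100.0 * np.mean(np.abs(err / z))),  # undefined if any z_t is zero
    }

# Toy example: errors of +1 and -1 on the values 2 and 4.
scores = evaluation_metrics([2.0, 4.0], [1.0, 5.0])
```

For the toy input, MSE and RMSE are both 1, and MAPE is the average of 50% and 25%, that is, 37.5%.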

Simulation Study
The purpose of this section is to investigate the performance of the censorship solution methods on SETAR model estimation using simulation evidence. Accordingly, right-censored, stationary time-series following a SETAR model as in Eq. (2) are generated as follows, based on the work of Chan and Tsay [25]:

$$Z_t = \begin{cases} 0.3 + 0.7 Z_{t-1} + 0.4 Z_{t-2} + \varepsilon_t, & Z_{t-d} \le m, \\ 1.5 - 0.5 Z_{t-1} - 0.2 Z_{t-2} + \varepsilon_t, & Z_{t-d} > m, \end{cases} \tag{21}$$

where $Z_t$ is obtained through the censoring variable $C_t$, which is generated based on $\mu_{Y_t}$ and $\sigma_{Y_t}^2$. Hence, the pair of time-series $(Z_t, \delta_t)$ is obtained as mentioned in Eq. (1). To express the model parameters and related matrices clearly, model Eq. (21) is written in harmony with Eq. (3) as:

$$Z_t = X_{1t}\,\phi_1\, I(Z_{t-d} \le m) + X_{2t}\,\phi_2\, I(Z_{t-d} > m) + \varepsilon_t, \tag{22}$$

where $X_{1t} = (1, Z_{t-1}, Z_{t-2})$ and $X_{2t} = (1, Z_{t-1}, Z_{t-2})$, $\phi_1 = (0.3, 0.7, 0.4)^{\top}$ and $\phi_2 = (1.5, -0.5, -0.2)^{\top}$. Note that the error terms are generated i.i.d. as $\varepsilon_t \sim (\mu_\varepsilon = 0,\ \sigma_\varepsilon^2 = 0.5)$. Accordingly, our aim here is to estimate $\Phi(m) = (\phi_1, \phi_2)$ under different scenarios by using the pairs $(Z_t, \delta_t)$ and the related censoring solution techniques. Therefore, in our simulation study, the performances of $\hat Z_t^{W}$, $\hat Z_t^{SDT}$, and $\hat Z_t^{k}$ are measured based on how well the estimates converge to the real vector of parameters $\Phi(m)$. To achieve this, the performance metrics given in Eqs. (18)-(20) are used. In this simulation study, we generate 500 random samples of sizes $n = 50$, $100$, and $200$ with censoring levels $CL = 5\%$ and $20\%$.
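One way to generate a single replicate of such a design is sketched below. Several ingredients are assumptions made only for illustration and are not stated in the text: normally distributed errors, the delay $d = 1$, the burn-in length, and the calibration of the censoring variable, which is drawn with the mean and standard deviation of the simulated series and then shifted so that the empirical censoring rate hits the target level. The function names are hypothetical; the threshold 0.8 is the true value mentioned with table 3.

```python
import numpy as np

def simulate_setar(n, phi1, phi2, m, d=1, sigma2=0.5, burn=100, seed=0):
    """Simulate a two-regime SETAR path; phi = (intercept, AR coefficients)."""
    rng = np.random.default_rng(seed)
    p = max(len(phi1) - 1, len(phi2) - 1, d)
    y = np.zeros(n + burn + p)
    eps = rng.normal(0.0, np.sqrt(sigma2), size=len(y))   # assumption: Gaussian errors
    for t in range(p, len(y)):
        phi = phi1 if y[t - d] <= m else phi2             # regime chosen by y[t-d]
        lags = [y[t - j] for j in range(1, len(phi))]
        y[t] = phi[0] + float(np.dot(phi[1:], lags)) + eps[t]
    return y[-n:]                                         # drop start-up and burn-in

def censor_at_level(y, level, seed=0):
    """Right-censor y so that a fraction `level` of the points is censored."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    c0 = rng.normal(np.mean(y), np.std(y), size=len(y))   # censoring draws from y's moments
    diff = y - c0
    shift = np.quantile(diff, 1.0 - level)                # calibrates the censoring rate
    delta = (diff <= shift).astype(int)                   # 1 = observed, 0 = censored
    z = np.where(delta == 1, y, c0 + shift)               # censored points record c_t
    return z, delta

phi1 = np.array([0.3, 0.7, 0.4])
phi2 = np.array([1.5, -0.5, -0.2])
y = simulate_setar(200, phi1, phi2, m=0.8, d=1, sigma2=0.5, seed=42)
z, delta = censor_at_level(y, level=0.20, seed=7)
censoring_rate = 1.0 - delta.mean()
```

Repeating these two calls with different seeds, sample sizes, and levels would give the replicates of a Monte Carlo design like the one described above; each replicate can then be passed to the three censorship solution techniques.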
The obtained results are presented in the following tables and figures. Figure 1 presents the generated dataset for four configurations. As can be seen in panels (b) and (d), the right-censored observations distort the data structure. Also, the lower and upper regimes can be seen clearly. It can be said that censorship makes it challenging to detect the threshold value ($m$), which indicates the separation of the regimes.
In table 2, the nonlinearity test results for the right-censored time-series $Z_t$ are provided. Here, Tsay's F-test for nonlinearity is used with type-I error $\alpha = 0.05$. The calculation of the F-statistic and further details are provided by Tsay (1986) [26]. The hypotheses of the test are as follows:

$H_0$: $Z_t$ follows some AR process
$H_1$: $Z_t$ has a nonlinear structure

According to table 2, because the p-value ($p = 0.020$) is smaller than $\alpha = 0.05$, $H_0$ is rejected and it is accepted that $Z_t$ has a nonlinear structure. After establishing the nonlinearity of the series, the optimal pair $(m, d)$ should be selected. To achieve this, a grid search is carried out based on AIC. The results are provided in table 3 and Figure 2. Figure 2 presents a grid search for a single generated SETAR series from a given configuration. The autoregressive degrees of the low and high regimes $(q_1, q_2)$ are generally chosen as $q_1 = 1, q_2 = 2$ or $q_1 = 2, q_2 = 1$. The optimal threshold value lies between $m = 0.4$ and $m = 2.70$. Notice that if panels (b) and (c) are carefully inspected, the censoring level affects the choice of the optimal threshold value ($m$) significantly. The results therefore indicate that a high level of censorship makes it difficult to identify the regimes, which is why the range of $m$ is wide. We expect to solve this problem using the three censorship solution techniques, KMW, SDT, and kNN.
Table 3 presents the optimally chosen orders of the low and high regimes $(\hat q_1, \hat q_2)$ of the SETAR models to be estimated, the threshold value ($\hat m$), and the delay of the threshold variable ($\hat d$) for all simulation configurations. Notice that, based on AIC, the behavior of $\hat m$ and $(\hat q_1, \hat q_2)$ can be observed according to sample size and censoring level. It is clear that under heavy censorship, $\hat m$ is selected far from the real value of $m = 0.8$. On the other hand, when $n = 200$, AIC selects both $(\hat q_1, \hat q_2)$ and $\hat m$ more accurately than in the other configurations. After obtaining the optimal parameters of the SETAR models, the estimation can be carried out.
Tables 4-5 give the estimated autoregressive coefficients for both the low and high regimes. In both tables, it can be observed that SDT and kNN obtain similar estimates, whereas KMW yields relatively smaller coefficients. The difference between KMW and the others is that the weight matrix in KMW makes the estimates more stable than those of kNN and SDT. These inferences are supported by the scores in tables 4-5. The changes in the estimated coefficients from lower to higher censoring levels are smaller for KMW than for the SDT and kNN methods. Although KMW is more robust against censorship, the performance scores given in table 6 show that its performance is worse than that of kNN and SDT. Hence, one may select the KMW-based SETAR model estimation for more stable estimates at the cost of lower performance. On the other hand, kNN and SDT give good results, but they are affected by censorship more than KMW.

Table 6 presents the scores of the evaluation metrics MAPE, RMSE, and MSE, respectively, for all simulation configurations. The best scores are indicated in bold. At first glance, two expected results can be observed: performance improves as $n$ increases, and high censoring levels have a negative effect on the methods. The results show that kNN dominates the other two methods in general, but some points should be emphasized. SDT performs well in terms of the MAPE criterion when $n = 50$ and $n = 200$. Also, when $CL = 5\%$, even though kNN gives the best results, SDT is right behind. However, for $CL = 20\%$, SDT fails most of the time, and it is affected by the censoring level more than kNN and KMW. Note that, as mentioned above, the MAPE values of KMW increase less as the censoring level increases. Additionally, because kNN has no restrictions and is a fully nonparametric technique, it appears to be robust against censorship, as can be seen from the MSE and RMSE values.
Figure 3 shows barplots of the scores provided in table 6. The mentioned inferences can be easily observed from the figure. In figure 4, the fitted SETAR(2, 1) and SETAR(2, 2) models are given with the generated censored time-series ($Z_t$) and the completely observed series ($Y_t$). In these panels, horizontal dashed lines denote the optimally chosen threshold value ($\hat m$) for each configuration. It is clear that the regimes are determined correctly in the SETAR model, and the three fits represent the data well. In detail, the poor performance of the KMW fit can be distinguished from kNN and SDT, especially in panels (b) and (d), due to heavy censorship. The closeness of kNN and SDT is also observed. Note that to save space, only certain configurations are illustrated in the figure. However, the performances of the methods can be monitored from tables 4-5 and table 6. The simulation results show that all three censorship solution methods work well with SETAR and can be easily integrated with it. Regarding the comparison, kNN produces the best results, while SDT and KMW show satisfying results under specific conditions. KMW also has a different advantage: it diminishes the effect of the censoring level on the estimation performance.

Case Study: Covid-19 Data from China
This section examines the behavior of the introduced SETAR estimators on a real dataset that involves right-censored observations. In this context, their performances are compared to the simulation results. Covid-19 data from China was selected for this purpose. The dataset consists of 104 data points and two variables provided by Afshin and Jorge (2020). In this study, the modeling procedure is carried out for the variable giving the number of patients recovered from Covid-19 (recover), which is a right-censored time-series with a censoring level of 6.73%. Accordingly, SETAR model statistics based on the modeling procedures are provided in the following tables and figures. To test the stationarity of the Covid-19 series, the ADF test is applied, and the results are shown in table 7. Accordingly, it is seen that the data is stationary at its 4th lag. Hence, the remaining analysis is carried out accordingly to ensure the stationarity assumption. Table 8 gives Tsay's nonlinearity test results, which are needed to show that the Covid-19 time-series is suitable for the SETAR-type model, together with the optimal SETAR model parameters determined by AIC. As can be seen, nonlinearity is validated for the raw (incomplete) series $Z_t$, the series transformed by SDT, $Z_{t\hat G}$, and the series imputed by kNN, $Z_t^{k}$. Tsay's F-statistic and the p-values show that the null hypothesis, which claims that the mentioned series follows some AR process, is strongly rejected. Thus, the associated SETAR parameters $(q_1, q_2)$, $m$, and $d$ are determined in table 8. From that, the suitable model for the Covid-19 dataset is selected as SETAR(1, 1). The estimated coefficients for SETAR(1, 1) are given in table 8 for the low and high regimes for all three methods. The best scores are indicated in bold. As in the simulation study, the coefficients of KMW differ from the other two because of the Kaplan-Meier weights.
This difference is also seen in Table 9, which includes the values of the performance criteria. Notice that SDT and kNN provide closer results, and SETAR(1, 1) gives the smallest MSE and RMSE values, whereas SDT gives smaller MAPE values. These results show that the introduced estimators behave coherently in both the real data example and the simulation study. Figure 6 includes the barplots of the values given in table 10 to illustrate the difference between the performances of the methods more easily. In figure 7, the fitted SETAR(1, 1) models obtained by the three methods are given. As in the simulation study, the threshold values ($m$) for the three methods are illustrated by horizontal lines. One can see the low and high regimes easily here. In this example, the low regime represents the beginning of the Covid-19 pandemic with right-censored observations, and the high regime indicates the time when the disease peaked and then declined. In the figure, the imputed values of kNN can be seen in the censored region of the series. Also, the fits of kNN and SDT are closer to each other, as previously mentioned, and KMW differs by representing the real series worse than the other two methods. However, it should be noted that although KMW performs worse than the other two methods in this example, it can still handle the right-censored data well, as explained in the simulation study.

Figure 6. Barplots of performance scores

Conclusions
This paper aims to show SETAR-type model estimation when the time-series is right-censored. In particular, the censorship solutions KMW, SDT, and kNN imputation are combined with the SETAR estimation procedure, and their performances are inspected in practice. To achieve this purpose, a simulation study and a real-data case study are provided. Based on the results of both studies, the following conclusions are obtained:
• Regarding SETAR models and right-censored nonlinear time-series, it is shown that censorship strongly affects the accuracy of the estimates by distorting the data. In Table 3, one can see how the optimal parameter selection of SETAR is affected by the censoring level.
• Especially for the selection of the threshold value ($m$), if the high regime involves more censored observations, then the detection of the regimes may be challenging. It is found that kNN- and SDT-based SETAR model estimations solve this problem better than the KMW method.
• Although KMW gives worse SETAR fits than kNN and SDT, it has a stability advantage in the estimated models, as demonstrated in the simulation study. In addition, under low censoring levels, kNN and SDT give very close results, which is also confirmed by the Covid-19 data example. However, under heavily censored data, kNN dominates the other methods and gives the best performance scores.
As a result, this paper suggests that, under right-censored data, the SETAR model can be estimated successfully with the censorship solution techniques, and that kNN imputation works well for all simulation configurations and the Covid-19 example.
As a continuation of this study, it is planned to study non-parametric and semi-parametric estimation methods in order to make predictions with less risk in the estimation of right-censored SETAR models. In addition, it is planned to use different criteria for the selection of the threshold value ($m$) and lag parameter ($d$) shown in Eq. (6), which determine the upper and lower regimes, and to analyze the results.