Design of methodology for estimation of parcel investment attractiveness using spatially adjusted predictive modelling

Determining investment attractiveness is an important task for economic and administrative activity. The investment attractiveness of a land parcel can be assessed quantitatively as the dynamics of its sale price. In addition, spatial parameters can be taken into account when estimating investment attractiveness. To predict land parcel sale prices and their dynamics, we applied ensemble regression analysis. The methodology was tested on publicly accessible data of the Russian cadastral agency. To cluster cadastral units by investment attractiveness depending on price and spatial parameters, we applied k-means clustering controlled by the Silhouette metric.


Introduction
The investment attractiveness of a land parcel (or, in a broader sense, of a real estate facility) is defined as an increase in the sale price of the object over a certain time interval. Assessment of the investment attractiveness of land parcels is an urgent task in the context of government and municipal management and planning, control of territorial development, preparation of business plans, and many other management, administrative, and economic processes. The spatial characteristics of a land parcel (its size, shape, absolute position, and position relative to other parcels and to significant objects) are among its most significant characteristics and also affect its investment attractiveness. In other words, these characteristics compose a spatial factor of investment attractiveness. This effect is emphasized and considered in many studies [1][2][3][4][5]. Researchers usually understand the spatial factor as a set of particular criteria that must be accounted for when assessing the potential sale price of a land parcel. The list of such criteria usually includes transport accessibility characteristics (e.g., availability of access roads and public transport stops), administrative affiliation (e.g., indicators of the regional development level and/or local socio-economic conditions), distance from social infrastructure facilities, etc.
The formal evaluation of land parcel investment attractiveness can be implemented by determining the real or predicted value of the parcel sale price and the price dynamics over time. We use the term "sale price" to mean the parcel price specified in the sale agreement. In general, predictive valuation of the market sale price (market value estimation) of even one real estate object is a rather time-consuming task [5]. Predictive assessment of investment attractiveness is more difficult still, since it involves searching for, selecting, and comparing several or many land parcels. In addition, assessing investment attractiveness requires assessing the spatial factor, which in turn demands multifactorial spatial analysis [4,5]. Investment attractiveness assessment is therefore considerably more complex than market value assessment.
At the same time, prediction tasks in many domains are automated effectively using predictive machine learning models. It can therefore be assumed that applying machine learning methods to predict the investment attractiveness of land parcels can help automate the investment attractiveness assessment and can help solve the problem of mass land parcel assessment. In this article, we consider some results of an exploratory study that gives a preliminary assessment of the possibility of developing a predictive machine learning model for assessing the investment attractiveness of land parcels with the spatial factor taken into account. The study was carried out on the example of the Tver region (Russia).

Data and methods
We used open data published by Rosreestr (the Russian cadastral agency, https://rosreestr.gov.ru) to perform the experimental work. Two datasets for 2006-2021 collected over the Tver region were used. The first one is a data array published by Rosreestr to provide information on completed real estate sale transactions (this dataset also includes some other characteristics of real estate facilities, such as addresses, area values, etc.). The second dataset was composed of annually updated values of the so-called cadastral prices of the land parcels located in the Tver region (estimated originally as market values, but rarely re-estimated). These data are also publicly available on the Rosreestr Web site. The parcel prices in both datasets are given in Russian rubles. To indicate the spatial (geographical) location of the parcels, we used the coordinates of the parcel centroids calculated from the coordinates of the parcel boundaries. The availability of spatial location data made it possible to additionally calculate the distances from parcels to infrastructure facilities. In particular, we estimated the distances to the M-11 federal highway.
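The two spatial derivations mentioned above (a centroid from boundary coordinates, and a distance to a highway) can be illustrated with a minimal sketch. It assumes that the boundary and the highway axis are given in a planar (projected) coordinate system, so that Euclidean distance is meaningful. The function names and the representation of the highway as a polyline are illustrative assumptions, not details taken from the Rosreestr data.

```python
import math

def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as [(x, y), ...]."""
    area2 = 0.0  # twice the signed area (shoelace formula)
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate polygon")
    return cx / (3.0 * area2), cy / (3.0 * area2)

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len2
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def distance_to_polyline(p, polyline):
    """Shortest distance from point p to a polyline [(x, y), ...]."""
    return min(
        point_segment_distance(p, polyline[i], polyline[i + 1])
        for i in range(len(polyline) - 1)
    )

# Hypothetical parcel (a 4 x 4 square) and a hypothetical straight road axis.
parcel = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
road = [(0.0, 10.0), (10.0, 10.0)]
centroid = polygon_centroid(parcel)
road_distance = distance_to_polyline(centroid, road)
```

For geographic (latitude/longitude) coordinates, the data would first need to be projected, or a geodesic distance used instead.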
As a formal criterion of investment attractiveness, we used a ratio indicator reflecting the parcel sale price dynamics over 2015-2021, where pr2021 and pr2015 are the parcel sale prices (per sq. m) in 2021 and 2015 respectively. Since sale transactions for individual land parcels (and even for the sets of parcels grouped into cadastral quarters, which are higher-level cadastral units) are not carried out annually, the array of parcel sale prices was very sparse. More than 85 % of the elements in this array were not filled with values. Figure 1 shows a fragment of the array in tabular form, which makes it possible to assess visually the ratio of filled and unfilled array elements. Data aggregation was performed in order to eliminate noise in the sale price data (expressed as significant deviations of sale prices from expected market values). The data were aggregated to the level of cadastral quarters (2,995 quarters in total). The annual sale price value for each quarter was calculated as the median of the sale prices of all land parcels belonging to that cadastral quarter.
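The median aggregation to cadastral quarters can be sketched as follows. The record layout (quarter identifier, year, price per sq. m) and the function names are illustrative assumptions. The `relative_change` helper shows one possible form of a price dynamics indicator, the relative change between two years. This form is an assumption made for the example and is not necessarily the indicator used in the study.

```python
from statistics import median

def aggregate_to_quarters(parcel_prices):
    """Median sale price per (quarter_id, year).

    parcel_prices: iterable of (quarter_id, year, price_per_sqm or None),
    where None marks a parcel-year without a sale transaction.
    """
    grouped = {}
    for quarter_id, year, price in parcel_prices:
        if price is None:
            continue  # unfilled elements do not contribute to the median
        grouped.setdefault((quarter_id, year), []).append(price)
    return {key: median(values) for key, values in grouped.items()}

def relative_change(price_start, price_end):
    """Assumed indicator form: relative price change between two years."""
    return (price_end - price_start) / price_start

# Hypothetical records for a single quarter "Q1".
records = [
    ("Q1", 2015, 100.0),
    ("Q1", 2015, 300.0),
    ("Q1", 2015, 200.0),
    ("Q1", 2021, None),
    ("Q1", 2021, 250.0),
]
quarter_medians = aggregate_to_quarters(records)
dynamics = relative_change(quarter_medians[("Q1", 2015)], quarter_medians[("Q1", 2021)])
```

The median is used rather than the mean because it is less sensitive to individual transactions whose prices deviate strongly from market values.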
To predict the future land parcel sale price (the median cadastral quarter sale price when operating on aggregated data), and accordingly to predict investment attractiveness, we applied regression analysis. In particular, we used the CatBoost ensemble algorithm [10], which combines decision trees through gradient boosting and fits them according to a given optimization metric. Mean Absolute Error (MAE) estimated on the median annual sale price values was used as the optimization metric, as it gave better convergence with CatBoost in our previous studies. The classical approach to forming the training and test samples was used. 80 % of the initial data were used as a test sample, while 20 % were used as a training sample.
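For reference, MAE is the mean of the absolute differences between observed and predicted values, so it is expressed in the same units as the prices (rubles per sq. m). A minimal sketch with hypothetical values:

```python
def mean_absolute_error(actual, predicted):
    """Mean of |actual - predicted| over paired observations."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical median quarter prices (rubles per sq. m) and predictions.
observed = [150.0, 400.0]
predicted = [200.0, 300.0]
mae = mean_absolute_error(observed, predicted)
```

Because the units match, an MAE value can be compared directly with the typical price level when judging whether a prediction is usable.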
Additionally, it should be noted that at the regression analysis stage we conducted a formal MAE-based assessment of the applicability of each approach to the restoration of land parcel sale price time series (zero value filling, recurrent filling, and cadastral price filling).
The median (for cadastral quarters) values of the land parcel sale prices in the study area in 2015-2021 range from 150 to 400 rubles per sq. m. Zero value filling applied to the restoration of land parcel sale price time series significantly distorts the real picture of the real estate market. Consequently, the CatBoost prediction of the sale prices was produced with an MAE of 800 rubles per sq. m, which in many cases is more than two times higher than the sale prices of land parcels located in the study area.
Recurrent filling led to a prediction with an MAE of 650 rubles per sq. m, which is also not a satisfactory result.
Finally, filling the missing sale price values with the cadastral prices for the corresponding years produced sale price predictions with an MAE of 300 rubles per sq. m. This assessment is also significantly distorted. However, cadastral price filling makes it possible to perform a comparative analysis of the overall dynamics of the median land parcel prices. The dynamics of the median parcel sale price and of the median parcel cadastral price are shown in Figure 2. In the comparative analysis, we observed a downward trend in the median parcel sale price, while the median parcel cadastral price had a slightly downward sideways trend. This behavior of the median parcel sale price may be related to the specifics of the sale transaction procedure, as it is common practice to state the price in the sale contract in accordance with the corresponding cadastral price [6]. At the same time, both median value series stay within the same price channel, which confirms that cadastral price filling can be applied to the restoration of land parcel sale price time series.
To increase the accuracy of the median sale price prediction, we conducted an experiment on reducing the dataset. In the explored dataset we kept only the cadastral quarters with the most complete data on land parcel sale prices for 2015 and 2021. Such quarters were identified by expert assessment. As a result, a set of 112 cadastral quarters was formed, which was used to retrain the ensemble model used in the experiments. At this stage the MAE of the prediction decreased to 68 rubles per sq. m.
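In the study the quarters were selected by expert assessment. Purely as an illustration of what a rule-based proxy for such a selection might look like, the sketch below keeps quarters that have observed median prices in both end years and a minimum number of observed years overall. The threshold `min_observed`, the data layout, and the function name are hypothetical and are not taken from the study.

```python
def select_complete_quarters(quarter_series, required_years=(2015, 2021), min_observed=4):
    """Return sorted ids of quarters with sufficiently complete price series.

    quarter_series: {quarter_id: {year: median_price or None}}
    """
    selected = []
    for quarter_id, by_year in quarter_series.items():
        # Both end years must be observed to compute the price dynamics.
        has_required = all(by_year.get(y) is not None for y in required_years)
        observed = sum(1 for v in by_year.values() if v is not None)
        if has_required and observed >= min_observed:
            selected.append(quarter_id)
    return sorted(selected)

# Hypothetical series for three quarters.
series = {
    "Q1": {2015: 200.0, 2016: 210.0, 2017: None, 2018: 190.0, 2021: 250.0},
    "Q2": {2015: None, 2016: 300.0, 2017: 310.0, 2018: 305.0, 2021: 280.0},
    "Q3": {2015: 150.0, 2016: None, 2017: None, 2018: None, 2021: 160.0},
}
kept = select_complete_quarters(series)
```

A rule of this kind cannot replace expert judgment about price plausibility, but it makes the completeness criterion explicit and repeatable.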
Afterwards, the reduced dataset was used to cluster cadastral quarters according to the dynamics of the median land parcel sale price. Clustering was performed using the k-means method [11]. The best clustering result (in terms of distribution uniformity) was obtained with a separation into five clusters (Fig. 3 and 4). The number of objects per cluster ranges from 10 to 48 in this case. The Silhouette metric was used to select the optimal number of clusters [12]. The Silhouette coefficient is calculated from the mean intracluster distance a and the mean distance to the nearest cluster b. In other words, b is the distance between an object and the nearest cluster that the object is not a part of. The metric value is calculated using the formula:
$$ s = \frac{b - a}{\max(a, b)} $$
A greater value of the Silhouette metric means that the clusters are located further away from each other in the observed space, and that the partitioning is correspondingly better. The value of the metric for the 5-cluster division turned out to be the maximum (0.28), whereas the value for the 7-cluster division, for instance, was 0.19. Thus, we identified five clusters when clustering cadastral quarters according to the dynamics of the median land parcel sale price. These clusters (analyzed and characterized qualitatively) can be interpreted as classes of investment attractiveness of the cadastral quarters (and of the parcels they contain) in the study area. A positive trend in the median value is observed in cluster 2. This cluster is the most investment-attractive. The median value is minimal and negative in cluster 5, so this is the cluster with the lowest investment attractiveness. Clusters 1, 3 and 4 have similar median values and cannot be interpreted separately from each other when clustering solely on median land parcel sale price data.
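The Silhouette coefficient described above can be computed directly from pairwise distances. The sketch below is a plain-Python illustration with Euclidean distance and hypothetical one-dimensional data. Setting the score of an object in a single-object cluster to 0 is a common convention and is assumed here.

```python
import math

def silhouette_score(points, labels):
    """Mean Silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention for single-object clusters
            continue
        # a: mean distance to the other members of the object's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical price dynamics values forming two obvious groups.
values = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_score(values, [0, 0, 1, 1])
bad = silhouette_score(values, [0, 1, 0, 1])
```

To choose the number of clusters, the score would be computed for each candidate k (for example 5 and 7, as in the text) and the partition with the highest value kept.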
As a result of investigating data quality with ensemble regression analysis, we also drew a conclusion on the importance of the parameters used to form the predictive model. One of the most significant parameters in the studied data was the distance to the M-11 federal highway. This significance is to be expected, given the importance of transport connectivity for the use and development of territories [7][8][9]. The parameter was investigated as part of a preliminary assessment of the influence of the spatial factor on the investment attractiveness of land parcels.
A quasi-uniform distribution is observed when the analyzed cadastral quarters are clustered into five clusters on two criteria (median sale price dynamics and distance to the M-11 highway). The number of objects per cluster varies from 20 to 26 (Fig. 5-7). The value of the Silhouette metric increased significantly compared with clustering on one parameter and was estimated as 0.34. As in the one-parameter case, the 7-cluster division did not produce a uniform distribution, and the Silhouette metric decreased to 0.32. A qualitative analysis of the clusters shows positive price dynamics in cluster 3. This means that cluster 3 is the most investment-attractive. The objects in this cluster are located closest to the M-11 highway (the distance does not exceed 10 km). Cluster 1 is close to cluster 3 in terms of both the sale price dynamics and the distance to the highway, but the sale price dynamics in this cluster is close to 0. Accordingly, such a class can hardly be considered attractive for investment.
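Clustering on two criteria with different units (a dimensionless price dynamics value and a distance in kilometres) can be sketched as below. Standardizing each feature before k-means is an assumption made for this example, so that the distance does not dominate purely because of its larger numeric range. The text does not state whether the features were rescaled. The data, function names, and the simple Lloyd-type k-means are illustrative only.

```python
import math
import random

def standardize(rows):
    """Z-score each column; constant columns are left centred at zero."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
        for c, m in zip(cols, means)
    ]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

def kmeans(points, k, iterations=100, seed=0):
    """Basic Lloyd's algorithm; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center for every point.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Hypothetical quarters: (price dynamics, distance to highway in km).
rows = [
    (0.30, 5.0), (0.28, 7.0), (0.32, 6.0),
    (-0.20, 80.0), (-0.25, 90.0), (-0.22, 85.0),
]
scaled = standardize(rows)
labels, centers = kmeans(scaled, k=2)
```

Cluster label numbers are arbitrary, so clusters should be interpreted by their contents (for example, median dynamics and median distance per cluster) rather than by their index.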

Conclusions
As part of the exploratory study, we evaluated the possibility of applying predictive machine learning to the spatially adjusted assessment of the investment attractiveness of land parcels. The study results allow us to conclude that ensemble regression models, and the CatBoost algorithm in particular, ensure reliable prediction of the land parcel sale price and/or its investment attractiveness. However, noise filtering of the initial data is necessary in order to achieve acceptable reliability. The filtering can be achieved by aggregation and by excluding incomplete data elements. In addition, the results of the study demonstrated that the effect of the spatial factor on the investment attractiveness of land parcels can be taken into account without a significant redesign of the applied predictive model.
In future studies, it is necessary to explore the possibilities of combining spatial parameters when predicting investment attractiveness. Additionally, the possibilities for automating the filtering of the initial data need to be explored.

Fig. 1. A fragment of the initial sparse array of land parcel sale prices.

Such a sparse dataset is of little use for automated modeling. The first task in the data processing chain was therefore the restoration of the sale price time series. As more than 98 % of the elements in the sale price array for 2006-2014 were unfilled, this part of the array was excluded from the subsequent analysis. To restore the time series, gaps can be filled in three ways:
• Zero value filling,
• Recurrent filling (in our case, replacement of missing values with values known for one of the following years),
• Cadastral price filling.
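The three gap-filling options listed above can be sketched for a single annual price series as follows. Missing years are represented by `None`. For recurrent filling, the sketch takes the nearest known value from a following year, which is one possible reading of the description and is an assumption here. The function names and the sample values are hypothetical.

```python
def fill_zero(series):
    """Replace every gap with 0."""
    return [0.0 if v is None else v for v in series]

def fill_recurrent(series):
    """Replace each gap with the nearest known value from a following year.

    Gaps after the last known value have no following year and stay None.
    """
    filled = list(series)
    next_known = None
    for i in range(len(filled) - 1, -1, -1):  # walk from the last year back
        if filled[i] is None:
            filled[i] = next_known
        else:
            next_known = filled[i]
    return filled

def fill_cadastral(series, cadastral):
    """Replace each gap with the cadastral price of the same year."""
    return [c if v is None else v for v, c in zip(series, cadastral)]

# Hypothetical sale prices for 2015-2020 (rubles per sq. m) with gaps,
# and hypothetical cadastral prices for the same years.
sale = [None, 200.0, None, None, 350.0, None]
cadastral = [180.0, 190.0, 195.0, 200.0, 205.0, 210.0]

zero_filled = fill_zero(sale)
recurrent_filled = fill_recurrent(sale)
cadastral_filled = fill_cadastral(sale, cadastral)
```

Zero filling inserts values far outside the observed price range, which is consistent with it giving the largest prediction error of the three options.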

Fig. 2. The dynamics of the median parcel sale price (in the left image), and the median cadastral price of parcel (in the right image).

Fig. 3. Separation of cadastral quarters into clusters according to the median land parcel sale price dynamics in 2015-2021.

Fig. 4. Distribution of the sale price values in the produced (Fig. 3) clusters.

Fig. 5. Separation of cadastral quarters into clusters according to the median land parcel sale price dynamics in 2015-2021, and the distance to the M-11 highway.

Fig. 6. Distribution of the sale price values in the produced (Fig. 5) clusters.

Fig. 7. Distribution of the values of distance to the M-11 highway in the produced (Fig. 5) clusters.