Optimization of production and transport infrastructure based on cluster analysis methods

To solve optimization problems for production and transportation systems, a procedure for clustering objects is proposed. The procedure is a universal methodology for dividing a set of objects into subsets whose centers possess optimal properties. The point proximity metrics used in cluster analysis model the minimization of distances during transportation. If the volume of containerisable products produced or extracted at a production point is taken as the “weight” of each point, then the problem of minimizing transportation costs can be solved as a problem of optimizing clusters and their centers. A set of analytical models has been developed to describe and optimize the choice of location and number of container terminals (CT) at the first level and container storage and distribution centers (CSDC) at the second level of a two-level terminal and logistics infrastructure model of the container transport system (CTS). New clustering algorithms are proposed to determine the locations of CT and CSDC from the condition of minimizing transportation costs and the costs of creating the terminal and logistics infrastructure, for either a given or an arbitrary number of clusters.


Problem statement
The research concepts of transport systems have not changed significantly to date and are based on the approach to transport as a supporting system of the country's economy.
Historically, the leading link in the Russian transport system is railway transport. Its effective functioning plays an exceptional role in creating the conditions for modernization, transition to an innovative development path and sustainable growth of the national economy. The transport component in the final price of the product is an important factor when selecting the carrier and transport type by the cargo owner and ultimately depends on the condition and development of the transport system, the availability of infrastructure, changes in tariffs, and the offered logistics service.
In this paper, in order to remove the restrictions associated with insufficient development of the transport infrastructure, to support further growth of the country's economy, and to identify development reserves, a model of a container transport system that defines transport and economic relations is considered, based on the solution of a production-transport type problem. This model makes it possible to address the cost-effective distribution of produced (extracted) containerisable products among consumers, taking into account transport costs, the costs of developing terminal and logistics facilities, and other factors.
To solve the problem most efficiently, it is necessary to take into account the volumes of production and consumption together with their given geoinformation parameters. The throughput capacity of transport systems, the cost of transport services, and the costs of developing the terminal and logistics facilities necessary for processing the planned cargo flow should also be considered. Accordingly, the optimality criterion is the total cost associated with production, the development of the throughput capacities of transport facilities, and the transportation process. The following additional criteria can be taken into account when stating and solving the problem: the presence of international transport corridors passing through the territory of a particular region; the level of container appeal of the region; infrastructure readiness, etc.
As a rule, production and transportation models are reduced, when solved, to problems of the transport type (matrix and network transport problems), linear programming, integer programming, and network planning [1][2][3][4][5][6][7][8][13]. However, as analysis has shown, a feature of these problems is their large dimension, which does not always allow them to be solved in a nonlinear setting. These models are usually reduced to linear problems by linear approximation.
Solving optimization problems of choosing the location of terminal and logistics facilities using graph models and mathematical programming under many combinatorial constraints leads to complex computational procedures of an exhaustive-search nature, which prevents their use at the scale of federal districts or the whole country. An analysis of existing practical methods [4][5][6][7][8][9][10][13] showed that they do not take into account the number of objects, their location relative to industrial production, the volume of cargo flows from individual consignors and consignees, or the existing railroad topology.
In view of the foregoing, a procedure for clustering objects is proposed to solve the problems of optimizing production and transportation systems. A universal methodology for dividing a set of objects into subsets whose centers possess optimal properties was implemented. In this methodology, the proximity metrics of points used in clustering model the minimization of distances during transportation, and if the volume of products produced or extracted at a production point is adopted as the "weight" of each point, then the problem of minimizing transportation costs can be solved as a problem of optimizing clusters and their centers.

Two-tier model of container transport system
The conceptual model of the country's CTS is proposed in [3,23,24]. The main idea is the formation of a two-tier infrastructure with service centers at each tier (Fig. 1). This will make it possible not only to create and sell customer-oriented transport products, such as organized container trains, but will also predetermine a new stage in the development of the CTS. A two-tier CTS will concentrate the volumes of containerisable products necessary for the formation of container trains, eliminate long periods of their accumulation, and increase the delivery speed of goods in containers. At the 1st tier, the CT network collects and concentrates goods from production points. Here the task of choosing CT locations arises (linking production points to their CTs). At the 2nd tier, the CSDC network concentrates the container flows received from the CTs and forms accelerated container trains going to other CSDCs. This also raises the problem of selecting the locations of the CSDCs from the optimality condition of some economic cost criterion.
The mathematical model of such a two-tier network assumes that the geographical coordinates of the production points $(x_i, y_i)$, the cargo volumes of these production points $v_i$, and the coordinates of the regional railway stations $(x_j, y_j)$ are given. This allows the problem of the optimal location of CT and CSDC to be stated as a two-tier clustering problem, where the criteria are the total costs of freight transportation and the costs of creating a network of CT and CSDC. In the current research, the well-known MacQueen k-means clustering algorithm [20] was taken as the basis; for a given number of clusters, it finds clusters and their centers by the criterion of the minimum sum of squared distances from points to their centers.
To solve the problem of choosing the location of CT or CSDC, the cluster centers must lie not at arbitrary geographical points but at one of the given railway stations. Thus, a new algorithm called k-means pro is proposed, in which the resulting geometric center is projected to the nearest railway station at each iteration.

K-means pro algorithm
Here is a description of the k-means pro algorithm. Let the initial set of objects to be clustered, $J$ ($j = 1, \dots, n$), be characterized by the coordinates $X = \{x_1, \dots, x_n\}$ and the weights $V = \{v_1, \dots, v_n\}$, and let the admissible set of projections $P$ ($r = 1, \dots, p$) be characterized by the coordinates $Y = \{y_1, \dots, y_p\}$. Each $j$-th object and each $r$-th admissible projection point is given in the $G$-dimensional space $\mathbb{R}^G$, i.e., $x_j = (x_{j1}, \dots, x_{jG})$ and $y_r = (y_{r1}, \dots, y_{rG})$. Let us denote the division of the original set into $k$ clusters as a set of subsets $S = \{S_1, \dots, S_k\}$. The parameter $k$, the number of clusters into which the set $X$ is divided, is given. As a result, it is necessary to obtain the optimal partition $S^*$, whose centers form the optimal set of projections $C^* \subseteq Y$. The following notation is used: $n$ is the number of clustering objects, $p$ is the number of points of the admissible set of projections, $i$, $i'$ are cluster numbers, $j$ is the number of the object, $r$ is the number of a point of the set of projections, $l$ is the number of the point coordinate, $m$ is the current iteration, and $G$ is the dimensionality of the space in which clustering is performed.
The distance between the points $t_1$ and $t_2$ in the $G$-dimensional space according to the Euclidean metric is as follows:
$$d(t_1, t_2) = \sqrt{\sum_{l=1}^{G} (t_{1l} - t_{2l})^2}.$$
Algorithm:
1. Select the initial division $S^0$.
2. Calculate the set of means for the current partition $S^m$:
$$e_i^m = \frac{1}{n_i} \sum_{x_j \in S_i^m} x_j, \quad i = 1, \dots, k,$$
where $n_i$ is the number of points of the $i$-th cluster.
3. Calculate the set of projections of the means for the current partition:
$$c_i^m = \arg\min_{y_r \in Y} d(e_i^m, y_r), \quad i = 1, \dots, k.$$
4. Calculate the partition generated by the set $c^m = \{c_1^m, \dots, c_k^m\}$ and take it as $S^{m+1}$.
The functional $F(S)$ does not increase on the sequence of divisions $S^0, S^1, \dots, S^m, \dots$, and $F(S^m) = F(S^{m+1})$ only if $S^m = S^{m+1}$. Therefore, the algorithm terminates after a finite number of steps for any initial division $S^0$. In our case, the achieved value of the optimization criterion is computed for the found centers $c_i^*$. Each run of the algorithm yields a local minimum of $F(S)$, and the clustering result depends on the selection of the initial standard $e^0$. The coordinates of $e^0$ can be obtained in various ways. For example, they can be taken as random numbers uniformly distributed over the possible coordinates of the starting points. To check the stability of the results and to obtain various averaged dependencies, the choice of $e^0$ can be varied.
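The steps of k-means pro can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the random choice of the initial standard among the admissible points, the optional use of the weights when averaging, and the handling of empty clusters are all illustrative choices that the text does not specify.

```python
import math
import random


def euclid(a, b):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def k_means_pro(X, Y, k, weights=None, seed=0, max_iter=100):
    """Cluster the objects X with centers restricted to the admissible points Y.

    Returns (labels, centers): labels[j] is the cluster number of object j,
    centers[i] is the index in Y of the center of cluster i.
    """
    rng = random.Random(seed)
    # Assumption: with weights=None every object counts equally (plain mean).
    w = weights if weights is not None else [1.0] * len(X)
    dims = len(X[0])
    # Assumption: the initial standard is k distinct admissible points drawn at random.
    centers = rng.sample(range(len(Y)), k)
    labels = None
    for _ in range(max_iter):
        # Partition step: tie every object to its nearest current center.
        new_labels = [
            min(range(k), key=lambda i: euclid(x, Y[centers[i]])) for x in X
        ]
        if new_labels == labels:
            break  # the partition repeated, so the algorithm has converged
        labels = new_labels
        for i in range(k):
            idx = [j for j in range(len(X)) if labels[j] == i]
            if not idx:
                continue  # assumption: an empty cluster keeps its old center
            total = sum(w[j] for j in idx)
            mean = tuple(
                sum(w[j] * X[j][l] for j in idx) / total for l in range(dims)
            )
            # Projection step: snap the mean to the nearest admissible point.
            centers[i] = min(range(len(Y)), key=lambda r: euclid(mean, Y[r]))
    return labels, centers
```

For example, `k_means_pro(points, stations, 2)` with two well-separated groups of points returns one station index per group. Two clusters may project to the same station; the sketch does not resolve that case.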
It should be noted that solving the 2nd tier problem, which is selecting the location of the CSDC, is also done using this clustering algorithm. Here the points for clustering are CTs, and the resulting cluster centers determine the places of developing the CSDC. For the final choice of locations for the CSDC, it is advisable to supplement the analysis of these already selected locations based on additional factors [23,24].
It is natural to assume that the number of clusters is also unknown and K must be determined according to the condition of minimum costs.
It was shown in [23][24][25] that with an increase in K, the F value decreases sharply at first, and then more slowly. That is, transportation costs are reduced with an increase in the number of constructed CTs and CSDCs (Fig. 2). But at the same time, the costs of developing CTs and CSDCs will increase. The total costs associated with developing the first tier of the container transport system can be expressed as follows:
$$E(K) = F(K) + cK, \quad (8)$$
where $c$ is the standardized reduced cost of developing one CT.
A similar formula will be used for developing a second tier of CTS. The function E(K) is shown in Fig. 3. A minimum of total costs can be obtained by varying the K value, for which it is necessary to solve the problem again for each K [25].
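The search over K can be illustrated with a short Python sketch. It assumes that the total cost is the transportation cost F(K) plus a development cost c for each of the K facilities; the function name `best_cluster_count` and the numbers in `example_F` are made up for illustration and are not results from the paper. In practice each F(K) would come from rerunning the clustering for that K.

```python
def best_cluster_count(transport_cost, c, k_values):
    """Return (K, E(K)) minimizing E(K) = F(K) + c * K over k_values.

    transport_cost: callable giving F(K), the transportation cost for K clusters.
    c: development cost of one facility (assumed constant per facility).
    """
    totals = {K: transport_cost(K) + c * K for K in k_values}
    best_k = min(totals, key=totals.get)
    return best_k, totals[best_k]


# Hypothetical F(K) values that fall sharply at first and then more slowly.
example_F = {1: 100.0, 2: 60.0, 3: 45.0, 4: 40.0, 5: 38.0}
best = best_cluster_count(example_F.get, c=10.0, k_values=range(1, 6))
# best is (3, 75.0): beyond K = 3 the saving in F no longer covers c.
```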
Based on the developed software, calculations were carried out for the production and transport complex of the Volga Federal District of Russia, and several specific tasks for creating a two-tiered CTS were considered.

Clusterization algorithm based on proximity matrix
Despite the great advantages of the cluster approach and the developed k-means pro algorithm (mathematically justified optimality and a slow growth of computational complexity with problem dimensionality), the suggested model has a drawback that makes it difficult to put into practice. During optimization, the efficiency function uses the Euclidean distance between two points $i$ and $j$ as the proximity measure:
$$d_{ij} = \sqrt{\sum_{l=1}^{G} (x_{il} - x_{jl})^2}.$$
The actual distance between the real production points and the stations does not coincide with the Euclidean distance, because the latter does not take into account the tortuosity of roads, the presence of rivers and bridges, traffic congestion, the need for increased delivery speeds, and more. Other metrics, in particular the "city-block distance" (Manhattan distance), are likewise unsuitable for road and rail traffic.
A discrete model for creating a CTS is developed and solution algorithms eliminating this drawback are suggested.
Let us assume that a two-tier CTS is determined by the set of production points $I$ with numbers $i = 1, \dots, m$, the volumes of containerisable products $v_i$, the set of stations $J$ with numbers $j = 1, \dots, n$, and the proximity matrix $D = \{d_{ij}\}$. Each element of this matrix expresses the value of the criterion by which the costs (losses) of transporting goods from $i$ to $j$ are measured. In the simplest case, it can be the distance in kilometers, but the whole variety of factors affecting the costs of transportation from $i$ to $j$ can be taken into account (qualitative characteristics of the delivery route, non-linearity in volume, delivery speed, seasonality, etc.). In what follows, the quantity $d_{ij}$ is called the distance.
The task of choosing the locations of the cluster centers is then stated as follows. The disjoint subsets of the set $I$, the clusters $S_l$, need to be determined so that the total weighted distance from the production points to their cluster centers, which belong to the set of stations $J$, is minimal:
$$F = \sum_{l=1}^{K} \sum_{i \in S_l} v_i d_{il} \to \min,$$
where $d_{il}$ is the distance between the $i$-th production point, which belongs to the $l$-th cluster, and the railway station that is the center of the $l$-th cluster.
The value of $F$ depends on the division into clusters $S_l$ and on the selection of the railway stations ($K$ in total) that serve as the centers of the clusters.
Let us consider a clustering algorithm based on the distance matrix $D = \{d_{ij}\}$ for each fixed $K$. The initial data for the algorithm are the following: the set of production point numbers $I$, $i = 1, \dots, m$; the set of railway station numbers $J$, $j = 1, \dots, n$; the distance matrix $D = \{d_{ij}\}$; the "weights" of the production points $v_i$, representing, for example, the volume of containerisable products produced or extracted at the $i$-th production point; and the station distance matrix $B = \{b_{jj'}\}$.
Algorithm:
1. $K$ stations are selected and declared the centers at the first iteration (the standard). The choice of this first standard, strictly speaking, affects the result of the algorithm. In general, as suggested in [25], the composition of the standard can be selected randomly (a more advanced method of constructing the standard is considered below).
2. For each production point $i$, the center from the current set of centers to which the distance $d_{il}$ is minimal is determined. Thus, all production points are tied to centers and the clusters $S_l$, $l = 1, \dots, K$, are obtained.
3. For each cluster $l$, a new center is found: the railway station $j$ from the set $J$ for which the total weighted distance to the points of the cluster is minimal,
$$d_j = \sum_{i \in S_l} v_i d_{ij} \to \min_{j \in J}.$$
Thus, new centers for each cluster are found.
4. Having obtained the new centers, proceed to step 2 and obtain new clusters, and so on. Steps 2 to 4 are repeated until the resulting clusters and their centers begin to repeat.
The end of the algorithm.
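The matrix-based algorithm can be sketched in Python as below. This is an illustration under stated assumptions rather than the authors' software: the function name `cluster_by_matrix`, the iteration cap `max_iter`, and the rule that an empty cluster keeps its previous center are hypothetical choices. The matrix `D` is indexed as `D[i][j]` for production point `i` and station `j`.

```python
def cluster_by_matrix(D, v, centers, max_iter=100):
    """Cluster production points using a proximity matrix.

    D[i][j]: cost ("distance") from production point i to station j.
    v[i]: weight of production point i.
    centers: initial list of K station indices (the first standard).
    Returns (assignment, centers, F), where assignment[i] is the cluster
    number of point i and F is the total weighted distance.
    """
    m, n = len(D), len(D[0])
    centers = list(centers)
    K = len(centers)
    assignment = None
    for _ in range(max_iter):
        # Step 2: tie each production point to its nearest current center.
        new_assignment = [
            min(range(K), key=lambda l: D[i][centers[l]]) for i in range(m)
        ]
        if new_assignment == assignment:
            break  # clusters repeated, so the centers will repeat as well
        assignment = new_assignment
        # Step 3: per cluster, pick the station with minimal weighted distance.
        for l in range(K):
            members = [i for i in range(m) if assignment[i] == l]
            if members:  # assumption: an empty cluster keeps its old center
                centers[l] = min(
                    range(n), key=lambda j: sum(v[i] * D[i][j] for i in members)
                )
    F = sum(v[i] * D[i][centers[assignment[i]]] for i in range(m))
    return assignment, centers, F
```

Because the centers are chosen directly from the station set and only matrix entries are compared, any cost measure can be placed in `D` without changing the procedure.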

The algorithm for obtaining the first standard
The search for the best solution is more systematic if the initial centers are as far apart as possible.
To improve convergence, the following algorithm for obtaining the first standard can be used.
Algorithm:
1. Choose a random station from the set of railway stations $J$ and find the railway station farthest from it, i.e. the one with the largest $b_{jj'}$. These two railway stations are the first elements of the initial standard.
2. For each railway station not included in the standard, find the nearest railway station of the standard using the $b_{jj'}$ values. Clusters of stations are thus obtained, one per element of the standard (two at the first pass).
3. In each such cluster, find the railway station farthest from the cluster's element of the standard; the station whose distance to the standard is the largest among all clusters is included in the standard. The third railway station of the standard is thus determined.
4. Repeat steps 2 and 3 to obtain the 4th, 5th, and so on up to the k-th railway station of the standard.
The end of the algorithm.
Figure 4 shows the results of testing this algorithm in comparison with a random selection of the standard. It can be seen from the comparison that, on average, the number of iterations of the main clustering algorithm decreases when the algorithm for obtaining the first standard is used.
The developed applied toolkit, a software product for designing the terminal and logistics infrastructure, implements the proposed models and algorithms for the optimal selection of locations for the CTS infrastructure, supporting its organization and cost-effective functioning under various optimization criteria. It also produces both tabular digital data for outline design and a graphic image of the locations of the designed terminal and logistics facilities on a territory map.
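Returning to the algorithm for obtaining the first standard: its steps amount to repeatedly adding the station whose distance to the nearest station already in the standard is the largest. A minimal Python sketch is given below. The function name `first_standard`, the `start` and `seed` parameters, and the use of a station-to-station matrix `B` indexed as `B[j][s]` are assumptions made for illustration.

```python
import random


def first_standard(B, k, start=None, seed=0):
    """Build an initial standard of k mutually distant stations.

    B[j][s]: distance between stations j and s.
    start: index of the first station; drawn at random when omitted.
    """
    n = len(B)
    if start is None:
        start = random.Random(seed).randrange(n)
    standard = [start]
    while len(standard) < min(k, n):
        rest = [j for j in range(n) if j not in standard]
        # For each remaining station take the distance to its nearest
        # standard station, then add the station where this is largest.
        farthest = max(rest, key=lambda j: min(B[j][s] for s in standard))
        standard.append(farthest)
    return standard


# Four hypothetical stations on a line at kilometre marks 0, 1, 5 and 10.
marks = [0, 1, 5, 10]
B_example = [[abs(a - b) for b in marks] for a in marks]
chosen = first_standard(B_example, 3, start=1)
# chosen is [1, 3, 2]: first the station at 10 km, then the one at 5 km.
```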
Thus, the proposed model and algorithms for solving the CTS optimization problem by cluster analysis methods, intended for the long-term planning and development of the terminal and logistics infrastructure, make it possible to optimize the placement of CTS infrastructure facilities and to determine their required number and capacity while minimizing the cost of transporting goods and the required investments.