Optimization of the amount of transmitted data in parallel algorithms for iterative methods with a triangular preconditioner

Software for the mathematical modeling of the transport of substances in coastal systems has been developed in C++ using the MPI parallel programming technology. To calculate the dynamics of pollutant spread, the computational domain was decomposed so that the computation could be organized on the K-60 multiprocessor computer system at KIAM RAS. The system of grid equations obtained by approximating the problem was solved by iterative methods with a triangular preconditioner.


Introduction
Today, the issues of ecology and the preservation of the quality of coastal (especially fresh) and commercial waters are becoming more and more relevant. Environmental problems are associated not only with climate change and the loss of biological diversity, but also with increasing environmental damage from natural and man-made disasters that affect water bodies and their inhabitants. To preserve water complexes and maintain their integrity and life-supporting functions, it is important not only to adopt organizational, engineering and technical solutions, but also to have highly effective methods for modeling the potential and actual mechanisms of primary and secondary pollution of coastal systems. Based on interrelated high-precision models of hydrophysics and hydrobiology, such methods make it possible to predict quickly and efficiently the spread of pollution and the occurrence of hazardous phenomena in coastal systems [1,2].
To improve the performance of mathematical models based on solving diffusion-convection problems, it is necessary to include factors that have a significant impact on hydrobiological processes: parameterizable microturbulent diffusion and advective transport in various directions [3]. Performing the calculations on a multiprocessor computer system can significantly reduce the computation time.

Decomposition of the computational domain in one spatial direction
Let us describe a method for constructing a parallel algorithm for solving the problem of pollutant transport in the two-dimensional case. The computational domain is covered with a uniform rectangular grid [4]:

$$w = \{ x_i = i h_x,\ y_j = j h_y;\ 0 \le i \le N_x - 1,\ 0 \le j \le N_y - 1 \},$$

where $i, j$ are the indices of the computational domain, $h_x, h_y$ are the steps in the spatial directions, $N_x, N_y$ are the number of steps in the spatial directions, and $l_x, l_y$ are the characteristic dimensions of the computational domain.
At the nodes of the computational grid, the values of the water flow velocity field $u(x, y)$ are calculated; the grid also contains fictitious nodes. Let us decompose the computational domain along the direction $Oy$ by straight lines parallel to the axis $Ox$, and denote by $w_r$ the subdomain with number $r$, $0 \le r \le R - 1$, where $R$ is the number of subdomains into which the original domain is divided. The partition of the original domain is made in such a way that adjacent subdomains $w_r$ and $w_{r+1}$ intersect at two nodes along the direction perpendicular to the partition lines (Fig. 1, arrows show fictitious nodes).
To represent the field $u(x, y)$ in vector form, a pair of indices $i, j$ can be associated with a value $m$ that gives the ordinal number of the corresponding element of the vector $u$: $m = i + j N_x$, $0 \le m \le n - 1$, where $n$ is the length of the vector. This representation is convenient for describing and studying algorithms that solve grid equations by iterative methods.
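The index mapping $m = i + j N_x$ can be illustrated with a short C++ sketch. The function names `to_linear` and `to_grid` are hypothetical and chosen only for this illustration; the inverse mapping is not stated in the text but follows directly from the formula.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Ordinal number m = i + j * Nx of the grid node (i, j).
// Nx is the number of nodes along Ox (hypothetical parameter name).
std::size_t to_linear(std::size_t i, std::size_t j, std::size_t Nx) {
    assert(i < Nx);
    return i + j * Nx;
}

// Inverse mapping: recover the pair (i, j) from the ordinal number m.
std::pair<std::size_t, std::size_t> to_grid(std::size_t m, std::size_t Nx) {
    assert(Nx > 0);
    return {m % Nx, m / Nx};
}
```

For example, with `Nx = 10` the node `(2, 3)` is stored at position `to_linear(2, 3, 10)`, and `to_grid` applied to that position returns the same pair.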

Fig. 1. Decomposition of the computational domain
For the fragments $w_r$ obtained by decomposing the computational domain in one spatial direction, two parameters must be known: the initial index from which the corresponding fragment begins in the initial computational domain, and the width of the fragment $N_2^r$. The initial index is calculated by a formula that uses the floor function $\lfloor x \rfloor$, defined as the largest integer less than or equal to $x$, and the ceiling function $\lceil x \rceil$, defined as the smallest integer greater than or equal to $x$.
The width of the subregion $w_r$ is measured along the axis $Oy$.
Performing the calculations on a multiprocessor computer system can significantly reduce the computation time. However, the expected gain in time is not always achieved. In this case, it is appropriate to analyze the computation time theoretically on the basis of regression analysis [5-8].
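As an illustration of how the start index and width of a fragment might be computed, here is a minimal C++ sketch. It is an assumption, not the formula used in this work: it splits the `Ny` grid rows into `R` nearly equal blocks using integer (floor) division and then extends each block by one row toward every neighbour, so that adjacent fragments share two rows. The names `Fragment` and `fragment_of` are hypothetical.

```cpp
#include <cassert>

// Start index and width of a fragment along Oy (hypothetical type).
struct Fragment {
    long start;  // index of the first row of the fragment
    long width;  // number of rows in the fragment
};

// Assumed scheme: block partition by floor division, then one extra
// row on each interior side so that neighbours overlap in two rows.
Fragment fragment_of(long r, long R, long Ny) {
    assert(R > 0 && r >= 0 && r < R && Ny >= R);
    long first = (r * Ny) / R;        // floor(r * Ny / R)
    long last = ((r + 1) * Ny) / R;   // one past the last owned row
    if (r > 0) --first;               // extend toward the previous fragment
    if (r < R - 1) ++last;            // extend toward the next fragment
    return {first, last - first};
}
```

With `Ny = 10` and `R = 3`, the three fragments cover rows 0 to 3, 2 to 6, and 5 to 9, so each pair of neighbours shares exactly two rows.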
Figure 2 shows the dependence of the transfer time on the amount of data for different numbers of exchanges between the nodes of the computer system. The graph shows that the transfer time has a jump when the amount of data transferred is approximately 512 floating-point numbers. Let us denote this value by $N_{max} = 512$.

Fig. 2. Dependence of data transfer time on volume when working with a different number of computing nodes
The following parameters are usually used to theoretically evaluate the operation of computing systems:
─ $t_a$, the execution time of one arithmetic operation;
─ $t_l$, the time of organization of data transmission (latency);
─ $t_x$, the transmission time of one data element.

Calculation of latency time based on the least squares method
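Let $y_i$ be the $i$-th observation of the dependent variable, and let the vector $x_i = (x_{i1}, \dots, x_{ip})$ denote the explanatory factors. Then the multiple regression model can be represented as follows:

$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i, \quad i = 1, 2, \dots, n,$$

where $\beta_0$ is the free term and $\varepsilon_i$ is the term containing the error. In matrix form, $Y = X\beta + \varepsilon$, where $Y$ is a finite vector of dimension $n$ and $X$ is the matrix of values of the explanatory factors, of dimension $n \times (p + 1)$. The estimate of this model for some sample is the equation $Y = X\hat{\beta} + e$, in which $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p)^T$ and $e = (e_1, e_2, \dots, e_n)^T$.
To estimate the vector of unknown parameters $\beta$, the least squares method can be used.
The condition for minimizing the residual sum of squares can be represented as:

$$e^T e = (Y - X\beta)^T (Y - X\beta) \to \min. \qquad (6)$$

Performing the transformations in (6), we obtain

$$e^T e = Y^T Y - 2\beta^T X^T Y + \beta^T X^T X \beta,$$

since the product $Y^T X \beta$ is a matrix of dimension $1 \times 1$, that is, a scalar equal to its transpose $\beta^T X^T Y$. Differentiating with respect to $\beta$ and equating the result to zero gives the matrix equation

$$X^T X \beta = X^T Y,$$

where $X^T X$ is the matrix of sums of first powers, squares, and pairwise products of the $n$ observations of the explanatory factors, and $X^T Y$ is the vector of products of the $n$ observations of the explanatory factors and the dependent variable.
The solution of this matrix equation is the vector $\hat{\beta} = (X^T X)^{-1} X^T Y$, where $(X^T X)^{-1}$ is the matrix inverse to the matrix of system coefficients and $X^T Y$ is the vector of its free terms.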
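For a single explanatory factor, the matrix equation of the least squares method reduces to a closed-form expression, which the following C++ sketch evaluates. It assumes, for illustration only, that the transfer time is modeled as $t(m) = t_l + t_x m$, so the intercept plays the role of the latency and the slope the role of the transmission time per data element. The names `LinearFit` and `fit_line` are hypothetical, and the sample values in the usage note are synthetic, not measurements from this work.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Result of fitting y = intercept + slope * x (hypothetical type).
struct LinearFit {
    double intercept;  // estimate of the free term (latency in the assumed model)
    double slope;      // estimate of the coefficient (time per element in the assumed model)
};

// Least squares fit with one explanatory factor: solves the 2x2
// system X^T X beta = X^T Y written out in terms of sums.
LinearFit fit_line(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() >= 2);
    const double n = static_cast<double>(x.size());
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sx += x[i];
        sy += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    const double det = n * sxx - sx * sx;  // determinant of X^T X
    assert(det != 0.0);                    // needs at least two distinct x values
    const double slope = (n * sxy - sx * sy) / det;
    const double intercept = (sy - slope * sx) / n;
    return {intercept, slope};
}
```

Calling `fit_line` on synthetic points that lie exactly on the line $y = 3 + 2x$ recovers an intercept close to 3 and a slope close to 2. For several explanatory factors (for example, the data volume and the number of computing nodes), the same system would be solved with a general linear solver rather than the closed form.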
Knowing the vector $\beta$, any multiple regression equation can be represented as $\hat{Y} = X\beta$. When the operating time of the computing system is calculated, $y_i$ is the total running time, and the explanatory factors given by the vector $x_i$ are the size of the computational grid and the number of computing nodes used. This makes it possible to calculate the average running time of the entire system.
Based on the presented regression analysis, a linear dependence of the operating time of the software module that implements the parallel algorithm on the amount of transmitted data and the number of computing nodes involved was obtained (Fig. 3), for the cases where the amount of transmitted data is less than 512 elements (Fig. 3a) and more than 512 elements (Fig. 3b). The latency and the transmission time per data element are calculated using the least squares method.
To study the convergence of iterative methods for the system of grid equations $Ax = f$, where $x$ and $f$ are the vectors of the unknowns and of the right-hand side, respectively, that is, to establish the validity of the equality $\lim_{n \to \infty} \| x^n - x \| = 0$, where $x$ is the exact solution, it is advisable to write these methods in matrix rather than coordinate form.
Represent the matrix $A$ as the sum of three matrices: $A = A^- + D + A^+$, where $D$ is the diagonal part of $A$, and $A^-$ and $A^+$ are its strictly lower and strictly upper triangular parts. With this notation, the Jacobi method takes the following vector form:

$$D x^{n+1} + (A^- + A^+) x^n = f.$$

The Jacobi method can be rewritten as

$$D (x^{n+1} - x^n) + A x^n = f.$$

The Seidel method in vector form is

$$(D + A^-)(x^{n+1} - x^n) + A x^n = f.$$

These vector equalities are special cases of the canonical form of one-step (two-layer) iterative schemes:

$$B \frac{x^{n+1} - x^n}{\tau_{n+1}} + A x^n = f,$$

where $B$ is a square nonsingular matrix of the same order as $A$, called the preconditioner, and $\tau_{n+1}$ is a number called the iterative parameter.
The preconditioner $B$ of the alternating-triangular method can be written as

$$B = (D + \omega R_1) D^{-1} (D + \omega R_2),$$

where $R_1, R_2$ are the lower and upper triangular operators (matrices), $\omega > 0$ is a parameter of the method, and $D$ is some operator (for example, the diagonal part of the operator $A$).
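The difference between the Jacobi and Seidel methods in the vector forms above can be seen in a small sequential C++ sketch. It is an illustration under simplifying assumptions: the matrix is stored densely, whereas grid equations are sparse in practice, and the names `jacobi_step` and `seidel_step` are hypothetical. Neither function reproduces the parallel implementation of this work.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vector = std::vector<double>;
using Matrix = std::vector<Vector>;  // dense storage, for illustration only

// One Jacobi step: D x_new = f - (A^- + A^+) x_old.
// Every component uses only values from the previous iteration.
Vector jacobi_step(const Matrix& A, const Vector& f, const Vector& x) {
    const std::size_t n = x.size();
    assert(A.size() == n && f.size() == n);
    Vector next(n);
    for (std::size_t i = 0; i < n; ++i) {
        double sum = f[i];
        for (std::size_t j = 0; j < n; ++j) {
            if (j != i) sum -= A[i][j] * x[j];
        }
        assert(A[i][i] != 0.0);
        next[i] = sum / A[i][i];
    }
    return next;
}

// One Seidel step: (D + A^-) x_new = f - A^+ x_old.
// Components already updated in this sweep are reused immediately.
Vector seidel_step(const Matrix& A, const Vector& f, Vector x) {
    const std::size_t n = x.size();
    assert(A.size() == n && f.size() == n);
    for (std::size_t i = 0; i < n; ++i) {
        double sum = f[i];
        for (std::size_t j = 0; j < n; ++j) {
            if (j != i) sum -= A[i][j] * x[j];
        }
        assert(A[i][i] != 0.0);
        x[i] = sum / A[i][i];
    }
    return x;
}
```

Because the Seidel sweep depends on values computed earlier in the same sweep, its parallel version needs data exchanges inside an iteration, which is why the volume of each transmission matters for this method.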

Estimation of acceleration of parallel algorithms for solving grid equations by iterative methods based on MPI
Fig. 4 shows a comparison of the acceleration of the parallel algorithm depending on the amount of transmitted data on a grid of 10,000 × 10,000 computational nodes using the Seidel method. The calculation time was measured for transmissions with a volume of 5, 10, 50, 100, 500, and 1000 elements. The greatest acceleration was observed with a transmission volume of 100 elements. With a further increase in the volume of received and transmitted data, the speed of calculations began to decrease. This is because, with large transfer volumes, the overhead of exchanges between computing nodes grows and is no longer justified. The time spent on the parallel implementation of one iteration of the Seidel method is expressed in terms of the amount of transmitted data $m$, the number of blocks $Q$, the step number $s$, and the time of organization of data transfer (latency) $t_l(s)$.
Taking the derivative of this time with respect to $m$ and equating it to zero gives the optimal transmission volume. The optimal transmission volume for the modified alternating-triangular method [9] is calculated similarly.
Fig. 5 shows a comparison of the theoretical and practical values of acceleration in the case of the optimal transmission volume.

Fig. 5. Comparison of theoretical and practical acceleration values in the case of optimal transmission volume
Fig. 6 shows a comparison of the acceleration for the Seidel method with the optimal amount of transmitted data and for the Jacobi method, depending on the number of computing nodes. The calculations were made on a grid of 1000 by 1000 cells. The launches were carried out sequentially, starting with one computing node and ending with all available nodes.

Fig. 6. Variability of acceleration of Seidel and Jacobi methods as a function of the number of computational nodes
It can be seen from the figure that the acceleration of the algorithm that implements the Seidel method is not significantly inferior to the acceleration of the algorithm for the Jacobi method.
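The acceleration values compared here can be computed from measured run times as in the following C++ sketch. It assumes the usual definitions, which the text does not state explicitly: acceleration is the ratio of the run time on one node to the run time on $p$ nodes, and efficiency is the acceleration divided by $p$. The function names are hypothetical, and the numbers in the usage note are synthetic.

```cpp
#include <cassert>

// Acceleration (speedup): run time on one node divided by
// run time on p nodes (assumed standard definition).
double acceleration(double time_one_node, double time_p_nodes) {
    assert(time_one_node > 0.0 && time_p_nodes > 0.0);
    return time_one_node / time_p_nodes;
}

// Efficiency: acceleration per computing node.
double efficiency(double time_one_node, double time_p_nodes, int p) {
    assert(p > 0);
    return acceleration(time_one_node, time_p_nodes) / p;
}
```

For instance, a synthetic run that takes 100 s on one node and 25 s on 8 nodes has an acceleration of 4 and an efficiency of 0.5.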

Conclusion
In the course of this work, a software package was developed for calculating the transfer of matter in a shallow reservoir on various computational grids. Theoretical estimates of the latency were made. The parallel algorithms implemented in the software package target the K-60 multiprocessor computer system at KIAM RAS. They can significantly reduce the running time of the software package for large amounts of input data. A number of experiments were carried out with different transfer volumes and a varying number of computing nodes, and optimal volumes of transmitted data were obtained. The presented software package can be used to study the processes of pollutant transfer in natural and technological systems.

Fig. 3. Dependence of data transfer time on volume: a) data volume up to 512 elements; b) data volume more than 512 elements

Fig. 4. Comparison of acceleration depending on the amount of data transferred.