Development of explicit algebraic models of Reynolds stresses for flows in channels using gene expression programming

. To build a data-driven approximation for the Reynolds-stress anisotropy (RSA), the symbolic regression method of gene expression programming (GEP) is applied. Two tensor-basis terms from the algebraic expansion for the RSA tensor are used in the GEP algorithm. A new RANS-GEP model is tested in several flows in channels without/with bumps at different physical and geometrical parameters, where DNS data of high fidelity involved as a target for RSA are available. The results of RANS-DNS runs are also obtained where the RSA values in the mean momentum equation are taken directly from DNS to show ability to improve the model performance versus the conventional linear eddy viscosity model (LEVM). Next, the training with carefully selected input features is performed to get an explicit non-linear algebraic model for RSA. The results of RANS-GEP study show potentials of the new tool to improve predictions.


Introduction
Application of eddy-resolving techniques (DNS, LES, measurement), which generate high-fidelity (HF) data, to industrial and environmental problems is restricted by high computing cost.In contrast, RANS closure models are computationally efficient so still contribute greatly to practical flow predictions [1], but often give visible deviations from HF data.Modification to RANS models improve predictions, however its discrepancy from HF data remains considerable [2].Machine learning (ML) methods can help in data-driven turbulence modelling to enhance the model performance [3].
So far, several ML methods have been proposed to enhance RANS models: tensor basis neural network (TBNN) [4], field inversion and machine learning [5], physics-informed machine learning [6], convolutional neural network [7], tensor basis random forest (TBRF) [8,9], universally interpretable machine learning [10], tensor basis gradient boosting [11].Its workability is inevident due to the black-box character of ML tools.On the other hand, to build explicit non-linear relations for the RSA tensor from data, symbolic regression can be used [12][13][14][15].In particular, multi-dimensional gene expression programming (MGEP) [12,13] and sparse regression [14] have been built as symbolic regression techniques to show possibilities to enhance the model performance in flows in channels of different geometry.
In the present work, the first two tensor-basis terms for the RSA components are taken to generate a new algebraic model by MGEP in canonical channel flows with periodic hills (PH) at different slopes and Reynolds numbers, and that in a square duct (SD).It has extensive data of DNS, LES, RANS computations [16][17][18][19][20] to be used for input features and for a comparative analysis.

Methodology
To close the RANS equations, the non-linear expression for the Reynolds stress tensor τij has been generated, for calibration of which, the MGEP tool [12] can be taken.It allows one to write RSA, aij 1 2 where gn(λ1, λ2) are linear and quadratic functions of scalar invariants λ1 = sij sji and λ2 = rij rji , tensors T n are functions of the strain (sij) and rotation (rij) tensors.Note that the first term, bij = -sij, which is the LEVM model, is written separately, and the k-ω model [21] is used to find the turbulent kinetic energy k, and its dissipation ω.
The tensor basis {T 1 , T 2 } is shorten in comparison to other MGEP techniques having taken three to four terms {T 1 , …, T 4 } [12,13] or to the TBNN and TBRF tools with the complete basis {T 1 , …, T 10 } [4,8] since the first two terms contribute more than others [10].
The MGEP algorithm can be applied to handle with scalar, vector and tensor fields, for example to generate an explicit RSA relation based on HF data for canonical turbulent flows, following the steps: • Define rates of genetic operations; choose a fitness function to evaluate the offspring, set the population size, the gene head length, the number of generations (iterations), those of genes in chromosome and plasmid.
• For the first generation, randomly generate the population of multi-gene chromosomes.
E3S Web of Conferences 459, 02005 (2023) https://doi.org/10.1051/e3sconf/202345902005XXXIX Siberian Thermophysical Seminar • Iteration start: find the fitness of each individual and then choose the individuals satisfying the criteria; besides, collect the best individual separately.
• Apply genetic operations for chosen chromosomes, except the best chromosome.
• Gather plasmids of chromosomes, which encounter the genetic operations; exploit the genetic operations for the plasmids; distribute them to chromosomes.
• Produce the new generation using the chosen chromosomes and the best chromosome.
• Iteration end: go back to iteration start until the finishing the iterations due to their prescribed number.
Eight genetic operations to optimize chromosomes from generation to generation, are: mutation, inversion; one-point, two-point, gene recombination; transposition of insertion sequence elements, root/gene transposition.
More details of MGEP can be found in [12,22].
For target values taken in the fitness function, the HF data for components of non-dimensional or dimensional RSA (bij or aij) are considered.
Input features (scalars λn, tensors T n ) can be either from HF data as in [12], or from RANS solutions as in TBNN, TBRF [4,8], or from both HF data and RANS solutions [13,14].

Training and testing cases
The DNS data [17] in a PH flow at Re = UbH/ν = 5600 (H is hill height, Ub is bulk velocity) and slope a =1 have initially been taken for training, with the baseline k-ω model results found using OpenFOAM [18].For testing, PH flows at other slopes a = 0.5, 1.5 are available too [17,18], as well as the LES data at Re = 10595 [16].
Next, a developed flow in a square duct at different Reynolds numbers Re = UbD/ν = 10320 (where D is duct width) [19] and 2154 ≤ Re ≤ 7000 [20] are examined as training and testing cases to build MGEP models.

Results
The way of [12] (called MGEP-WS), with quantities in (1) from the available HF data, is realized first (Fig. 1).Second, according to [23], the LEVM term, -2(k/ω)sij, in (1) is defined from the basic k-ω model, whereas the remaining terms of the basis {T 1 , T 2 } are taken from the HF data.Such a way is called MGEP-AFA (Fig. 1).
At the beginning, ten different runs were made using the MGEP-WS algorithm with input features taken from DNS [17], where ten different bij approximations were obtained.Here Ns = 148 data points located at y/H < 1.58 were selected from the whole data set having 14960 points, to save computing resources (increasing greatly for MGEP with Ns).Note that even smaller Ns = 100 was taken in [12] for the training dataset.
Moreover, the population size (i.e.number of individuals) has been set to 300 which is 1.5 times higher than that in [12,13].Finally, number of generations (iterations) was assigned to Ns = 1000 which is between 300 as in [12] and 1500 as in [13].Such a number of Ns was enough to get the convergence of the fitness function.
The gene head length was chosen to be 3 due to recommendation of the original GEP approach [22], whereas numbers of genes in chromosome and plasmid were taken to be 2 which corresponds as in [12] to numbers of basis tensors {T 1 , T 2 } and scalar invariants (λ1, λ2).
Then, five of ten expressions (having a minimum root mean square error (RMSE) determined from the difference between MGEP and DNS data for bij [4]) were chosen, and the average expression for the RSA tensor was found with the following functions A priori estimation, i.e. substitution of the data of [17] into the LEVM model, where g (1) = g (2) = 0, and into the MGEP model, demonstrates (Table 1) the ability of ML techniques to enhance the baseline turbulence model.Next, the same procedure was performed to train the MGEP model for aij with input features from DNS [17] collected in the shortened data set as mentioned above, fifty bij approximations were obtained, then six were selected with minimum RMSE values, and the average expression was generated with the following functions g (1) = 0.095 + 1.165λ1 − 0.142λ2 − 0.167λ1λ1 + 0.167λ1λ2 (4) g (2) = −1.524+ 1.842λ1 − 0.083λ2 − 0.167λ1λ1 (5) Again, a priori estimation shows (Table 1, Fig. 2) the ability of ML tools to enhance the LEVM model.
However, after propagation of the expressions for both bij and aij into the momentum, k and ω equations (in the modified simpleFoam solver in OpenFOAM), no visible improvements were observed (Fig. 3).A possible reason of such a disadvantage is that using the HF data for training as in [12] may lead to inconsistence after its propagation, since the k and ω values generated by the k-ω model are quite far [7,9] from their DNS and LES counterparts used for training.So we have examined other ways to implement the MGEP techniques.4), (5).
First, the RANS-DNS runs (denoted also as RDNS) are made for both PH and SD flows, where the 'frozen' aij values taken directly from [17,19] are inserted into the RANS equations, and the velocity prediction is visibly improved versus that by LEVM (Fig. 4-5).
Note that the simple RSA propagation with the 'explicit' treatment of the 'frozen' aij values yields spurious waves on the distributions and often leads to divergence seen by us and in [6,8,23].Instead, we apply the 'implicit' treatment for the RSA propagation, which gives stable computations [6,8,23] and implies splitting where k and ω are obtained from the standard k-ω model [23], and the last term is frozen from DNS data [17].
Then, a posteriori study is performed by inserting the RSA model obtained by MGEP-AFA into a RANS solver (such a way follows the 'implicit' RSA treatment in RANS-DNS [23]) and testing it for the PH and SD cases at different physical and geometrical parameters.
In particular, for the SD case at Re = 10320 (with input features from DNS [19] collected in the shortened data set with Ns 145 points from the duct crosssection), the MGEP model which turns to have a quite simple form after its propagation into the momentum, k, ω equations, yields improvements towards the target data, although under-prediction of lateral velocity in a secondary flow can still be observed (Fig. 5).

Conclusions
The present study explores different ways to build the explicit non-linear algebraic models for the Reynolds stress anisotropy tensor by means of the ML technique of multi-dimensional gene expression programming.
The visible improvement of predictions can indeed be achieved using the MGEP models for flows in channels of different geometries.
Ways to build turbulence models by ML methods for other complex-geometry turbulent flows having lateral and longitudinal obstacles, with heat and mass transfer, is to be considered too in the future studies.

Fig. 5 .
Fig. 5.The vertical (10 3 V) and horizontal (U) mean velocity components in a SD flow at Re = 10320 used for training.

Table 1 .
[17] values relative to the DNS data[17], obtained from a priori estimation for PH flow.