Comparative analysis of experimental data clustering in MATLAB and Python environments

Abstract. The paper compares normalization and clustering methods for processing experimental data in the MATLAB package and in a Python program that uses the scikit-learn library. Recommendations are given on the further use of Python programs and on the selection of normalization and clustering methods. The choice of clustering methods for further research is discussed.


Introduction
During the growing season, plants are exposed to various environmental influences. The changes that occur under the action of these stressors are accompanied by a change in the ionic conductivity of plant cells and, as a result, a change in the biopotentials involved in the plant's vital activity. The level of plant resistance to different stressors has been shown to be genetically determined and can change during ontogenesis. Therefore, by analyzing changes in plant biopotentials, it is possible to assess the resistance of plants to stressors (in particular, high and low temperatures), develop quantitative criteria for intervarietal differences, and divide plants into groups (clusters) with different or similar resistance characteristics. This applies especially to cereal crops, when different varieties are evaluated at the stage of their creation.

Problem definition
In this work, the task is to identify plants with the same level of resistance to high- or low-temperature effects (that is, to divide them into two groups by variety) by means of cluster analysis of experimental biopotential data for two wheat varieties, as described in [1]. The next step is to compare the results of clustering the same set of experimental data by algorithmically identical (where possible) methods in the MATLAB and Python environments. Data preprocessing (normalization) is carried out using the tools available in the selected environment.
The research was conducted using the experimental setup whose structural diagram is shown in Figure 1, on 7- to 14-day-old seedlings of the wheat varieties "Novosibirskaya 18" and "Omskaya 18" under the influence of high (up to 40 °C) and low (down to 8 °C) temperatures. In the course of the experiments, each sample was subjected to temperature effects twice; changes in the biopotential signals were recorded at three points of the seedling using a three-channel IPL 113 multimeter connected to a PC, and the temperature change signals were generated by the automated system "AutoExpI" [2]. The biopotential realizations obtained for the "Omskaya 18" wheat seedlings under the influence of high temperature are shown in Figure 2. From the entire array of collected data, those that make it possible to identify the distinctive features of the plants were selected. These data for the different wheat varieties were combined in pairs for comparison and are shown in Tables 1 and 2. Taking a wheat variety with a resistance level known from varietal tests as the reference, it is necessary to distinguish, within the total population of studied plants, the plants of the same variety that react to temperature changes in the same way.

Theory
Samples were clustered in the MATLAB environment using clusterdata (hierarchical clustering) [3], kmeans (k-means method) [4], and fcm (fuzzy c-means clustering) [5]. Before these methods were applied, the previously calculated parameters of the biopotential signals were normalized using the mapminmax function (scales data to the range [-1, +1]) [6], formerly known as premnmx, and the mapstd function (converts data to zero mean and unit standard deviation) [7], formerly known as prestd. Both normalized and non-normalized parameters were clustered using these methods.
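The two normalizations named above can be written out explicitly. The following NumPy sketch reproduces only the arithmetic (linear scaling to [-1, +1] and conversion to zero mean and unit standard deviation); it is not the MATLAB implementation. The function names, the column-wise layout (rows are samples, columns are parameters), the choice ddof=1, and the sample values are assumptions made for illustration.

```python
import numpy as np


def minmax_to_range(x, lo=-1.0, hi=1.0):
    """Scale each column of x linearly to [lo, hi]."""
    x = np.asarray(x, dtype=float)
    xmin = x.min(axis=0)
    xmax = x.max(axis=0)
    # A constant column has zero span; use 1.0 so that it maps to lo
    span = np.where(xmax > xmin, xmax - xmin, 1.0)
    return lo + (hi - lo) * (x - xmin) / span


def zscore(x, ddof=1):
    """Shift each column to zero mean and scale it to unit standard deviation."""
    x = np.asarray(x, dtype=float)
    std = x.std(axis=0, ddof=ddof)
    std = np.where(std > 0, std, 1.0)  # leave constant columns unscaled
    return (x - x.mean(axis=0)) / std


# Hypothetical parameter matrix: 4 samples, 2 signal parameters
features = np.array([[0.12, 35.0], [0.30, 41.0], [0.18, 38.0], [0.26, 44.0]])
scaled = minmax_to_range(features)
standardized = zscore(features)
```

When results from two environments are compared, it is worth checking that both use the same axis convention and the same standard deviation estimator, since either difference changes the normalized values.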
For the data clustering program in Python, the KMeans and MeanShift methods were selected from the sklearn.cluster module [8], which contains up to a dozen and a half methods; MeanShift was chosen as a substitute for clusterdata, which has no direct analogue there. The fuzzy c-means method, available as a separately installable package [9], was also used.
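A minimal sketch of how the two scikit-learn methods are called is given below. The data are synthetic stand-ins for normalized signal parameters, and the bandwidth value, the random seeds, and the variable names are assumptions chosen for illustration, not settings taken from the study.

```python
import numpy as np
from sklearn.cluster import KMeans, MeanShift

rng = np.random.default_rng(0)
# Synthetic stand-in for normalized parameters of two varieties (10 samples each)
group_a = rng.normal(loc=-0.5, scale=0.05, size=(10, 3))
group_b = rng.normal(loc=0.5, scale=0.05, size=(10, 3))
X = np.vstack([group_a, group_b])

# KMeans requires the number of clusters in advance
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# MeanShift has no n_clusters parameter; the cluster count follows from the bandwidth
meanshift = MeanShift(bandwidth=0.5).fit(X)
meanshift_labels = meanshift.labels_
n_found = len(np.unique(meanshift_labels))
```

Because MeanShift derives the number of clusters from the data, a sample with outlying parameters can end up in a cluster of its own.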
For data normalization, the scikit-learn library provides the sklearn.preprocessing package [10], which includes the following five methods: StandardScaler (scales data to zero mean and unit variance), MinMaxScaler (scales data to a given range, often from zero to one), MaxAbsScaler (scales each feature so that its maximum absolute value equals one), RobustScaler (recommended for data containing many outliers), and Normalizer (scales individual samples to unit norm).
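The effect of the five methods can be compared on one small matrix. In the sketch below the data values and the dictionary layout are illustrative assumptions; all scalers are used with their default parameters.

```python
import numpy as np
from sklearn.preprocessing import (
    MaxAbsScaler,
    MinMaxScaler,
    Normalizer,
    RobustScaler,
    StandardScaler,
)

# Hypothetical data: 4 samples (rows), 2 signal parameters (columns)
X = np.array([[1.0, -20.0], [2.0, 10.0], [3.0, 40.0], [4.0, 80.0]])

scalers = {
    "StandardScaler": StandardScaler(),  # zero mean, unit variance per column
    "MinMaxScaler": MinMaxScaler(),      # each column mapped to [0, 1]
    "MaxAbsScaler": MaxAbsScaler(),      # each column divided by its max |value|
    "RobustScaler": RobustScaler(),      # median removed, scaled by the IQR
    "Normalizer": Normalizer(),          # each row scaled to unit Euclidean norm
}
scaled = {name: scaler.fit_transform(X) for name, scaler in scalers.items()}
```

Note that Normalizer works on rows (samples), whereas the other four work on columns (features), so it changes the relationship between the parameters of a single sample rather than the spread of one parameter across samples.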

Analysis of results
The results of applying the clustering methods described above in the MATLAB environment are shown in Table 3, and the results obtained in Python are shown in Table 4. These tables indicate how many times, out of a total of 540 experiments with samples of two varieties, the samples were divided by variety correctly (in bold) or partially correctly (in parentheses: samples of different varieties fell into one cluster, but the cluster sizes were determined correctly). The voltage signals were subjected to low-pass filtering, centering (removal of the constant component), and differentiation, after which the minimum, mean, and maximum values of the initial, centered, filtered, and differentiated signals were calculated for each of the three measurement channels; clustering was carried out on these parameters. Table 4 additionally gives, in italics, the number of times that all samples of the same variety fell into one cluster but the number of clusters was greater than 2, since the MeanShift method does not take the number of clusters as a parameter. Typically, the extra third cluster contained a sample whose signal parameters had significant outliers.
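The parameter calculation described above can be sketched as follows. The source does not specify the filter, the order of the operations, or exactly which signals enter the parameter vector, so the moving-average filter, the window length, the processing order, and all function names below are assumptions of this sketch; the input is a synthetic three-channel record.

```python
import numpy as np


def moving_average(signal, window=5):
    """Simple low-pass filter (assumed here; the source does not name the filter)."""
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode="same")


def channel_features(signal, window=5):
    """Min, mean and max of the initial, filtered, centered and differentiated signal."""
    signal = np.asarray(signal, dtype=float)
    filtered = moving_average(signal, window)
    centered = filtered - filtered.mean()  # remove the constant component
    derivative = np.gradient(centered)     # numerical differentiation
    feats = []
    for s in (signal, filtered, centered, derivative):
        feats.extend([s.min(), s.mean(), s.max()])
    return np.array(feats)


def sample_features(channels, window=5):
    """Concatenate the parameters of all measurement channels of one sample."""
    return np.concatenate([channel_features(ch, window) for ch in channels])


# Synthetic three-channel record standing in for one seedling
t = np.linspace(0.0, 1.0, 200)
channels = [np.sin(2 * np.pi * t) + 0.1 * k for k in range(3)]
vector = sample_features(channels)
```

With these assumptions each sample is described by 36 values (3 channels, 4 signals, 3 statistics), and a matrix of such vectors, one row per sample, is what a normalization and clustering step would receive.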
As an illustration of the clustering results, Figure 3 shows the separation of samples into 2 clusters by the fcm method with data normalized by the premnmx function from the MATLAB package.
Figure 4 shows the results of Fcm clustering of data normalized by MaxAbsScaler in a Python program.
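The "correct" and "partially correct" outcomes counted in Tables 3 and 4 can be expressed as a small rating function. This is one possible reading of the criterion described above, not the procedure used to fill the tables; the function name, the returned strings, and the variety codes "N" and "O" are hypothetical, and the additional MeanShift case with more than two clusters is not covered.

```python
from collections import Counter


def rate_split(true_varieties, labels):
    """Rate one clustering run as 'correct', 'partial' or 'failed'."""
    clusters = {}
    for variety, label in zip(true_varieties, labels):
        clusters.setdefault(label, []).append(variety)

    # Correct: every cluster holds one variety, and there is one cluster per variety
    pure = all(len(set(members)) == 1 for members in clusters.values())
    if pure and len(clusters) == len(set(true_varieties)):
        return "correct"

    # Partial: cluster sizes match the variety sizes, but the varieties are mixed
    sizes_found = sorted(len(members) for members in clusters.values())
    sizes_true = sorted(Counter(true_varieties).values())
    if sizes_found == sizes_true:
        return "partial"
    return "failed"


varieties = ["N", "N", "O", "O"]
outcomes = [rate_split(varieties, run) for run in ([0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 0, 1])]
```

The rating ignores the numeric values of the labels, so a run that swaps the two cluster numbers is still counted as correct.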

Discussion of results
In the MATLAB environment, the best clustering results are obtained with the kmeans and fcm methods, provided the data are normalized by either of the functions used. Clustering of non-normalized data, as well as the use of the clusterdata method, seems impractical.
In the Python environment, the MeanShift method divides the samples into clusters partially correctly. However, since the number of clusters cannot be set for it, its use is advisable only for data without significant outliers. The clustering methods implemented by the KMeans and Fcm functions divide the samples into clusters explicitly and give almost identical results.
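The close agreement between k-means and fuzzy c-means is easier to see from the fuzzy c-means update itself: once the memberships are hardened by taking the largest one, each sample is assigned to its nearest center, as in k-means. The NumPy sketch below is a textbook form of the algorithm written for illustration and is not the installable package cited as [9]; the function name, the fuzzifier m=2, the stopping rule, and the synthetic data are assumptions.

```python
import numpy as np


def fuzzy_c_means(X, n_clusters=2, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Textbook fuzzy c-means; returns cluster centers and the membership matrix."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Random initial memberships, each row normalized to sum to 1
    u = rng.random((X.shape[0], n_clusters))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        w = u ** m
        # Centers are membership-weighted means of the samples
        centers = (w.T @ X) / w.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)  # guard against division by zero
        # Membership update: proportional to dist ** (-2 / (m - 1))
        inv = dist ** (-2.0 / (m - 1.0))
        new_u = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(new_u - u).max() < tol
        u = new_u
        if converged:
            break
    return centers, u


rng = np.random.default_rng(1)
# Synthetic stand-in for normalized parameters of two varieties (10 samples each)
X = np.vstack([
    rng.normal(loc=-0.5, scale=0.05, size=(10, 3)),
    rng.normal(loc=0.5, scale=0.05, size=(10, 3)),
])
centers, memberships = fuzzy_c_means(X, n_clusters=2)
hard_labels = memberships.argmax(axis=1)  # harden memberships for comparison with k-means
```

The membership matrix additionally shows how confidently each sample belongs to its cluster, which a hard k-means assignment does not provide.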
The StandardScaler normalization method, like the absence of normalization, gives practically no results. The RobustScaler and Normalizer methods separate the samples into clusters only partially, and only with the MeanShift method. The best clustering results for all three methods are achieved with MaxAbsScaler normalization. MinMaxScaler normalization gives the second-best result.

Conclusion
The presented results indicate that the clustering methods implemented in Python libraries give results that are generally no worse than those of the methods offered in the MATLAB package. At the same time, using Python programs does not require the installation of proprietary (commercial) and resource-intensive software, and the sklearn.cluster clustering methods that are available in the package but not discussed here give hope for better results. In particular, in future studies we plan to apply spectral and agglomerative clustering, affinity propagation, density-based spatial clustering of applications with noise (DBSCAN), and clustering based on the Gaussian mixture model.

Fig. 3. Splitting samples into 2 clusters by fcm with data normalized by the premnmx function from the MATLAB package.

Fig. 4. Splitting samples into 2 clusters by FCM with data normalized by MaxAbsScaler in Python.

Table 1. Parameters of biopotential signals when exposed to high temperature.

Table 2. Parameters of biopotential signals when exposed to low temperature.

Table 3. Summary table of sample clustering for all types of data normalization and clustering methods in the MATLAB environment.

Table 4. Summary table of sample clustering for all types of data normalization and clustering methods in the Python environment.