Geographical origin identification of teas using UV-VIS spectroscopy

In this work, we propose a method for verifying the differentiating characteristics of simple tea infusions prepared in boiling water alone, which represent the final product as ingested by consumers. For this purpose, a total of 125 tea samples from different provinces of Vietnam were analyzed by UV-Vis spectroscopy combined with multivariate statistical methods. Principal Component Analysis-Discriminant Analysis (PCA-DA), Partial Least Squares Discriminant Analysis (PLS-DA), and Artificial Neural Network (ANN) models were compared for constructing the identification model. The experimental results showed that the ANN model performed better than the PCA-DA and PLS-DA models. The optimal ANN model was obtained with 200 neurons, giving identification rates of 99% in the training set and 84% in the prediction set. The proposed methodology provides a simpler, faster, and more affordable classification of simple tea infusions and can be used as an alternative to traditional tea quality evaluation.


Introduction
Tea (Camellia sinensis) has long been a familiar drink and is widely consumed in many countries around the world. Studies of the chemical composition of tea show that tea leaves contain approximately 2000 chemical compounds, of which polyphenols account for about 25-30%, caffeine for about 2-5%, and theobromine and theophylline for about 0.2-0.5% of the dry matter content [1]. This chemical composition, especially the content of methylxanthines and polyphenols, determines the quality and economic value of the finished tea [2]. Many studies have shown that, besides seed factors, the farming regime, processing techniques, and climatic and soil conditions have a great influence on the chemical composition of tea [2]. Therefore, finished tea products from different geographical regions can have different economic values. Currently, because of the great demand for tea, especially high-quality tea, famous tea products are being counterfeited by brands of unknown quality and unclear origin. To address this challenge, quality control and traceability of tea products are becoming essential, bringing practical benefits to manufacturers and consumers. Vietnam has 35 tea-producing provinces with a total area of 125,000 hectares, most of them in the Northern Midlands, North Central, and Central Highlands regions. Vietnam has exported tea to 107 countries, ranking 7th in export volume (987.3 thousand tons in 2018) and 6th in export value. However, the lack of quick and reliable analytical methods for determining the origin and quality of tea products is one of the bottlenecks in ensuring and improving product quality.
Prior studies focused on differentiating countries of production or tea varieties according to the mineral composition, using inductively coupled plasma atomic emission spectroscopy, or according to the contents of catechins and purine alkaloids, using high performance liquid chromatography (HPLC) with pattern recognition methods [3-5]. The main drawbacks of these methods are the use of expensive equipment, usually involving large operational or maintenance costs, and the use of a number of reagents for the extraction of organic compounds, the derivatization of analytes, or the mineralization of samples for elemental analysis. Another approach that has recently become popular is the use of spectral analysis methods, such as ultraviolet-visible (UV-Vis) spectroscopy and infrared reflectance (IR) spectroscopy, combined with multivariate analysis techniques [6-15]. Identification aimed at classifying the source of foods is a rapidly growing field. Compared with other chemical analysis methods, spectral analysis methods offer low cost and high throughput because tedious sample preparation is not needed. UV-Vis spectroscopy has already been used to differentiate between black, green, and Pu-erh tea varieties [17]. However, that method used methanol as the extraction solvent, which produced very broad spectra with highly correlated variables and required a more sophisticated non-linear classifier, Support Vector Machines. Furthermore, methanol is toxic to humans and the environment [18]. The geographical classification of teas by IR spectroscopy has also been reported, and a method that simultaneously classifies both the geographic origin and the variety of teas using digital images has been proposed [9]. However, these methodologies were applied directly to the tea as contained in commercial bags, whereas the infusion represents the final product as ingested by the consumer.
In this study, we propose a fast and simultaneous method for identifying the origins of green tea products from several provinces of Vietnam by combining UV-Vis spectroscopy with Principal Component Analysis-Discriminant Analysis (PCA-DA), Partial Least Squares Discriminant Analysis (PLS-DA), and Artificial Neural Network (ANN) models.

Material and methods
Molecular absorption spectra were recorded on a Shimadzu 1700 PC series spectrophotometer (Japan). Each spectrum consists of 211 intensity values at 1 nm intervals in the wavelength range 190 to 700 nm, measured in a quartz cuvette with an optical path length of 10 mm.
A total of 125 green tea samples with known geographical origins were collected from the provinces of Thai Nguyen (35 samples), Lam Dong (25 samples), Phu Tho (31 samples), and other provinces such as Ha Giang, Yen Bai, Tuyen Quang, and Hoa Binh (34 samples). The origin of each sample was established either by on-site collection (for building the model) or from the package (for comparing the predicted geographical origin with the training results). Approximately 100 g of air-dried tea leaves were kept for at least 2 days in a dehumidifier at 30% moisture before analysis. The dried tea samples were ground in a laboratory grinder and passed through a 63 μm mesh BS sieve to obtain tea powder.
Approximately 1.0 g of tea powder was accurately weighed and extracted twice with 50.0 mL of double-distilled water at 90-95 °C, for 5 minutes per extraction [10]. The tea solution was filtered through filter paper (0.45 μm) and diluted to 100 mL with double-distilled water. A 1.00 mL aliquot of this solution was transferred to a 25 mL volumetric flask and made up to the mark with distilled water.
Matlab R2016a (The MathWorks, USA) was used for statistical analysis. All 125 samples were randomly divided into a training set and an external validation set at a ratio of 80/20. The external validation set, consisting of 25 samples, was not used for model construction and therefore reflects the predictive power of the model. A data matrix with columns (absorbance at the selected wavelengths) and rows (the tea samples) was created for the chemometric calculations. PLS-DA, PCA-DA, and ANN were applied to obtain the classification models.
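The random 80/20 division described above can be illustrated with a minimal Python sketch. This is not the Matlab code used in the study: the function name `split_train_test`, the fixed seed, and the integer sample identifiers are all assumptions made for illustration.

```python
import random

def split_train_test(sample_ids, test_fraction=0.2, seed=0):
    """Randomly split sample identifiers into a training set and an
    external validation set (hypothetical helper, not from the study)."""
    ids = list(sample_ids)
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]

# Placeholder identifiers 0..124 stand in for the 125 tea samples.
train_ids, test_ids = split_train_test(range(125))
print(len(train_ids), len(test_ids))  # 100 25
```

The validation identifiers are set aside before any model fitting, which is what allows the 25 held-out samples to reflect predictive power.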

UV-Vis absorbance spectra
The UV-Vis spectra of the samples are shown in Figure 1a. The most informative part of the spectra lies in the range 190-490 nm, whereas the region above 490 nm carries little useful information. The absorbance in the range 250-350 nm is associated with the n→π* transitions of methylxanthines and catechins [16].
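Restricting a spectrum to the informative 190-490 nm window might look like the following sketch. The helper name `select_range` and the dummy absorbance values are hypothetical, and the study does not state that the trimming was done in exactly this way.

```python
def select_range(wavelengths, absorbances, lo_nm=190, hi_nm=490):
    """Keep only the (wavelength, absorbance) pairs inside [lo_nm, hi_nm]."""
    kept = [(w, a) for w, a in zip(wavelengths, absorbances) if lo_nm <= w <= hi_nm]
    return [w for w, _ in kept], [a for _, a in kept]

# Placeholder spectrum: a 1 nm grid with dummy absorbance values (not measured data).
wavelengths = list(range(190, 701))
absorbances = [0.0] * len(wavelengths)

wl_sel, ab_sel = select_range(wavelengths, absorbances)
print(wl_sel[0], wl_sel[-1])  # 190 490
```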

Selection of principal components
PCA is a statistical method that transforms multiple indicators into a few representative aggregate indicators. It removes redundant information by projecting the data from a high-dimensional space onto a low-dimensional space. The vectors obtained for the principal components are mutually orthogonal. As shown in Figure 2, the first principal component (PC1) accounts for 88.76% of the variance. In other words, the first component represents 88.76% of the information in the green tea samples, and only the first PC was used to set up the classification model.
PLS-DA is a modification of the Partial Least Squares algorithm for classification purposes. It is based on the PLS2 algorithm, which searches for latent variables with maximum covariance with the categorical variables (Y) [20]. As shown in Figure 3, PC1 accounts for only 20% of the total variance of the data set. When the components up to PC11 are included, the cumulative variance exceeds 95%. Therefore, we took 11 PCs as the starting point for finding the optimal number of PCs.
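The cumulative-variance criterion used to pick a starting number of components can be sketched in Python as follows. The function name `n_components_for_variance` and the example variances are illustrative assumptions, not values from Figures 2 or 3.

```python
def n_components_for_variance(variances, threshold=0.95):
    """Return the smallest number of leading components whose cumulative
    share of the total variance reaches the threshold."""
    total = sum(variances)
    cumulative = 0.0
    for k, v in enumerate(variances, start=1):
        cumulative += v
        if cumulative / total >= threshold:
            return k
    return len(variances)

# Illustrative per-component variances in decreasing order (not the paper's data).
example = [50.0, 30.0, 10.0, 6.0, 3.0, 1.0]
print(n_components_for_variance(example))  # 4
```

In this toy example the cumulative shares are 50%, 80%, 90%, and 96%, so four components are the first to exceed the 95% threshold.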

PCA-DA and PLS-DA model performance
Two supervised recognition techniques, PCA-DA and PLS-DA, were applied with 1 to 20 PCs. Figures 4 and 5 show the recognition results obtained by the PCA-DA and PLS-DA approaches in the training and prediction sets. Compared with the PCA-DA model, the PLS-DA model was able to deal with imbalanced training samples and the prediction set. For determining the origin of tea samples in the prediction set, the PCA-DA model reached its highest classification accuracy of 84% with PC12, while PLS-DA reached only 76% with PC18. Figure 6 shows the recognition results obtained by the PCA-DA model with PC12: the geographical origins of samples TN1, TN4, PT2, and PT6 in the external prediction set were wrongly identified. Thus, the PCA-DA model identified the 25 test samples with an accuracy of only 84%, which is not high. Therefore, the study went on to investigate the ability of the ANN algorithm to identify the geographical origin of tea samples from UV-Vis spectral data. Figure 8 shows that the test samples from the 4 different geographical regions were classified with a high accuracy of 96%. Out of 25 samples, the ANN model correctly identified 24, and one sample from Phu Tho province was incorrectly identified as a Thai Nguyen sample. Therefore, ANN is a suitable method for determining the origin of a tea sample.
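The reported accuracies follow directly from the counts of correctly identified test samples (21 of 25 for PCA-DA, 24 of 25 for ANN). The short Python sketch below shows the calculation for the ANN case. The per-class sizes of the test set are not given in the text, so the label lists here are hypothetical and chosen only to reproduce one Phu Tho (PT) sample predicted as Thai Nguyen (TN).

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted origin matches the true origin."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusion_counts(y_true, y_pred):
    """Count (true, predicted) label pairs; off-diagonal pairs are errors."""
    return Counter(zip(y_true, y_pred))

# Hypothetical class sizes summing to 25 (assumption, not reported in the paper).
y_true = ["TN"] * 7 + ["LD"] * 5 + ["PT"] * 6 + ["Other"] * 7
y_pred = list(y_true)
y_pred[12] = "TN"  # index 12 is the first PT sample, predicted as TN

ann_accuracy = accuracy(y_true, y_pred)
errors = {pair: n for pair, n in confusion_counts(y_true, y_pred).items()
          if pair[0] != pair[1]}
print(f"{ann_accuracy:.0%}", errors)  # 96% {('PT', 'TN'): 1}
```

The same `accuracy` function gives 84% when 4 of the 25 labels differ, as in the PCA-DA result.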

Conclusions
This study used ANN as a pattern recognition tool to develop an identification model. We show that UV-Vis spectra coupled with ANN pattern recognition can identify the geographical origin of green tea samples. The ANN algorithm outperforms the PCA-DA and PLS-DA approaches in identifying the geographical origin of the tea samples. Therefore, UV-Vis spectral analysis with ANN pattern recognition could also be used to identify the origin of other agricultural products. The proposed classification method is an alternative to methods that use expensive equipment or involve tedious sample preparation. UV-Vis spectra are obtained easily and at low cost, and the pattern recognition calculations applied are well known.
This study is a part of project DH2018-TN04-03 funded by Thai Nguyen of Education.