Evaluation and application of the mitochondrial CO I gene in the identification of endangered wildlife in multi-species mixed biological samples

In this study, second-generation high-throughput sequencing was combined with DNA barcoding: multi-species mixed samples were prepared manually, and the mitochondrial gene CO I was used as a barcode to simultaneously identify the animal species in the mixed samples, including endangered species. The results showed that the simultaneous detection rate of the species in the mixed samples was as high as 100% at the family and genus levels and as high as 89% at the species level, and that sensitivity was high: trace species making up as little as 1% of a sample could be detected. However, nearly 30% non-target classification annotations appeared at the species level. It can be concluded that mini CO I barcoding can be applied to the simultaneous identification of animal species in mixed biological samples with a high species identification rate. The non-target classification matches at the species level could be reduced by increasing the barcode length and by improving the sequencing technology, the reference database and so on. This study used DNA metabarcoding to evaluate the feasibility of identifying endangered animals in multi-species mixed biological samples with CO I, in order to lay a preliminary foundation for the advancement of DNA metabarcoding in wildlife forensic identification.


Introduction
Genetic species identification plays a key role in the investigation of illegal trade in endangered wildlife and of food adulteration. DNA barcoding is an established molecular technique for species identification that uses standardized short DNA sequences. In recent years, it has been widely used in food supervision to address labeling errors, adulteration and contamination [1][2][3], for example in tracing processed foods such as seafood and meat products. Correct identification of the species contained in food is essential to protect consumers from potential adulteration, mislabeling of ingredients or food poisoning.
Another mature application of DNA barcoding in forensic science is the investigation of criminal activities such as the illegal acquisition, transportation, trading and smuggling of endangered wildlife. Endangered species of flora and fauna are listed by the Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES), whose Appendices I, II and III list protected species according to the extent to which their populations are threatened with extinction. In addition to the regulated legal trade, a large part of the trade in endangered animals and plants is illegal. According to statistics, from 2004 to 2013 the national customs authorities investigated and dealt with 930 criminal cases of smuggling of precious species and seized 710 tons of precious species; nearly 10,000 illegal trade cases were handled through administrative investigation. In some cases the species involved are not difficult to identify: as long as the specimen is morphologically complete, it can be identified by traditional morphological classification. Identification is more difficult when only parts of animals or plants without apparent morphological characteristics are available, or when the plants or animals have been pulverized into products. For samples that are morphologically unrecognizable and of unknown species composition, a standardized, fast and reliable identification method is both helpful and necessary for law enforcement. These advantages make DNA barcoding the preferred method for identifying endangered species in animal and plant products.
For species identification in mixed samples containing more than one component, such as traditional Chinese medicines, DNA barcoding based on first-generation Sanger sequencing is not applicable. Such samples usually contain multiple species, which can only be analyzed efficiently if multiple DNA barcodes are sequenced in parallel, and second-generation sequencing makes this possible. Therefore, this study combined second-generation high-throughput sequencing with DNA barcoding (DNA metabarcoding), manually prepared three mixed samples with different species compositions and proportions, and used the mitochondrial gene CO I as a barcode to make an initial assessment of the feasibility and application prospects of DNA metabarcoding for the rapid identification of endangered species in complex samples.

Materials
The animal samples used to prepare the artificial mixed samples in this study covered 11 genera in 9 families, including China's class I and class II key protected animals, beneficial species with economic and scientific research value, and animals regulated under the CITES appendices. See Table 1 for specific information. All samples were taxonomically confirmed by animal morphology experts of the Forest Police Forensic Center of the State Forestry Administration. A total of three mixed samples (U1-1, 1-1G, 1-1W) were designed and prepared, each containing 9 to 11 species, and the components were mixed by dry weight ratio (relative concentrations of 1% to 12%). For Saiga tatarica and pangolin, which are often used in medicine, horns and nails were taken as experimental materials, and these two hard materials were ground directly with a grinder. For the remaining species, muscle tissue was freeze-dried for 78 hours, and the lyophilized material was ground using an autoclaved mortar and pestle or a freeze mill and then stored at -20℃. The individual components of each of the three mixed samples were weighed, thoroughly mixed in a tumbler for 20 hours, and stored at -20℃ until use. The specific species composition and proportions are shown in Table 2.

Genomic DNA extraction
Genomic DNA was extracted using the TaKaRa MiniBEST Universal Genomic DNA Extraction Kit Ver.5.0. Because sample 1-1G contained bone-like components such as horns and nails, it was digested overnight in a metal bath, whereas the other samples were digested for 3 to 4 hours. The concentration and quality of the DNA were determined by ultraviolet spectrophotometry and agarose gel electrophoresis.

Primer design
Mini barcodes are more suitable here for two reasons: the read length of second-generation sequencing platforms is limited, and the genomic DNA in highly processed products such as foods and pharmaceuticals is highly degraded. Therefore, this study used a 360 bp CO I fragment as the barcode [4]. The primer sequences are as follows: CO Imini-1: 5'-GGWACWGGWTGAACWGTWTAYCCYCC-3', CO Imini-2: 5'-TAIACYTCIGGRTGICCRAARAAYCA-3'.
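Both primers are degenerate, so locating them in a read requires expanding the IUPAC ambiguity codes. The following Python sketch is illustrative only and is not part of the study's pipeline: it converts each primer to a regular expression, counts how many concrete sequences each primer represents, and extracts the region between the primer sites. The helper names (primer_to_regex, degeneracy, extract_insert) are hypothetical, and treating inosine (I) as matching any base is a simplifying assumption.

```python
import re

# IUPAC ambiguity codes expressed as regex character classes.
# Assumption: inosine ("I") is treated as matching any of the four bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]", "I": "[ACGT]",
}

# Complement of each code; inosine is complemented to "N" (assumption).
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "W": "W", "S": "S",
    "K": "M", "M": "K", "B": "V", "V": "B",
    "D": "H", "H": "D", "N": "N", "I": "N",
}

FWD = "GGWACWGGWTGAACWGTWTAYCCYCC"  # forward primer from the text
REV = "TAIACYTCIGGRTGICCRAARAAYCA"  # reverse primer from the text


def primer_to_regex(primer):
    """Expand a degenerate primer into a regular expression string."""
    return "".join(IUPAC[base] for base in primer.upper())


def reverse_complement(seq):
    """Reverse-complement a sequence that may contain IUPAC codes."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def degeneracy(primer):
    """Number of concrete sequences represented by a degenerate primer."""
    count = 1
    for base in primer.upper():
        count *= len(IUPAC[base].strip("[]"))
    return count


def extract_insert(read, fwd=FWD, rev=REV):
    """Return the sequence between the primer sites, or None if not found.

    The reverse primer anneals to the opposite strand, so its reverse
    complement is what appears on the strand carrying the forward primer.
    """
    pattern = re.compile(
        primer_to_regex(fwd) + "(.*?)" + primer_to_regex(reverse_complement(rev))
    )
    match = pattern.search(read.upper())
    return match.group(1) if match else None
```

Under these assumptions, degeneracy(FWD) gives the number of oligonucleotide variants in the forward primer pool, and extract_insert returns None for reads in which either primer site is absent.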

Sequencing and analysis
The high-throughput sequencing in this study was performed by Jinweizhi Biotech on an Illumina HiSeq X Ten sequencing platform. Base calling from the raw image data and preliminary quality analysis were performed with Bcl2fastq (v2.17.1.14); the raw sequencing data were optimized using Cutadapt (v1.9.1), Qiime (1.9.1) and Vsearch (1.9.6); and all sequences were subjected to OTU partitioning and statistical analysis using Qiime (1.9.1) and Vsearch (1.9.6).

Raw data statistical analysis
The data volume and sequencing quality of the raw sequencing data were summarized for the three samples. The numbers of reads output for samples 1-1G, 1-1W and U1-1 were 309838, 262030 and 270238, respectively. The specific data are shown in Table 3.

Sequencing data quality optimization
High-throughput sequencing usually introduces sequencing errors such as point mutations, and quality is relatively low toward the end of each read. To obtain higher-quality and more accurate bioinformatics results, the raw sequencing data must be optimized, for example by removing primer and linker sequences and removing chimeras, to obtain the final effective data. The statistics of the valid data obtained after optimization and filtering for the three samples are shown in Table 4.
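The idea of discarding low-quality read ends can be illustrated with a minimal Python sketch. It is a simplified trailing-quality trim, not the algorithm implemented in Cutadapt or Qiime, and it assumes Phred+33 encoded FASTQ quality strings; the function names and the threshold of 20 are hypothetical choices for illustration.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality_line]


def trim_low_quality_tail(seq, quality_line, min_q=20):
    """Remove bases from the 3' end until a base with score >= min_q is reached.

    This is a deliberately simple rule for illustration; production tools
    use more elaborate trimming criteria.
    """
    scores = phred_scores(quality_line)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality_line[:end]


def mean_error_probability(quality_line):
    """Average per-base error probability, using P = 10 ** (-Q / 10)."""
    scores = phred_scores(quality_line)
    if not scores:
        return 0.0
    return sum(10 ** (-q / 10) for q in scores) / len(scores)
```

For example, a read whose last two bases carry the quality character "#" (Phred score 2) loses those two bases, while bases with the character "I" (Phred score 40) are kept.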

OTU analysis and species annotation
An operational taxonomic unit (OTU) is a uniform label artificially assigned to a taxonomic unit (genus, species, etc.) in population genetics studies. In bioinformatics analysis, each sequence comes from one species. To obtain species, genus and other information from the sequencing result of a sample, the sequences need to be classified. Through this clustering operation, the sequences are grouped according to their similarity, and each group is one OTU. In this study, all sequences were subjected to OTU partitioning and statistical analysis at the 97% similarity level. OTU cluster analysis of the three samples yielded a total of 48 OTUs. To obtain the species classification corresponding to each OTU, a representative sequence was selected for each OTU and annotated with the RDP classifier, which gave the species composition of each sample.
In samples 1-1W, 1-1G and U1-1, 18.68%, 21.38% and 20.69% of the reads, respectively, did not obtain species annotation information. The classification of the three samples at the family level is shown in Table 5, and the classification at the genus level is shown in Table 6. The species identification rates of samples 1-1W and U1-1 are 100% at the family and genus levels. The species identification rate of sample 1-1G is 89%; however, two non-target genera, Axis and Elaphurus, appeared at the genus level, indicating that CO I has low resolution for deer at the genus level and below. The classification of the three samples at the species level is shown in Figure 1. The species identification rates of samples 1-1W and U1-1 are both 89%, and that of sample 1-1G is 73%.
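The OTU partitioning at 97% similarity can be sketched as a greedy, centroid-based clustering. The Python code below is a toy illustration under stated assumptions and is not the Qiime or Vsearch implementation used in the study: it assumes the reads are already aligned to equal length, measures identity as the fraction of matching positions, and processes unique sequences in order of decreasing abundance. All function names are hypothetical.

```python
from collections import Counter


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otu_clustering(reads, threshold=0.97):
    """Cluster reads into OTUs around centroid sequences.

    Unique sequences are visited from most to least abundant. Each one
    joins the first existing OTU whose centroid it matches at or above
    the threshold; otherwise it becomes the centroid of a new OTU.
    """
    abundance = Counter(reads)
    ordered = sorted(abundance, key=lambda s: (-abundance[s], s))
    otus = []  # each OTU: {"centroid": sequence, "size": read count}
    for seq in ordered:
        for otu in otus:
            if identity(seq, otu["centroid"]) >= threshold:
                otu["size"] += abundance[seq]
                break
        else:
            otus.append({"centroid": seq, "size": abundance[seq]})
    return otus


def relative_abundance(otus):
    """Share of all clustered reads that falls into each OTU."""
    total = sum(otu["size"] for otu in otus)
    return [otu["size"] / total for otu in otus]
```

In this sketch the centroid plays the role of the representative sequence that would be passed to a classifier for annotation, and the relative abundances correspond to the kind of values displayed in a heat map of the samples.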
At the species level, in addition to the non-target annotations among the deer, Mustela sibirica was incorrectly annotated as Mustela erminea. Selenarctos thibetanus, present at a content as low as 1% in sample 1-1W, was detected, but neither of the bone components in sample 1-1G was detected.
Figure legend: columns correspond to samples and rows to species; the tree above the graph is the sample clustering tree, and the tree on the left is the species clustering tree. The color of each square in the heat map represents the relative abundance of the species in that row.

Discussion
When DNA metabarcoding was first proposed in 2011, its value for the analysis of biodiversity was pointed out. The technology has since been widely applied across metagenomics research, including soil [5], ocean [6], environmental pollution [7], the human gastrointestinal tract [8] and other areas.
In recent years, scholars in China and abroad have also applied DNA metabarcoding to the identification of the ingredients of mixed Chinese herbal medicines [9]. In 2012, Coghlan et al. demonstrated the ability of DNA metabarcoding to detect species in complex Chinese medicine samples in the form of powders, crystals, capsules, tablets and herbal teas. Their findings showed that some Chinese medicine samples contained CITES-listed species, including the Asiatic black bear (Selenarctos thibetanus) and Saiga tatarica, as well as undeclared ingredients and potentially toxic and sensitizing plant constituents [10]. In 2014, Cheng et al. carried out a metabarcoding analysis of traditional Chinese medicine preparations of known composition, based on the Liuwei Dihuang Pill formula widely used in China. The results showed significant differences in quality and safety between commercial preparations, because the undeclared Acanthopanax senticosus found in some preparations may pose a safety risk to consumers [11]. In 2017, Raclariu et al. combined DNA metabarcoding with HPLC-MS to identify Veronica officinalis and its common adulterant V. chamaedrys in 16 medicinal products of Veronica officinalis, and found that Veronica officinalis was detectable in only 15% of the products, whereas V. chamaedrys was detected in 62% [12]. Another study used a sequencing platform to identify the species in commercially available Ruyi Golden Powder, which contains 10 herbs, and detected the ITS2 sequences of 8 of the herbs [13]. These studies show that DNA metabarcoding is an effective method for examining highly processed Chinese medicine products and will help to monitor their legality and safety. In this study, three multi-species mixed samples were prepared manually.
U1-1 was a mixed sample with a mass ratio of 1:1 for all components, 1-1W was a mixed sample containing a trace component (1%), and 1-1G was a mixed sample containing saiga antelope (Saiga tatarica) horn and pangolin powder. High-throughput sequencing was used to evaluate the feasibility of mitochondrial CO I as a barcode for identifying endangered animals in complex biological samples. The results showed that the simultaneous identification rates of the species in the mixed samples were as high as 100%, 100% and 89% at the family, genus and species levels, respectively, and that sensitivity was high: a species present at as little as 1% was detected. However, bone components with low DNA content, such as animal horns and scales, were not detected. Although DNA metabarcoding based on CO I achieved simultaneous recognition of animal species in mixed samples, with a detection rate as high as 89% at the species level, nearly 30% non-target classification matches were also produced at that level. This may be because the mini CO I barcode used in this study contains less taxonomic information and therefore has lower resolution. In addition, base-calling errors introduced during sequencing can lead to false-positive classifications. With the further development of second-generation sequencing, including longer read lengths and higher accuracy, such problems are expected to be greatly reduced.
This study suggests that DNA metabarcoding has great potential for the identification of Chinese medicines and of food-derived components. However, many difficulties remain to be overcome, such as the extraction of high-quality DNA from highly processed foods and pharmaceuticals, the low resolution of mini barcodes, and the quality and completeness of reference sequence databases, which are especially important in law enforcement. The application of DNA metabarcoding to the forensic identification of wildlife samples is therefore still in its infancy and faces many challenges. This study aims to lay a preliminary foundation for the advancement of DNA metabarcoding in the forensic identification of animals and plants.