Optimization of a coronavirus genus recognition procedure based on the N-gene of prototypic strains

The article offers a solution to the problem of fast and efficient recognition of the coronavirus genus. For this purpose, the authors apply a virus genome targeting method based on the sufficiently short and conserved N-gene, which encodes the nucleocapsid protein. Comparing the codon frequency distribution in the N-gene of the analyzed genome with those of a set of 67 prototypic strains representing the coronavirus subgenera makes it possible to recognize the genus of the coronavirus. This paper proposes to optimize genus recognition by eliminating a significant number of the 64 codons of the genetic code (26 in one case and 57 in the other). With the optimized procedure, the authors achieved 100% genus recognition efficiency in a sample of 2,051 coronavirus genomes from the GenBank database with an annotated subgenus. The authors also achieved 99% confidence when using the optimized genus recognition procedure in a total sample of 3,242 genomes.


Introduction
Coronaviruses (Coronaviridae) are a large family of RNA-containing viruses capable of infecting humans and animals. The COVID-19 pandemic caused by SARS-CoV-2 has brought renewed attention to coronaviruses and their evolutionary potential [1]. High mutation and recombination rates contribute to the genotypic and phenotypic variability of coronaviruses [2]. Many scientists have studied the transmission of the virus from the main host population to other animal species and humans at the molecular genetic level [3]. Modern methods of analyzing viral genomes can create a system for monitoring gene pools of viral populations in unique ecosystems [4].
A reasonably fast and cost-effective targeted sequencing approach helps in tracking mutant strains of SARS-CoV-2 and other coronavirus species [5]. In that approach, only the sequence of the most variable fragment of the S-protein gene, which is responsible for binding to cellular receptors during infection, was determined. A similar targeting approach can be proposed for the rapid identification of coronavirus species, genus, and subgenus based on separately isolated fragments of the viral genome.
The study [6] examined various approaches to the recognition of the coronavirus genus (CoVs) based on the genomes of prototypic coronavirus strains. The authors characterized the coronavirus genome by the distribution of codon frequencies of its individual structural and nonstructural genes: the M gene for the membrane protein, the S gene for the spike protein, the N gene for the nucleocapsid protein, and the ORF1ab gene encoding several nonstructural proteins. Each of the four genera (α-CoV, β-CoV, δ-CoV, γ-CoV) is divided into subgenera, each of which also has several prototypic strains. The coronavirus genus was characterized by an analytical, i.e. averaged, distribution of codon frequencies of all its prototypic strains. In the variant approach [6] to genus recognition, each variant (regarded as a target) relied on a different combination of structural and nonstructural genes. The variant based on the N-gene of the nucleocapsid protein showed the best result in recognizing the genera β-CoV, δ-CoV, and γ-CoV. Averaging the codon frequency distributions of the prototypic strains over the large number (12) of subgenera in α-CoV caused a low recognition efficiency for this genus.
This paper aims at fast and reliable identification of the genus from a fragment (target) of the coronavirus genome. Among the previously identified targets, the authors chose the N-gene as the most conserved and significantly shorter (~1200 nucleotides) than the other genes. The nucleocapsid protein encoded by the N-gene has functions related to viral pathogenesis, transcription, and replication. The N-gene is often used for molecular diagnostics of CoVs [7][8][9].
We moved from recognizing the genus of a coronavirus based on analytical (averaged) codon frequency distributions [6] to recognition based on the individual frequency distributions in the N-gene of the prototypic strains of each subgenus. Determining the subgenus of a coronavirus automatically determines its genus. We investigated how recognition efficiency depends on the use of different codon groups in the N-gene. The groups were formed according to selected levels of codon frequencies averaged over the N-gene of 67 prototypic strains of all four coronavirus genera. We identified the two most efficient groups of codons, including 38 codons in one case and 7 codons in the other. Reducing the number of codons used optimized the recognition procedure. This optimization improved the efficiency of coronavirus genus recognition compared to recognition based on the analytical (averaged) genus distributions used in [6].

Materials
We used 3242 coronavirus genomes of the four genera from the GenBank database; for 2051 of them a subgenus was specified in addition to the genus. In the genus recognition procedure we used the codon distributions in the N-genes of the prototypic strains from GenBank. Table 1 contains the GenBank accession numbers of the prototypic coronavirus strains.

Table 1 (continuation).
Genus    Subgenus             Accession number (strain index)
γ-CoV    Cegacovirus (γ01)    EU111742 (1), KF793826 (2)

Next to the name of each subgenus in parentheses is its designation in this paper (Table 1). We considered 67 prototypic strains, each of which is indexed within its subgenus. Details of the species with prototypic coronavirus strains are given in [10].

Methods
Given the high mutation rate of viruses, it is of interest to assess which N-gene codons determine whether a coronavirus belongs to a subgenus. This raises the question of whether the number of N-gene codons used to recognize the subgenus can be reduced. We selected the N-gene codons according to their frequency of occurrence averaged over the prototypic strains of all subgenera (see Table 1). Figure 1 shows the average frequency of codons in the N-gene of the prototypic coronavirus strains; the horizontal axis lists the codons together with the amino acids they encode. Selecting a codon frequency threshold means that only those codons whose average frequency exceeds the threshold are used for recognition. Figure 1 marks two such thresholds with horizontal lines: 38 codons occur with frequencies above the threshold shown by the dashed line, and 7 codons occur with frequencies above the threshold shown by the dotted line.
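The threshold-based codon selection described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `codon_frequencies` and `select_codons`, the use of relative frequencies (fractions of all codons in the gene), and the toy sequences are assumptions made for the example.

```python
from collections import Counter
from itertools import product

# The 64 codons of the genetic code, written in the DNA alphabet.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)


def codon_frequencies(seq):
    """Relative frequency of each codon in a coding sequence read in frame.

    ASSUMPTION: frequencies are fractions of the total codon count; the
    paper may use another normalization.
    """
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_SET)  # skip ambiguous codons
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no unambiguous codons")
    return {c: counts[c] / total for c in CODONS}


def select_codons(prototype_seqs, threshold):
    """Codons whose frequency, averaged over all prototypes, exceeds the threshold."""
    freqs = [codon_frequencies(s) for s in prototype_seqs]
    mean = {c: sum(f[c] for f in freqs) / len(freqs) for c in CODONS}
    return sorted(c for c in CODONS if mean[c] > threshold)


# HYPOTHETICAL toy "genes", far shorter than a real N-gene (~1200 nucleotides).
toy_prototypes = ["ATGAAAAAAGGG", "AAAAAACCC"]
print(select_codons(toy_prototypes, threshold=0.15))
```

With real data, `prototype_seqs` would hold the N-gene sequences of the 67 prototypic strains, and the threshold would be tuned until the desired number of codons (38 or 7 in the paper) remains.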
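Let the chosen threshold leave the set of codons $C_r$, where $r$ is the number of codons in the set. Let $S \in \{\alpha, \beta, \gamma, \delta\}$ be the symbol for the genus of the coronavirus, $j$ the two-digit subgenus number in the genus, and $k$ the index of the strain in this subgenus according to Table 1. For example, $\alpha_{12,4}$ denotes the prototype strain AY994055, with index $k = 4$, of the subgenus Tegacovirus from α-CoV. The deviation of the codon frequency distribution of the analyzed N-gene from that of the prototype strain $S_{j,k}$, taken over the codons of $C_r$, is then given by formula (1), in which the scaling coefficient is chosen for ease of presentation of results.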
Having obtained the deviations (1) of the analyzed distribution from the distributions of all the prototypic strains, we choose the strain $S_{j,k}$ with the minimum deviation. We hypothesize that the coronavirus genome in question belongs to subgenus $S_j$ and thus to genus $S$ (see Table 1).
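The minimum-deviation rule can be sketched as below. The deviation measure here (a plain sum of absolute frequency differences over the selected codons) is an assumption standing in for formula (1), whose exact form and scaling coefficient are not reproduced; the function names and all frequency values are hypothetical.

```python
def deviation(query_freq, proto_freq, codons):
    """ASSUMED stand-in for formula (1): sum of absolute frequency differences
    over the selected codon set. The paper's formula may differ."""
    return sum(abs(query_freq[c] - proto_freq[c]) for c in codons)


def recognize(query_freq, prototypes, codons):
    """Return (genus, subgenus, strain index) of the closest prototype strain.

    `prototypes` maps (genus, subgenus number, strain index) to a
    codon -> frequency dictionary.
    """
    return min(
        prototypes,
        key=lambda label: deviation(query_freq, prototypes[label], codons),
    )


# HYPOTHETICAL frequencies for two prototype strains and one query genome.
selected = ["AAA", "GCT"]
prototypes = {
    ("alpha", "12", 4): {"AAA": 0.30, "GCT": 0.10},
    ("gamma", "01", 1): {"AAA": 0.10, "GCT": 0.30},
}
query = {"AAA": 0.28, "GCT": 0.12}

genus, subgenus, k = recognize(query, prototypes, selected)
print(genus, subgenus, k)
```

Because every prototype key carries its genus, recognizing the closest strain yields the subgenus and the genus in one step, which mirrors the statement that determining the subgenus automatically determines the genus.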

Results and discussion
We denote by $M(r)$ the number of errors in recognizing the coronavirus subgenus using the codon set $C_r$ and formula (1) for the 2051 coronavirus genomes annotated with a subgenus. Figure 3 shows the codons that make up the sets $C_{38}$ and $C_7$. Subgenus recognition based on 38 codons gives 10 errors, none of which is a genus recognition error. Subgenus recognition based on 7 codons gives 19 errors, 14 of which are also genus recognition errors.
If we consider the combined sample of coronavirus genomes annotated with both genus and subgenus (2051 genomes) and with genus only (1191 genomes), genus recognition based on 38 codons leads to 10 errors and recognition based on 7 codons to 34 errors.
The proposed methods for subgenus and genus recognition of coronaviruses yield a recognition accuracy of about 99% or higher.
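The error counts reported in this section can be converted into accuracies with simple arithmetic. The sketch below only restates the counts given above; the helper name `accuracy` and the dictionary labels are illustrative.

```python
def accuracy(errors, total):
    """Fraction of correctly recognized genomes."""
    return 1 - errors / total


# (errors, sample size) as reported in this section.
reported = {
    "subgenus, 38 codons, 2051 genomes": (10, 2051),
    "subgenus, 7 codons, 2051 genomes": (19, 2051),
    "genus, 38 codons, 2051 genomes": (0, 2051),
    "genus, 7 codons, 2051 genomes": (14, 2051),
    "genus, 38 codons, 3242 genomes": (10, 3242),
    "genus, 7 codons, 3242 genomes": (34, 3242),
}

for label, (errors, total) in reported.items():
    print(f"{label}: {100 * accuracy(errors, total):.2f}%")
```

The least favorable case, genus recognition with 7 codons on the combined sample, corresponds to 34 errors in 3242 genomes, i.e. an error rate of roughly 1%.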

Conclusions
When the two approaches to genus recognition based on the N-gene were applied to the 2051 genomes annotated with a subgenus, we noted 43 errors in the first case (with averaging), of which 19 were errors in the genus, and 36 errors in the second case (without averaging), of which only three were errors in the genus. This result shows that the use of analytical (averaged) codon frequency distributions of prototypic strains is less effective for recognizing both the subgenus and the genus of coronaviruses.
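One way to see why averaging can hurt is a toy example with two "codons". It is purely illustrative: the frequencies, the genus labels A and B, and the absolute-difference deviation are all hypothetical and are not taken from the paper's data. When the subgenera of one genus differ strongly, their average can lie closer to another genus than to any of its own strains.

```python
def l1(a, b):
    """Sum of absolute differences between two frequency vectors (assumed measure)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest(query, references):
    """Label of the reference vector closest to the query."""
    return min(references, key=lambda label: l1(query, references[label]))


def genus_averages(prototypes):
    """Average the prototype vectors of each genus (the 'analytical' distribution)."""
    grouped = {}
    for (genus, _subgenus), freqs in prototypes.items():
        grouped.setdefault(genus, []).append(freqs)
    return {
        genus: tuple(sum(col) / len(col) for col in zip(*rows))
        for genus, rows in grouped.items()
    }


# HYPOTHETICAL prototypes: genus A has two very different subgenera.
prototypes = {
    ("A", "A1"): (0.9, 0.1),
    ("A", "A2"): (0.1, 0.9),
    ("B", "B1"): (0.55, 0.45),
}
query = (0.8, 0.2)  # a genome close to subgenus A1

genus_by_average = nearest(query, genus_averages(prototypes))
genus_by_prototype = nearest(query, prototypes)[0]
print(genus_by_average, genus_by_prototype)
```

In this toy setting the averaged distribution of genus A sits at (0.5, 0.5), so the query is assigned to genus B, whereas comparison with the individual prototypes assigns it to genus A.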