Pyrosequencing with di-base addition for the identification of pathogenic bacteria

We propose a novel method for the identification of pathogenic bacteria in which a partial amplicon of a molecular marker is interrogated by pyrosequencing with di-base addition (PDBA). PDBA is conducted by adding a di-base (two nucleotides together) instead of a single base into each reaction, which yields a set of highly sequence-specific encodings containing the type and the number of incorporated nucleotide(s) (peak height). By comparing the encoding size of each isolate and the number of incorporated nucleotide(s) in each cycle, moving from first to last, various kinds of bacteria could be unambiguously identified. To verify its feasibility, we simulated PDBA results from thirteen isolates of Mycobacterium species and compared their encoding sizes and the number of incorporated nucleotide(s) in each cycle with those predicted by in-house software. The thirteen isolates were successfully differentiated. We also targeted a partial RNase P RNA gene (rnpB) sequence of cultured M. paratuberculosis and M. celatum to differentiate the two isolates. By comparing the encoding size of each isolate and the number of incorporated nucleotide(s) in each cycle, the two Mycobacterium isolates were successfully differentiated. In conclusion, PDBA enabled reliable identification of pathogenic bacteria.


Introduction
The identification of pathogenic bacteria is important in microbiological diagnostics, and the demand for simple, sensitive, specific, and robust tests is increasing. Both biochemical testing and molecular methods have been developed for this purpose. However, biochemical testing may yield inconsistent or inconclusive results ([1]). Molecular methods provide microbiologists with additional tools that may supplement biochemical testing for bacterial pathogen identification ([2]). Among the molecular methods, DNA sequencing technologies, such as Sanger sequencing, pyrosequencing and next-generation sequencing (NGS), offer attractive strategies for identifying pathogenic bacteria from clinical specimens and are regarded as a direct way to differentiate them. Sanger sequencing is low-cost and highly accurate, but its low throughput limits its use for the simultaneous identification of large numbers of pathogenic bacteria. Pyrosequencing is a sequencing-by-synthesis (SBS) method that has been widely used to detect pathogenic bacteria by targeting molecular markers such as rnpB ([3]), rpoB ([4]), 16S rRNA ([5]) and gyrB ([6]). Traditional pyrosequencing targets species-specific nucleotide sequences by adding one nucleotide into each reaction at a time so that the type and the number of incorporated nucleotide(s) can be determined. The isolates are then identified by aligning the reads against a large sequence dataset, such as GenBank and the other databases of the International Nucleotide Sequence Database Collaboration ([3]). However, it is difficult to obtain reliable sequences with traditional pyrosequencing when the abundance of the analyzed isolates is relatively low and the species-specific nucleotide sequences are longer than 60 bp ([3,7]). Furthermore, its limited resolution of closely related species hampers its use as a routine tool for identifying pathogenic bacteria.
NGS is another sequencing technique that can identify pathogenic bacteria. It is rapid and provides confident identification of any bacteria present in a sample. However, NGS is so sensitive that it can detect bacteria present in very small numbers, below the infection threshold, which may result in unnecessary antibiotic therapy ([8]). Therefore, it is necessary to explore an additional method for the identification of pathogenic bacteria. We have previously proposed a novel sequencing method in which a template is cyclically sequenced twice, in two parallel sequencing runs with different combinations of di-base addition (AG/CT, AT/CG, or AC/GT), so that two sets of encodings are obtained ([9]). The two sets of encodings allow the bases to be sequentially decoded, moving from first to last, in a deterministic manner. This approach is compatible with most SBS-based sequencing platforms, such as traditional pyrosequencing, Roche 454 and Ion Torrent PGM ([9]). When the novel sequencing technique is combined with traditional pyrosequencing, we call it pyrosequencing with di-base addition (PDBA). Because a di-base is added into each reaction, PDBA increases the signal intensity (peak height) in each sequencing run and reduces the number of sequencing cycles (a sequencing cycle represents one sequencing reaction) required to read through the target sequences. Thus, when specific target sequences are interrogated by PDBA, the encodings, which contain the peak heights and the number of sequencing cycles (encoding size), are sequence-specific. By comparing the encoding sizes and the peak heights of species-specific nucleotide sequences with those predicted by in-house software, the isolates can be unambiguously differentiated.
Here, we applied PDBA to differentiate pathogenic bacteria. The target for PDBA analysis was the rnpB gene, which codes for the RNA subunit of RNase P. The gene is present in all bacteria and has both conserved and highly variable sequence regions. It has been shown to be an excellent target for the differentiation of bacterial species of diverse genera ([3]). In the present study, taking Mycobacterium species as an example, a partial rnpB gene sequence was interrogated by PDBA. To verify the feasibility of this novel method, we first simulated PDBA results from 13 isolates of Mycobacterium species and then compared the encodings of each isolate with those predicted by the in-house software. In addition, to discriminate two cultured mycobacterial strains, M. paratuberculosis and M. celatum, a portion of the rnpB gene of each isolate that contained a hypervariable region was also targeted by PDBA.

DNA preparation
DNA was extracted from two cultured mycobacterial strains using the QIAamp DNA Mini Kit (QIAGEN, Hilden, Germany) according to the manufacturer's instructions. The extracted DNA pellets were stored at -20 °C until use.

PCR
PCR assays targeting a partial rnpB gene sequence that contained a hypervariable region were used for the sequence-based identification of mycobacteria. A sequencing primer was designed for the most effective sequencing of the hypervariable region. The templates harboring the partial rnpB gene were amplified using the following primers: forward, 5'-CGGATGAGTTGGCTGGGCGG-3'; reverse, 5'-biotin-GCTCTTACCGCACCGTTTCACCC-3'. Primers were synthesized by Shenzhen Huada Gene Technology Co., Ltd. (Shenzhen, China). PCR was performed using 25 ng of template DNA in a total volume of 25 μL containing final concentrations of 1× HotStar Taq DNA polymerase buffer, 1.6 mM additional MgCl2, 200 μM dNTPs, 200 nM of each primer and 2 U of HotStar Taq DNA polymerase. The PCR program consisted of an initial denaturation at 98 °C for 30 s, 45 cycles of 98 °C for 10 s, 56 °C for 20 s, and 72 °C for 15 s, followed by a final extension at 72 °C for 5 min. PCR products were immediately stored at -20 °C and were verified by electrophoresis on 2% agarose gels.

Conventional pyrosequencing
Twenty microliters of PCR product were purified and rendered single-stranded on a PyroMark Q96 Vacuum Workstation (Qiagen, Courtaboeuf, France) following the manufacturer's instructions. The single-stranded templates and 1 μM sequencing primer (5'-TGGCTGGGCGGCCGCGG-3') were added into a 96-well filter plate and annealed by incubation in annealing buffer for 2 min at 80 °C. After annealing, the mixture was cooled to room temperature before use. Pyrosequencing analysis was carried out on an automated PSQ 96MA system (Biotage AB, Uppsala, Sweden) using the SQA PyroMark Gold Q96 Reagents (1×96), which contain enzyme, substrate, and nucleotides. The reaction proceeded by cyclic dispensation of C/T/G/A into the growing DNA chains. The pattern of emitted light in relation to the type and number of nucleotides incorporated was subsequently displayed in a pyrogram. For reproducibility, each sample was analysed in triplicate.

Pyrosequencing with di-base addition (PDBA)
According to our previous work, the cartridge contains a total of six reagent vials. Substrate and enzyme are added into separate vials, and di-bases chosen from AG, CT, AC, GT, AT, and CG are added into the remaining four vials ([7]). In contrast to conventional pyrosequencing, PDBA was carried out by adding one di-base instead of one base (A, G, C, or T) into the reaction at a time. Here, di-bases AC and GT were placed into separate reagent vials and cyclically added into the reactions. After interrogation, a set of encodings containing the type of di-base and the number of incorporated base(s) was obtained sequentially by translating the peaks in the pyrograms.
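As an illustration of how such an encoding arises, the following minimal Python sketch simulates a PDBA run on a known sequence. It is not the software used in this work. The function name `pdba_encode`, the assumption that the input is the strand being synthesized (read 5'->3'), and the choice to record a zero-height first dispensation when the sequence does not start with a base of the first di-base are all assumptions made for illustration.

```python
def pdba_encode(sequence, order=("AC", "GT")):
    """Simulate PDBA on `sequence` (assumed: the strand being synthesized, 5'->3').

    Returns a list of (di_base, count) pairs, one pair per dispensation.
    """
    sequence = sequence.upper()
    if set(order[0]) | set(order[1]) != set("ACGT"):
        raise ValueError("the two di-bases must together cover A, C, G and T")
    if not set(sequence) <= set("ACGT"):
        raise ValueError("sequence may contain only A, C, G and T")
    encoding = []
    pos = 0
    while pos < len(sequence):
        # The two di-bases are dispensed alternately.
        di_base = order[len(encoding) % 2]
        count = 0
        # Both nucleotides of the di-base are available, so extension continues
        # until the next base belongs to the other di-base.
        while pos < len(sequence) and sequence[pos] in di_base:
            count += 1
            pos += 1
        encoding.append((di_base, count))
    return encoding


example = "ACCGTTA"  # made-up sequence, not taken from rnpB
print(pdba_encode(example, ("AC", "GT")))  # [('AC', 3), ('GT', 3), ('AC', 1)]
print(pdba_encode(example, ("AT", "CG")))  # [('AT', 1), ('CG', 3), ('AT', 3)]
```

The length of the returned list corresponds to the encoding size, and each count corresponds to an idealized peak height. Real pyrogram peaks are noisy and would need to be rounded before comparison.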

The principle for identification of pathogenic bacteria using PDBA
We have proposed a real-time decoding sequencing strategy in which a template is cyclically interrogated in two parallel sequencing runs with two orders of di-base dispensation. The two orders are randomly chosen from the three di-base dispensation orders AG/CT, AT/CG and AC/GT. After the two parallel sequencing runs, two sets of encodings containing the type and the number of incorporated nucleotide(s) are obtained. By decoding the two sets of encodings, moving from first to last, in a deterministic manner, the queried sequence can be sequentially determined ([10]). Interestingly, for each set of encodings, the encoding size and the number of incorporated nucleotide(s) in some sequencing cycles are highly sequence-specific. By comparing the encoding size of each isolate and the number of incorporated nucleotide(s) in each sequencing cycle from one sequencing run, different sequences can be differentiated. That is, different sequences can be differentiated by using only one sequencing run, without decoding the encodings. In this study, we therefore applied only one sequencing run instead of two parallel sequencing runs for the identification of pathogenic bacteria.

Table 1 notes: Di-bases AT and CG were sequentially added into each reaction. *: Encoding CG 1 indicates that di-base CG was added into the sequencing reaction and one nucleotide, C or G, was incorporated into the template during the sequencing reaction. The organisms highlighted in gray are the species with identical encoding size.
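The two-run decoding strategy outlined in this section can be sketched as follows. This is a simplified batch version written for illustration, not the real-time decoder of ([10]). It relies on the fact that any two different dispensation orders share exactly one base per pair of di-bases (for example, AG and AC share only A). The function names `expand` and `decode` and the example encodings are hypothetical.

```python
def expand(encoding):
    """Turn [(di_base, count), ...] into one di-base label per sequence position."""
    labels = []
    for di_base, count in encoding:
        labels.extend([di_base] * count)
    return labels


def decode(encoding_1, encoding_2):
    """Recover the base sequence from two encodings made with different orders."""
    labels_1 = expand(encoding_1)
    labels_2 = expand(encoding_2)
    if len(labels_1) != len(labels_2):
        raise ValueError("the two runs do not cover the same number of bases")
    bases = []
    for first, second in zip(labels_1, labels_2):
        common = set(first) & set(second)
        # With two different dispensation orders the overlap is a single base.
        if len(common) != 1:
            raise ValueError("the two runs must use different dispensation orders")
        bases.append(common.pop())
    return "".join(bases)


# Made-up encodings of the same short sequence under AG/CT and AC/GT.
run_ag_ct = [("AG", 1), ("CT", 2), ("AG", 1), ("CT", 2), ("AG", 1)]
run_ac_gt = [("AC", 3), ("GT", 3), ("AC", 1)]
print(decode(run_ag_ct, run_ac_gt))  # ACCGTTA
```

The sketch assumes error-free integer peak heights. A miscounted homopolymer in either run would shift all later positions, which is one reason a real decoder would need additional checks.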
Generally, to identify pathogenic bacteria, partial species-specific nucleotide sequences (molecular markers), such as 16S rRNA ([11,12]), rnpB ([13]), gyrB ([14]) and hsp65 ([15]), can be targeted. Conventional pyrosequencing has been widely used to detect pathogenic bacteria by targeting these molecular markers. PDBA can likewise identify pathogenic bacteria by analysing species-specific nucleotide sequences in a single sequencing run. Here, the rnpB gene was used as an example. The procedure for identifying pathogenic bacteria with PDBA was as follows. First, the encodings of each species-specific sequence under each of the three di-base dispensation orders AG/CT, AT/CG and AC/GT were predicted with the in-house software, giving several sets of encodings for later comparison. Second, the PCR amplicons from the isolates were interrogated using di-base dispensation order AG/CT, AT/CG or AC/GT, and the PDBA pyrograms were translated into a set of encodings comprising the number of reaction cycles (encoding size) and the type and number of incorporated nucleotide(s) in each sequencing cycle. Third, the encoding size of each isolate was compared with that predicted by the software. If the encoding size of a queried isolate was unique, the isolate could be unambiguously differentiated. If the encoding sizes of the queried isolates were identical, the peak heights were compared cycle by cycle, moving from the first to the last cycle, until they differed. One point needs to be emphasized: for any set of isolates, we were able to find at least one order of di-base addition at which the encodings of each isolate were sequence-specific. In summary, by comparing each set of encodings with the predicted encodings, from the first to the last cycle, all the isolates could be identified.
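The comparison step of this procedure (encoding size first, then peak heights cycle by cycle) can be sketched in Python as below. This is an illustrative sketch rather than the software used in this study. The function name `identify`, the representation of an encoding as a plain list of integer peak heights, and the reference values are all hypothetical.

```python
def identify(observed, predicted):
    """Match observed peak heights against predicted encodings.

    observed  -- list of peak heights, one per sequencing cycle
    predicted -- dict mapping isolate name -> predicted list of peak heights
    Returns (sorted matching names, number of cycles that had to be compared).
    """
    # Step 1: keep only the references with the same encoding size.
    candidates = {name: enc for name, enc in predicted.items()
                  if len(enc) == len(observed)}
    cycles_compared = 0
    # Step 2: while several candidates remain, compare peak heights
    # from the first cycle onwards.
    for cycle, height in enumerate(observed):
        if len(candidates) <= 1:
            break
        candidates = {name: enc for name, enc in candidates.items()
                      if enc[cycle] == height}
        cycles_compared = cycle + 1
    return sorted(candidates), cycles_compared


# Hypothetical reference encodings (not values from Table 1).
predicted = {
    "isolate_1": [1, 1, 4, 2],
    "isolate_2": [1, 1, 2, 2],
    "isolate_3": [4, 1, 1],
}
print(identify([1, 1, 2, 2], predicted))  # (['isolate_2'], 3)
print(identify([4, 1, 1], predicted))     # (['isolate_3'], 0)
```

As in the text, comparison stops as soon as a single candidate remains, so the later cycles of that candidate are not re-checked. A stricter variant could verify the full encoding before reporting a match.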

The proof-of-principle experiment
To verify the feasibility of PDBA for the identification of pathogenic bacteria, partial rnpB gene sequences of thirteen Mycobacterium isolates (…, M. smegmatis) were downloaded from GenBank, and their PDBA encodings were predicted with the in-house software. For each isolate, we predicted PDBA results with the three di-base dispensation orders (AG/CT, AC/GT and AT/CG), and a total of thirty-nine sets of encodings was obtained. We discarded the di-base dispensation orders at which different strains would give the same set of encodings, which would make these strains indistinguishable. Therefore, the thirteen mycobacterial isolates were interrogated by cyclic dispensation of AT/CG. Table 1 shows their predicted encodings and encoding sizes. Each encoding contains the type and the number of incorporated nucleotide(s) in the reaction. For example, encoding CG 1 indicates that di-base CG was added into the sequencing reaction and one nucleotide, C or G, was incorporated into the template, producing a peak height equal to 1. By comparing the encoding size of each isolate from the pyrogram with the predicted encoding sizes, we found four mycobacterial isolates with unique encoding sizes (26, 15, 18, and 6). Comparison with the software predictions (the simulated pyrograms of these four isolates are not shown) showed that the encoding sizes of M. celatum, M. kansasii, M. xenopi, and M. intracellulare were 26, 15, 18 and 6, respectively. Therefore, the four mycobacterial isolates were successfully discriminated. The isolates with the same encoding size were grouped together and are highlighted in gray in Table 1. There were three such groups, with encoding sizes of 20, 16, and 8, respectively. The simulated PDBA pyrograms of the isolates in the three groups are illustrated in Fig. 1. For the first group, each isolate required twenty sequencing cycles to read through the partial rnpB sequence (Fig. 1A-C).
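The selection of a usable dispensation order and the grouping of isolates by encoding size can be sketched as follows. This is a minimal illustration under stated assumptions, not the in-house software: the helper names (`pdba_encode`, `usable_orders`, `group_by_size`) and the three short sequences are made up, and a zero-height first dispensation is counted as a cycle.

```python
from collections import defaultdict

ORDERS = {"AG/CT": ("AG", "CT"), "AC/GT": ("AC", "GT"), "AT/CG": ("AT", "CG")}


def pdba_encode(sequence, order):
    """Return [(di_base, count), ...] for one simulated PDBA run."""
    if not set(sequence) <= set("ACGT"):
        raise ValueError("sequence may contain only A, C, G and T")
    encoding, pos = [], 0
    while pos < len(sequence):
        di_base = order[len(encoding) % 2]  # alternate the two di-bases
        count = 0
        while pos < len(sequence) and sequence[pos] in di_base:
            count += 1
            pos += 1
        encoding.append((di_base, count))
    return encoding


def usable_orders(sequences):
    """Orders under which every isolate yields a distinct encoding."""
    usable = []
    for name, order in ORDERS.items():
        encodings = {tuple(pdba_encode(seq, order)) for seq in sequences.values()}
        if len(encodings) == len(sequences):
            usable.append(name)
    return usable


def group_by_size(sequences, order_name):
    """Group isolate names by encoding size under one dispensation order."""
    groups = defaultdict(list)
    for name, seq in sequences.items():
        groups[len(pdba_encode(seq, ORDERS[order_name]))].append(name)
    return dict(groups)


# Made-up toy sequences; x and y are indistinguishable under AG/CT.
toy = {"x": "AAG", "y": "GGA", "z": "ACGT"}
print(usable_orders(toy))           # ['AC/GT', 'AT/CG']
print(group_by_size(toy, "AT/CG"))  # {2: ['x'], 3: ['y', 'z']}
```

Isolates that end up alone in a size group are identified by encoding size alone. Groups with several members need the cycle-by-cycle comparison of peak heights.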
That is, the encoding size of each isolate in Fig. 1A-C was 20. However, the number of incorporated nucleotides in some sequencing cycles differed among the three isolates. By comparing the encodings from the pyrograms with those predicted by the software in each cycle, from the first to the last, all the isolates could be identified. For example, the number of incorporated nucleotides in the first cycle was 1 in Fig. 1A and B and 4 in Fig. 1C. According to the expected encodings from the software, M. gordonae incorporated 4 nucleotides in the first cycle of the pyrosequencing reaction. Thus, Fig. 1C represents the pyrogram of M. gordonae. In Fig. 1A and B, the numbers of incorporated nucleotides in the first and second cycles were the same, whereas the number in the third cycle was 4 and 2, respectively. According to the expected encodings, M. marinum and M. leprae incorporated 4 and 2 nucleotides, respectively, in the third cycle. Thus, Fig. 1A and B are the pyrosequencing results of M. marinum and M. leprae, respectively. That is, by using only three pyrosequencing cycles, the three mycobacterial isolates were successfully discriminated.
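The number of cycles needed to separate all members of a same-size group can be computed from the predicted encodings alone. The sketch below is illustrative: the function name `cycles_to_discriminate` is hypothetical, and the peak heights are placeholders that only mimic the pattern described above (two isolates identical for two cycles, a third differing in the first cycle). They are not the values of Table 1.

```python
def cycles_to_discriminate(encodings):
    """Smallest number of leading cycles after which all isolates differ.

    encodings -- dict mapping isolate name -> list of predicted peak heights
    Returns None if the full encodings never become pairwise distinct.
    """
    longest = max(len(heights) for heights in encodings.values())
    for n in range(1, longest + 1):
        prefixes = {tuple(heights[:n]) for heights in encodings.values()}
        if len(prefixes) == len(encodings):
            return n
    return None


# Placeholder peak heights for three isolates of one size group.
group = {
    "isolate_a": [1, 2, 4],
    "isolate_b": [1, 2, 2],
    "isolate_c": [4, 1, 1],
}
print(cycles_to_discriminate(group))  # 3
```

A value of `None` indicates that the chosen dispensation order cannot separate the group, in which case another order would have to be tried.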
The remaining two groups could be identified by analogy. For the second group, the encoding size of each isolate was 16 (Fig. 1D and E). The numbers of incorporated nucleotides in the first and second cycles were the same, whereas the number in the third cycle was 2 and 4, respectively. According to the expected encodings from the software, M. gastri and M. malmoense incorporated 2 and 4 nucleotides, respectively, in the third cycle.

PDBA for the identification of M. paratuberculosis and M. celatum
The rnpB gene consists of approximately 400 nucleotides; some of its regions vary greatly between species in both nucleotide composition and size, whereas others are conserved to retain enzymatic activity ([16]). Both the conserved and the hypervariable regions are useful for identifying pathogenic bacteria. In this study, the PDBA results for the partial rnpB gene of M. paratuberculosis and M. celatum under the three di-base dispensation orders AG/CT, AC/GT, and AT/CG were predicted by the software (Table 2). As shown, each set of encodings clearly discriminated the two isolates, so AC/GT was randomly chosen for use. In parallel, DNA from M. paratuberculosis and M. celatum was extracted, and amplicons containing partial sequences of the rnpB gene were obtained by PCR. The amplicons were then analyzed by PDBA with di-base dispensation order AC/GT (Fig. 2). The sequencing results were translated into a set of encodings, and the encoding at each peak was consistent with the prediction. As can be seen from Fig. 2A and B, each of the two queried isolates required a total of 14 cycles to read through the targeted sequence. However, the number of incorporated nucleotides (peak height) differed in some cycles. From the first to the third cycle, both M. paratuberculosis and M. celatum incorporated one nucleotide in each cycle. In the fourth cycle, however, the isolate in Fig. 2A incorporated 2 nucleotides whereas the isolate in Fig. 2B incorporated 4 nucleotides. According to the predicted encodings from the software, M. paratuberculosis and M. celatum incorporated 2 and 4 nucleotides, respectively, in the fourth cycle. Thus, Fig. 2A and B are the PDBA results of M. paratuberculosis and M. celatum, respectively. Therefore, by using only 4 cycles, the two mycobacterial isolates, M. paratuberculosis and M. celatum, were clearly discriminated.
We also targeted the partial rnpB gene sequences of M. paratuberculosis and M. celatum by traditional pyrosequencing (Fig. 2C and D). Conventional pyrosequencing directly shows the type and the number of incorporated nucleotides in each cycle in the pyrograms. As shown, to read through the partial rnpB gene sequences of the two mycobacterial isolates, conventional pyrosequencing required 35 cycles whereas PDBA required 14 cycles. Furthermore, the peak patterns of the two isolates were the same from the first to the ninth cycle and did not differ until the 10th cycle, which means that conventional pyrosequencing required at least 10 sequencing cycles to discriminate the two isolates. That is, when a given species-specific sequence was analyzed by both conventional pyrosequencing and PDBA, conventional pyrosequencing generally needed more cycles to read through it. In addition, because two nucleotides were added at a time, the signal intensities from PDBA were higher than those from traditional pyrosequencing. This is useful for identifying pathogenic bacteria at lower concentrations. When the concentration of isolates is low, the signal intensity becomes progressively lower as the sequencing primer is extended, resulting in more and more false signals. When two nucleotides were simultaneously added into each reaction cycle, the signal intensity in each cycle was amplified, making low-abundance sequences detectable. Notably, strong background could be seen in Fig. 2C and D, which might result from the low concentration of single-stranded PCR products. However, when the same amplicons were analyzed by PDBA, clear peaks with low background and high quality were observed (Fig. 2A and B).

Table 2. The predicted encodings of PDBA with the three sets of di-base dispensation order AG/CT, AC/GT, and AT/CG.
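The difference in cycle counts between the two chemistries can be illustrated with a short simulation. The sketch below is an assumption-laden illustration, not a reproduction of the 35 versus 14 cycles reported above: the function names are hypothetical, the sequence is made up, every dispensation (including those with no incorporation) is counted as a cycle, and the conventional run is assumed to dispense cyclically in the order C, T, G, A.

```python
def conventional_cycles(sequence, order="CTGA"):
    """Count single-base dispensations needed to read through `sequence`."""
    if not set(sequence) <= set(order):
        raise ValueError("sequence contains a base that is never dispensed")
    pos = cycles = 0
    while pos < len(sequence):
        base = order[cycles % len(order)]
        # A homopolymer of the dispensed base is incorporated in one cycle.
        while pos < len(sequence) and sequence[pos] == base:
            pos += 1
        cycles += 1
    return cycles


def pdba_cycles(sequence, order=("AC", "GT")):
    """Count di-base dispensations needed to read through `sequence`."""
    if not set(sequence) <= set(order[0]) | set(order[1]):
        raise ValueError("sequence contains a base that is never dispensed")
    pos = cycles = 0
    while pos < len(sequence):
        di_base = order[cycles % 2]
        # Extension continues while the next base is in the dispensed di-base.
        while pos < len(sequence) and sequence[pos] in di_base:
            pos += 1
        cycles += 1
    return cycles


example = "ACCGTTA"  # made-up sequence
print(conventional_cycles(example))  # 12
print(pdba_cycles(example))          # 3
```

Because each di-base dispensation incorporates at least as many nucleotides as a single-base dispensation, the average peak height per cycle is also higher in the PDBA simulation.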
An inherent problem of PDBA, also seen in conventional pyrosequencing, is the imperfect resolution of homopolymers, which decreases the reproducibility of pyrosequencing assays used for microbiological identification. However, this problem might be avoided by choosing a proper order of di-base dispensation. In PDBA, nucleotides can be dispensed in three forms: AG/CT, AC/GT and AT/CG. If homopolymers appeared when AG/CT was applied, we could most likely find, among AC/GT and AT/CG, one order of di-base dispensation that did not produce encodings with more than five identical incorporated nucleotides.
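Choosing a dispensation order that keeps peak heights small can be sketched as below. The code is illustrative only: the function names and the example sequence are hypothetical, and the limit of five incorporated nucleotides per cycle is taken from the text above. The maximum peak height under a given order equals the longest run of consecutive bases that belong to the same di-base.

```python
ORDERS = {"AG/CT": ("AG", "CT"), "AC/GT": ("AC", "GT"), "AT/CG": ("AT", "CG")}


def max_peak(sequence, order):
    """Largest number of nucleotides incorporated in a single PDBA cycle."""
    if not set(sequence) <= set("ACGT"):
        raise ValueError("sequence may contain only A, C, G and T")
    best = run = 0
    current = None
    for base in sequence:
        group = order[0] if base in order[0] else order[1]
        # Consecutive bases of the same di-base extend the current peak.
        run = run + 1 if group == current else 1
        current = group
        best = max(best, run)
    return best


def orders_within_limit(sequence, limit=5):
    """Dispensation orders whose peaks never exceed `limit` nucleotides."""
    return [name for name, order in ORDERS.items()
            if max_peak(sequence, order) <= limit]


example = "AAAGGGTC"  # made-up sequence with a long purine run
print({name: max_peak(example, order) for name, order in ORDERS.items()})
# {'AG/CT': 6, 'AC/GT': 4, 'AT/CG': 3}
print(orders_within_limit(example))  # ['AC/GT', 'AT/CG']
```

An order selected this way would still have to give distinct encodings for all isolates of interest before it could be used for identification.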

Conclusions
We have described a novel method, pyrosequencing with di-base addition (PDBA), for the identification of pathogenic bacteria, which is conducted by adding di-bases into the sequencing reactions. Sequencing results were translated into a set of encodings containing the type and number of incorporated nucleotide(s). The number of incorporated nucleotide(s) in each cycle and the encoding size of each isolate were highly sequence-specific. Thus, by comparing the encoding size of each isolate and the number of incorporated nucleotide(s) in each cycle, moving from first to last, various kinds of bacteria could be unambiguously identified. We also simulated PDBA results from thirteen isolates of Mycobacterium species. By comparing the encoding size of each isolate and the number of incorporated nucleotide(s) in each cycle, the thirteen Mycobacterium isolates were successfully differentiated. To verify its applicability to cultured isolates, the partial rnpB genes of M. paratuberculosis and M. celatum were analyzed, and the two isolates were clearly discriminated. The method required fewer sequencing cycles than conventional pyrosequencing to read through a given sequence. Thus, it is expected to reduce the cost of bacterial identification and improve the accuracy of the analysis.