Prediction of off-target effects of the CRISPR/Cas9 system for design of sgRNA

CRISPR/Cas9 genome editing technology is at the frontier of life science research. It has been used to cure human genetic diseases, enable personalized cell therapies, develop new drugs, and improve the genetic characteristics of crops, among other applications. The system relies on the Cas9 enzyme cutting target DNA (on-target) under the guidance of an sgRNA, but it can also cut non-target sites, producing off-target effects and thus uncontrolled mutations. The risk of off-target effects is the main factor limiting the widespread application of CRISPR technology, so evaluating and reducing off-target effects is an urgent problem. In this work, we build a model that predicts an off-target score. By comparing candidates against the complete genome of the target organism and calculating the potential off-target risk mathematically, we optimize the sgRNA to reduce the off-target effect. The results show that our model can efficiently and quickly identify and screen the best editing target sites. Although the CRISPR/Cas9 system has not yet been perfected, it has already demonstrated its potential in genome editing. We hope that our model will help the CRISPR/Cas9 system be applied quickly to more branches of life science and to diseases that were previously incurable.


INTRODUCTION
Targeted genome editing has become an increasingly popular field in bioengineering as more editing methods have been discovered and applied to a wider range of organisms [1][2][3]. One method, the clustered regularly interspaced short palindromic repeats (CRISPR)/CRISPR-associated protein 9 (Cas9) system, has recently been discovered and applied to edit the genomes of various organisms [4,5]. It is the third generation of nuclease-based gene editing tools, following zinc-finger nucleases (ZFNs) and transcription activator-like effector nucleases (TALENs) [3]. The CRISPR/Cas9 system was initially discovered as an "immune system" of bacteria and archaea, which use it to fight viruses that inject DNA into them [6][7][8][9]. A prominent advantage of the system is that it only requires designing a single-guide RNA (sgRNA), whose guide sequence is usually 20 nucleotides long, to lead the Cas9 enzyme to the site where the DNA sequence needs to be modified [10][11][12][13][14]. Targeting also depends on a specific sequence next to the target, called the protospacer adjacent motif (PAM), which is crucial in locating the gene segment to be modified [15][16][17][18]. Thus, the CRISPR/Cas9 system can in principle target any locus in any given genome [19][20][21][22]. The method has become widely used because of its accuracy, relatively low cost, and convenience [23][24][25][26]. Other potential uses include killing viruses and cancer cells [27][28][29][30]. However, the method has not been perfected: limiting the guide to 20 nucleotides exposes it to off-target activity [31][32][33][34]. Because the guide contains only 20 nucleotides, the same or a similar sequence may occur elsewhere in the genome, so the Cas9 enzyme can be guided to the wrong locus, which in turn leads to mutations [35][36][37][38][39].
Therefore, we examine the CRISPR/Cas9 system in depth and analyze statistical data from experiments to build a relatively simple model that scores mismatches, with the goal of reducing the off-target effect.

Methods
1. Design single-guide RNA sequences that guide Cas9 to the targeted genome (Figure 1).
2. Compare the targeted genome with the database of gene sequences.
3. Identify the DNA sequences in the targeted genome that repeat the targeted gene segment.
4. Identify the positions of the repeating bases.
5. Input the positions into the designated formula.
6. For multiple repeating strands of DNA, we take the strand which has the highest score as the final mismatch score.
7. After all score values have been obtained, the sgRNA sequence with the highest score value is selected.

Fig. 1. Workflow of our model. The input sequence is scanned to identify sgRNA target sites according to the parameters specified. For each candidate target site, the potential off-target sites are determined using Bowtie1. For each potential off-target site, a score is computed. With this information, each candidate is ranked, and finally the results are provided.
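Steps 1 to 3 of the workflow can be illustrated with a minimal Python sketch. It rests on several assumptions that the text does not state: an NGG PAM immediately 3' of a 20-nucleotide target, a forward-strand search only, and a naive mismatch-counting scan standing in for the Bowtie1 alignment. The function names `find_candidates` and `find_similar_sites` and the `max_mismatches` parameter are hypothetical.

```python
import re


def find_candidates(seq, guide_len=20):
    """Return (start, target, pam) for every target site followed by an NGG PAM.

    Assumption: NGG PAM, forward strand only.
    """
    seq = seq.upper()
    # The lookahead allows overlapping sites to be reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]


def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def find_similar_sites(guide, genome, max_mismatches=4):
    """Naive scan for genome sites that resemble the guide and carry an NGG PAM.

    Illustrative stand-in for an aligner; returns (start, site, n_mismatches).
    A site with 0 mismatches is the intended target itself.
    """
    guide = guide.upper()
    genome = genome.upper()
    n = len(guide)
    hits = []
    for i in range(len(genome) - n - 2):
        site = genome[i:i + n]
        has_pam = genome[i + n + 1:i + n + 3] == "GG"
        if has_pam and hamming(guide, site) <= max_mismatches:
            hits.append((i, site, hamming(guide, site)))
    return hits
```

A scan like this is quadratic in practice and only suitable for short test sequences; a genome-scale search needs an indexed aligner.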

Results and discussion
From any provided DNA sequence, such as that of green fluorescent protein (GFP), all sgRNA target sites are identified according to parameters such as the type of PAM (Figure 2). For off-target predictions, the PAM type should be selected separately (Figure 2). Experimental evidence indicates that Cas9 nuclease activity correlates strongly with the mismatch position. Mismatches close to the PAM site will most likely abolish the introduction of a double-strand break, while more distant mismatches are tolerated [40,41]. We incorporate these findings as a parameter and, based on the experimental data, build a mathematical model to calculate the probability of editing success when different bases are mismatched at different positions. Before applying the model, we search the database with the target DNA sequence to identify similar sequences that exist in the genome. We then design the sgRNAs so that we can compare the off-target scores of the different sgRNA candidates. To calculate the off-target effect, we define the position of each base of the sgRNA. The base farthest from the PAM sequence is defined as position 20, and the position number decreases as the base gets closer to the PAM, counted in bases. Defining the positions in this way lets us use the fact that the possibility of being off-target increases significantly when a similar sequence exists near the PAM. From the experimental data (Supporting tables 1-2) relating percent activity to the position of the nucleotide bases, graphs can be generated (Supporting tables 3-6) [42]. Although the data cover only 20 positions, this is enough information to plot a graph (Figure 3). The data were entered into Excel to generate line graphs. The y-axis shows the percent activity of the sgRNA for each mismatched base, and the x-axis represents the position.
Four graphs were derived in Excel, one for each base. It is worth noting that the percent activity of the sgRNA generally decreases as the mismatch position gets closer to the PAM. This supports our hypothesis that a mismatched base affects the off-target rate most when it is close to the PAM, and that its effect weakens as it lies farther from the PAM sequence.
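The position convention described above can be made concrete with a short Python sketch. It assumes the guide and the candidate site are both written 5' to 3' with the PAM immediately after the last base, so that the last base has position 1 and the first base has position 20. The function name `mismatch_positions` is hypothetical, and recording the base found in the genomic site (rather than the sgRNA base) is also an assumption, since the text does not specify which of the two is meant.

```python
def mismatch_positions(guide, site):
    """List mismatches between a guide and a candidate site as (position, base).

    Position 1 is the base adjacent to the PAM; position len(guide) is the
    base farthest from it. The recorded base is the one found in the site
    (an assumption made for this sketch).
    """
    if len(guide) != len(site):
        raise ValueError("guide and site must have the same length")
    n = len(guide)
    mismatches = []
    for index, (g, s) in enumerate(zip(guide.upper(), site.upper())):
        if g != s:
            # index 0 is the 5' end, which is farthest from the PAM.
            mismatches.append((n - index, s))
    return mismatches
```

For a 20-nucleotide guide, a mismatch in the first base is reported at position 20 and a mismatch in the last base at position 1.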
To fit these trends, we calculated the linear formulas $Y_A = -0.0052X + 0.4617$, $Y_T = -0.0075X + 0.7642$, $Y_G = -0.0319X + 0.8741$, and $Y_C = -0.0144X + 0.7698$ (Figure 3 b)-e), Supporting tables 3-6). Here Y is the activity of the candidate site and X is the position of the mismatch, so Y assigns an activity value to each mismatched base at each position. Because many cases involve more than one mismatched base, the activity $P_i$ of a candidate site is the product of the Y values of all its mismatched bases, for example $P_i = Y_A \cdot Y_T \cdot Y_G \cdot Y_C$. If a candidate has more than one mismatched strand, $P_{Ti} = \min(P_i)$. Finally, we output $\max(P_{Ti})$; the corresponding candidate is the best one and has the largest activity. The final score should therefore be as high as possible, meaning a higher percent activity, to ensure that the sgRNA is safer to apply.
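As a minimal sketch, the scoring rule can be written in Python using the four fitted lines exactly as printed above. The data layout is an assumption made for illustration: each site is a list of (position, base) mismatches, and each candidate sgRNA maps to a list of such sites. The names `base_activity`, `site_activity`, `candidate_score`, and `best_candidate` are hypothetical.

```python
from math import prod

# (slope, intercept) of Y = slope * X + intercept for each mismatched base,
# copied from the fitted formulas in the text. The slopes are negative,
# so Y decreases as X increases.
COEFFS = {
    "A": (-0.0052, 0.4617),
    "T": (-0.0075, 0.7642),
    "G": (-0.0319, 0.8741),
    "C": (-0.0144, 0.7698),
}


def base_activity(base, position):
    """Y value for one mismatched base at position X (1 to 20)."""
    slope, intercept = COEFFS[base]
    return slope * position + intercept


def site_activity(mismatches):
    """P_i: product of the Y values over all mismatches of one site."""
    return prod(base_activity(base, pos) for pos, base in mismatches)


def candidate_score(sites):
    """P_Ti: minimum P_i over all sites of one candidate (needs >= 1 site)."""
    return min(site_activity(m) for m in sites)


def best_candidate(candidates):
    """Return the candidate name with the largest P_Ti."""
    return max(candidates, key=lambda name: candidate_score(candidates[name]))


candidates = {
    "g1": [[(10, "A")], [(1, "G")]],  # two sites, one mismatch each
    "g2": [[(20, "G")]],              # one site, one mismatch
}
print(best_candidate(candidates))
```

The sketch takes the minimum across sites and the maximum across candidates because that is what the formulas state; a candidate with no similar sites is not covered by the formulas and raises an error here.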
After processing, a results page (Figure 4) is displayed. It contains the input parameters, the query sequence with the identified sgRNA target sites, and a full list of all candidates, ranked by taking into account the total number of off-target sites. Our model provides all the information necessary to swiftly identify the best candidate sgRNA, as reflected in the order of the sgRNA target sites. Detailed information is provided for each possible off-target site: genomic coordinates, target sequence with highlighted mismatches, and its corresponding name.

Conclusion
By building this model, we can predict the off-target effects of the CRISPR/Cas9 system in the design of sgRNA. Knowing the off-target effects of each sgRNA tells us whether that sgRNA is applicable or not. The model can eliminate sgRNAs that have high off-target rates and allow us to select those with lower off-target rates. It can improve the efficiency of the sgRNA design process and reduce the probability of failure. The model is also simple: all that is needed is to calculate the mismatch score of each base with the fitted function and multiply the scores of the mismatched bases. The final result is a direct indication of the applicability of the sgRNA.
Through this model, we wish to widen the applicability of the CRISPR/Cas9 system, a technique with huge potential. For instance, the first genetically modified human babies were reportedly edited with CRISPR with the aim of making them resistant to HIV, an infection that currently has no cure. Despite the ethical issues, the CRISPR technique has shown itself to be an advanced gene-modifying method that may be widely used in the future to cure more diseases without raising ethical concerns. We hope that our model can contribute to CRISPR becoming a worldwide technique with high efficiency and accuracy by reducing off-target effects through deliberate selection of sgRNA.

E3S Web of Conferences 185, 04018 (2020) ICEEB 2020 http://doi.org/10.1051/e3sconf/202018504018