blastn Parameters for Noncoding Queries

blastn Parameters for noncoding queries

If one wants to look for a conserved noncoding RNA in a new genome using the best possible tools, then one should use sophisticated structure-based methods such as Klein and Eddy's RSEARCH ( BMC Bioinformatics 4:44 , PubMed), and one should always consult the RNA database Rfam (2002, PubMed; Griffths-Jones et al. 2005: Rfam: annotating non-coding RNAs in complete genomes, PubMed). However, alignment tools such as blast or fasta are more readily available and it is often expedient to use alignment when other tools would do better. In this case, one must bear in mind that the parameters appropriate for searching distant genomic sequence using a small noncoding RNA (or any other short noncoding sequence) as a query are very different from those appropriate for most uses of blast, and the default parameters are completely inappropriate. I confronted this problem at the Drosophila genome jamboree in 1999 and published the parameters I used there in the paper I wrote with Helen Salz (J. Biol. Chem.; PubMed).

Blastn parameters. This is not the place to introduce or thoroughly discuss parameters for scoring alignments in detail. I refer readers to the NCBI tutorial or an appropriate text. I recommend the book "Bioinformatics" by David Mount (no relation; Amazon or Cold Spring Harbor Press) and the O'Reilly book on blast (Amazon O'Reilly). However, a brief review is necessary. The key parameters for blastn are r and q, the reward and penalty for matches and mismatches. Simple arithmetic dictates that alignments with 40% identity (which can be significant) will only have a positive score if the absolute value of r is at least 1.5x the value of q. However, it is essential that random sequences not receive a positive score, and the number of false positives climbs rapidly whenever r exceeds q (depending, of course, on the query and the genome). That is the essential tradeoff. Similar issues involve the gap parameters G and E. G is the cost to open a gap and E is the cost to extend a gap. The value of E should usually be at least twice the value of r (the exceptions being when -q is twice r or E is greater than half the value of G). If you are confident that your alignment will not have gaps, then you can set G very high, but more usually E is set at about 40% of the value of G (G:E = 2.5). Higher values of E (lower G:E) may be appropriate when dealing with small RNAs with conserved structure and overall size. This way, insertions of one or two bases are permitted but larger gaps are not. As G and E are lowered with repect to r, larger gaps are permitted. In many cases this will allow alignments to be extended to the entire RNA, but scores obtained by false positives will climb in relation to true scores. In addition to using the parameter sets recommended here, users will probably want to make adjustments appropriate to their specific query. At the Drosophila genome jamboree in 1998 I used five parameter sets (A: -r 10 -q -11 -W 8 -G 100 -E 50 || B: -r 10 -q -11 -W 7 -G 5 -E 20 || C: -r 10 -q -11 -W 7 -G 15 -E 4 || D: -r 7 -q -14 -W 7 -G 7 -E 3 || E: -r 4 -q -5 -W 8 -G 10 -E 2).

Chau Nguyen (a Computer Science and Cell and Molecular Biology double major at the University of Maryland) investigated this in some detail as research project in my laboratory during the fall of 2003. Chau used small nuclear RNAs (human/Drosophila) as a test case, and took the ratio between the E score assigned to the known true hit and the E score assigned to the top false hit as her objective function. She found that the following parameters were nearly optimal for finding fly snRNAs using human snRNAs as queries: "-r 8 -q -5 -G 20 -E 5" She then varied each parameter separately. (Actually, because the absolute value of the parameters is meaningless, only three paramters need to be varied: r/q, r/G and G/E. From her experiments, it appears that the optimal value of r is 7 or 8; of q is -7 or -8; of G is 15 or 16 and the optimal value of E is 1 or 2. She felt, however, that this value for E is a bit too low, as it represents a G:E ratio of between 8:1 and 16:1, which is rather high.

Step-by-step instructions . These are written for use on the NCBI blastn server; changes are made in the "Options for advanced blasting" section of the input page. Of course, these parameters can also be used when running blast locally.

1) Disable the low complexity filter. Your query should be short and carefully designed. You should have already removed any low complexity motifs that might cause a problem.

2) Change W ("Word size") to 7. You will never find a match that does not include a perfect match of at least W. 7 is the smallest value available at NCBI. Many distant small RNA homologs do not have a match of this length, so it would be best to use a word size of 5 or 6 but this option is not available on NCBI. Smaller word sizes greatly increase runtime.

3) Write "-r 5 -q -4 -G 10 -E 6 " on "other advanced options." These values (parameter set F) will not only find mammalian U6atac using plant U6atac but will also provide an alignment across the entire snRNA. If you don't find what you want, you may want to make adjustments based on the logic of the forgoing discussion.

4) You may also want to adjust the value of "Expect" upwards from 10 to 100. This is OK because if you are looking for distant matches to noncoding sequences you must be able to evaluate the significance of your results using criteria other than alignment statistics. Typically these criteria would be (for an snRNA) the conservation of nucleotides that are known to be invariant and secondary structure. For noncoding DNA sequences, synteny would be the best external criterion.

5) Try some other parameter sets. I used to recommend the following.

-r 8 -q -5 -G 20 -E 5 -W 7 These (parameter set G) were Chau's starting point. However, they are pretty promiscuous, so expect many false hits. -r 7 -q -8 -G 15 -E 2 -W 7 (parameter set H)

However, as of the end of 2005 NCBI only allows certain very limited parameters to be used (Blast news 2005/12/06). The trade-off is that the E values are more meaningful.

The relevant options under the new constraints are probably these:

-r 5 -q -4 -G 10 -E 6; -r 5 -q -4 -G 8 -E 6; -r 5 -q -4 -G 25 -E 10 and
-r 4 -q -5 -G 3 -E 5; -r 4 -q -5 -G 6 -E 5; -r 4 -q -5 -G 12 -E 8

These searches require more computing resources than standard blast searches and it will generally take longer than the estimated time for your results to come back. For related reasons, you should not attempt to use these parameters for queries longer than about 500 bp.. In addition, because the assumptions that go into calculating E values are violated by these parameters, the E values reported in your output are meaningless (except as relative numbers; better matches will still have lower E values). This last point may have been addressed by NCBI recently for a small number of specific parameter sets (Blast news 2005/12/06), but it is still a good idea to avoid relying on the E values (except when comparing results obtained with the same parameter set) and I do not recommend reporting them. The lack of reliable E values is not license to believe nonsense; your results should be validated by external criteria as described above.

Good luck! If you have experience that bears on this, or can cite relevant literature, please let me know and I'll update this posting.

Email comments

SteveMount.com Science home

SteveMount.org

Copyright 2005 Steve Mount and Minh-Chau L. Nguyen. This is Posting 4.
Please cite http://www.SteveMount.com/Posting0004.html This posting may be revised in the future but this version will remain available at http://www.SteveMount.com/Posting0004.06.html